In general, the reader is encouraged to go through the excellent documentation of the original OneMap package before going through this vignette. An up-to-date version can be found here. The majority of the pipeline still works the same or very similar to the implementations in OneMap (also internally). For those already familiar with the original or those looking for a quick summary, feel free to go on.
NOTE: BatchMap has been written specifically for use in outcrossing species. All OneMap functionality pertaining to back-crosses, f2, ril etc. has been removed for the sake of easier code maintenance. If your use case is not an outcrossing F1 population, turn back now (and use OneMap instead).
Disclaimer: Due to the processing times being rather long for a tutorial the results of record.parallel
and map.overlapping.batches
are cached. Since there are some random factors involved in the map creation, you might get slightly different results should you choose to run this yourself. I could have used a small toy dataset, but I wanted to show this use case on real (well… simulated) data of at least two hundred markers per LG. Now on to the good part.
BatchMap keeps with the paradigm and format of the original OneMap data format, but includes a faster function for reading the input file read.outcross2
. Further, BatchMap ignores all lines following the marker definitions (e.g. phenotypes) as all exploration beyond the construction of the linkage map is not intended to be handled by this package.
suppressPackageStartupMessages(library(BatchMap))
input_file <- system.file("example/sim7.5k.txt.gz",package = "BatchMap")
outcross <- read.outcross2(input_file)
## Reading data...
## 0% 100%
## [----------------------------------------]
## [########################################]
outcross
## This is an object of class 'outcross'
## No. individuals: 800
## No. markers: 2368
## Segregation types:
## B3.7: 639
## D1.10: 843
## D2.15: 886
## No. traits: 0
High density marker data often has bins of identical markers, which cause problems when estimating recombination fractions, and can in the case of the BatchMap approach make the resulting map worse. OneMap provides functions to detect and resolve such bins. Note the exact
option to find.bins()
, which controls wether missing information should be considered when binning data:
bins <- find.bins(outcross, exact = FALSE)
outcross_clean <- create.data.bins(outcross, bins)
outcross_clean
## This is an object of class 'outcross'
## No. individuals: 800
## No. markers: 1890
## Segregation types:
## B3.7: 581
## D1.10: 628
## D2.15: 681
## No. traits: 0
Note the difference in the number of markers.
The function rf.2pts()
calculates the twopoint table for markers. Note that with very high density datasets, a lot of RAM can be required to hold the twopoint table. As a general rule, this datastructure will require \(M * M * 32\) bytes, where \(M\) is the number of markers. In our case, with a small dataset of 1890 markers, we’ll need about 109Mb. A large dataset of 20,000 markers will need >48Gb. This would be typically run on a server machine (e.g. see some cloud server providers).
twopt_table <- rf.2pts(outcross_clean)
## Computing 1785105 recombination fractions:
##
## 0% 100%
## [----------------------------------------]
## [########################################]
# Check the size
format(object.size(twopt_table),units = "Mb")
## [1] "109.8 Mb"
In order to separate the data into linkage groups, we use the group()
function:
linkage_groups <- group(make.seq(input.obj = twopt_table, "all"),
LOD = 12)
## Selecting markers:
## group 1
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ......................................................
## group 2
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ......................................................
## group 3
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ...................
## group 4
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## .......................................................
## group 5
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................................................
## ............................
In order to calculate a map for each parent and then join them afterwards, we provide a function pseudo.testcross.split()
, that creates a list of testcrosses. Each list element corresponds to a linkage group and a sequence for markers of type “D1.10” and one for markers of type “D2.15”. Both include all markers of other types.
testcrosses <- pseudo.testcross.split(linkage_groups)
testcrosses$LG1.d2.15
##
## Number of markers: 273
## Markers in the sequence:
## M2 M3 M5 M8 M16 M18 M21 M27 M34 M36 M37 M40 M42 M45 M46 M50 M53 M52 M54
## M55 M67 M68 M70 M72 M77 M86 M89 M92 M95 M106 M114 M127 M134 M136 M138 M139
## M142 M144 M145 M151 M152 M153 M161 M192 M199 M201 M233 M235 M236 M242 M245
## M250 M253 M256 M263 M270 M271 M276 M277 M279 M282 M287 M288 M293 M305 M316
## M317 M322 M323 M330 M329 M331 M333 M336 M342 M344 M354 M355 M367 M372 M387
## M389 M394 M399 M406 M416 M429 M426 M438 M447 M450 M463 M460 M462 M464 M477
## M487 M494 M497 M501 M512 M514 M517 M518 M521 M522 M526 M528 M530 M532 M537
## M552 M553 M565 M566 M576 M579 M590 M599 M602 M606 M608 M612 M613 M622 M624
## M657 M667 M682 M683 M686 M690 M694 M695 M710 M711 M712 M717 M727 M731 M736
## M739 M741 M758 M764 M768 M770 M772 M785 M781 M789 M800 M803 M809 M816 M817
## M821 M825 M830 M831 M837 M838 M848 M850 M865 M866 M873 M876 M882 M884 M886
## M887 M891 M893 M895 M899 M908 M919 M920 M930 M940 M942 M946 M947 M962 M963
## M966 M968 M970 M982 M985 M987 M989 M990 M993 M1008 M1010 M1018 M1028 M1036
## M1035 M1058 M1060 M1061 M1072 M1074 M1076 M1082 M1108 M1109 M1110 M1112
## M1116 M1118 M1120 M1121 M1122 M1124 M1129 M1131 M1132 M1157 M1162 M1177
## M1186 M1200 M1205 M1206 M1237 M1241 M1247 M1255 M1260 M1261 M1267 M1268
## M1271 M1273 M1276 M1278 M1283 M1284 M1291 M1298 M1306 M1314 M1315 M1318
## M1323 M1333 M1334 M1336 M1337 M1339 M1341 M1355 M1359 M1360 M1362 M1366
## M1367 M1410 M1418 M1426 M1430 M1451 M1454 M1457 M1459 M1461 M1467 M1492
## M1494
##
## Parameters not estimated.
Before the map is calculated using the EM model, the sequences need to be ordered by a heuristic. The RECORD algorithm usually performs very well and has desireable characteristics, which make it trivial to parallelize. We use the function record.parallel()
, which takes a sequence
as input and we replicate RECORD 10 times (see the times
argument). We then pick the best of those replicates as our final order. Note that it is rare for times > 10
to yield any significant improvement. Finally, the cores
argument defines how many of those RECORD replicates we can process in parallel. Set this to your computers number of CPUs (or maximally the number of the times
argument).
# The result of this function is cached
ordered_sequences <- lapply(testcrosses, record.parallel, times = 10, cores = 1)
With the sequences neatly ordered, we can now go ahead with creating BatchMaps. For this, we define an overall batch size as well as an overlap size and let the function pick.batch.sizes()
decide on the final size in order to split batches evenly. The around
argument to the function defines how much smaller or larger the batch size is allowed to be in order to create evenly sized batches. We will work with linkage group 1 from here on to save time:
LG1_d1.10 <- ordered_sequences$LG1.d1.10
LG1_d2.15 <- ordered_sequences$LG1.d2.15
batch_size_LG1_d1.10 <- pick.batch.sizes(LG1_d1.10,
size = 50,
overlap = 30,
around = 10)
batch_size_LG1_d2.15 <- pick.batch.sizes(LG1_d2.15,
size = 50,
overlap = 30,
around = 10)
c(batch_size_LG1_d1.10, batch_size_LG1_d2.15)
## [1] 44 53
Now all that’s left to do is to call map.overlapping.batches()
. This function has a great deal of options. For now, take away that phase.cores
controls the number of parallel threads used to estimate the correct linkage phase between a pair of markers. As there are no more than four possible phases, this should never exceed four. The size
and overlap
arguments should match the output of pick.batch.sizes()
with the given overlap. The verbosity
option can be set to output different types of progress reports.
# The result of this function is cached
map_LG1_d1.10 <- map.overlapping.batches(input.seq = LG1_d1.10,
size = batch_size_LG1_d1.10,
phase.cores = 4,
overlap = 30)
The result of map.overlapping.batches()
has a data member $Map
, which corresponds to the final map:
map_LG1_d1.10$Map
##
## Printing map:
##
## Markers Position Parent 1 Parent 2
##
## 414 M1494 0.00 a | | b a | | b
## 413 M1492 0.14 a | | b a | | b
## 412 M1488 0.33 b | | a a | | a
## 411 M1486 0.49 b | | a a | | a
## 410 M1480 0.72 a | | b a | | a
## 409 M1478 0.95 b | | a a | | a
## 407 M1470 1.40 b | | a a | | a
## 408 M1475 1.40 a | | b a | | a
## 404 M1461 2.21 b | | a b | | a
## 405 M1463 2.29 a | | b a | | a
## 399 M1449 3.10 a | | b a | | a
## 398 M1444 3.47 b | | a a | | a
## 394 M1420 4.94 b | | a a | | a
## 396 M1426 4.94 b | | a a | | b
## 395 M1421 5.08 a | | b a | | a
## 392 M1413 5.64 b | | a a | | a
## 391 M1410 5.92 a | | b b | | a
## 390 M1409 5.92 b | | a a | | a
## 389 M1408 6.11 a | | b a | | a
## 388 M1407 6.28 a | | b a | | a
## 387 M1399 6.39 b | | a a | | a
## 386 M1396 6.55 b | | a a | | a
## 385 M1371 8.36 a | | b a | | a
## 384 M1368 8.51 b | | a a | | a
## 381 M1365 8.63 b | | a a | | a
## 378 M1359 8.86 b | | a a | | b
## 377 M1355 9.38 b | | a a | | b
## 376 M1354 9.59 b | | a a | | a
## 374 M1341 10.03 b | | a a | | b
## 375 M1349 10.03 b | | a a | | a
## 369 M1333 11.00 a | | b a | | b
## 371 M1336 11.06 b | | a b | | a
## 372 M1337 11.08 b | | a a | | b
## 373 M1339 11.33 a | | b b | | a
## 368 M1328 11.63 a | | b a | | a
## 364 M1314 12.98 b | | a b | | a
## 363 M1306 13.96 b | | a a | | b
## 362 M1304 14.11 b | | a a | | a
## 361 M1302 14.16 a | | b a | | a
## 360 M1298 14.30 b | | a b | | a
## 359 M1295 14.61 a | | b a | | a
## 357 M1285 14.79 b | | a a | | a
## 356 M1284 14.79 a | | b b | | a
## 354 M1278 15.00 b | | a a | | b
## 353 M1276 15.09 a | | b b | | a
## 351 M1271 15.26 a | | b b | | a
## 352 M1273 15.26 b | | a a | | b
## 349 M1267 15.61 a | | b b | | a
## 350 M1268 15.61 b | | a a | | b
## 348 M1263 15.93 a | | b a | | a
## 346 M1260 16.30 a | | b a | | b
## 347 M1261 16.30 a | | b a | | b
## 343 M1242 16.69 b | | a a | | a
## 341 M1235 16.92 a | | b a | | a
## 339 M1210 18.88 a | | b a | | a
## 336 M1200 19.33 b | | a b | | a
## 338 M1206 19.64 a | | b a | | b
## 337 M1205 19.64 b | | a b | | a
## 335 M1199 19.86 b | | a a | | a
## 334 M1189 20.20 a | | b a | | a
## 333 M1186 20.59 b | | a b | | a
## 331 M1174 21.75 b | | a a | | a
## 328 M1148 22.62 b | | a a | | a
## 327 M1146 22.62 b | | a a | | a
## 326 M1136 23.40 b | | a a | | a
## 320 M1124 23.80 a | | b a | | b
## 321 M1125 23.94 a | | b a | | a
## 324 M1131 23.94 b | | a a | | b
## 322 M1134 23.94 b | | a a | | a
## 319 M1122 24.28 a | | b a | | b
## 318 M1121 24.28 a | | b b | | a
## 316 M1118 24.35 b | | a a | | b
## 310 M1107 24.63 a | | b a | | a
## 311 M1108 24.63 b | | a a | | b
## 313 M1110 24.66 b | | a a | | b
## 314 M1112 24.66 b | | a a | | b
## 312 M1109 24.66 a | | b b | | a
## 308 M1081 25.62 a | | b a | | a
## 307 M1080 25.62 b | | a a | | a
## 305 M1074 25.92 b | | a b | | a
## 306 M1076 25.92 a | | b a | | b
## 303 M1070 26.59 a | | b a | | a
## 302 M1065 27.16 a | | b a | | a
## 301 M1061 27.71 a | | b a | | b
## 298 M1047 28.81 b | | a a | | a
## 296 M1035 29.95 a | | b b | | a
## 295 M1034 29.95 a | | b a | | a
## 297 M1038 29.95 b | | a a | | a
## 292 M1020 30.80 a | | b a | | a
## 288 M1002 31.91 a | | b a | | a
## 284 M988 32.34 a | | b a | | a
## 282 M985 32.34 a | | b a | | b
## 285 M989 32.42 a | | b a | | b
## 278 M967 34.00 b | | a a | | a
## 275 M963 34.25 a | | b a | | b
## 276 M965 34.25 b | | a a | | a
## 277 M966 34.43 b | | a b | | a
## 266 M929 35.60 a | | b a | | a
## 273 M955 36.02 a | | b a | | a
## 270 M942 36.22 b | | a a | | b
## 268 M933 36.46 a | | b a | | a
## 271 M946 36.51 a | | b b | | a
## 265 M926 37.02 a | | b a | | a
## 264 M924 37.40 a | | b a | | a
## 263 M920 37.68 b | | a b | | a
## 261 M912 37.94 a | | b a | | a
## 259 M904 38.58 a | | b a | | a
## 257 M895 38.97 b | | a a | | b
## 253 M889 39.31 a | | b a | | a
## 254 M891 39.32 b | | a b | | a
## 255 M892 39.32 b | | a a | | a
## 250 M884 40.01 b | | a b | | a
## 249 M882 40.01 a | | b a | | b
## 248 M876 40.30 b | | a b | | a
## 247 M873 40.45 a | | b a | | b
## 245 M865 40.76 a | | b a | | b
## 244 M855 41.04 a | | b a | | a
## 241 M846 42.16 b | | a a | | a
## 242 M848 42.16 b | | a b | | a
## 235 M827 43.80 a | | b a | | a
## 238 M832 43.80 b | | a a | | a
## 234 M825 43.80 a | | b b | | a
## 232 M819 44.18 b | | a a | | a
## 231 M817 44.45 b | | a a | | b
## 229 M815 44.81 a | | b a | | a
## 227 M808 45.03 a | | b a | | a
## 224 M794 45.61 b | | a a | | a
## 223 M790 45.79 b | | a a | | a
## 220 M780 47.13 b | | a a | | a
## 218 M772 47.67 b | | a a | | b
## 216 M768 47.76 a | | b a | | b
## 214 M764 47.89 a | | b a | | b
## 215 M765 47.89 a | | b a | | a
## 217 M770 47.89 a | | b b | | a
## 212 M745 48.65 b | | a a | | a
## 213 M758 48.66 a | | b b | | a
## 210 M740 49.01 a | | b a | | a
## 209 M739 49.01 a | | b a | | b
## 208 M738 49.38 b | | a a | | a
## 206 M733 49.56 a | | b a | | a
## 205 M731 49.70 a | | b b | | a
## 203 M719 50.18 a | | b a | | a
## 199 M710 50.47 a | | b a | | b
## 198 M709 50.47 a | | b a | | a
## 201 M712 50.47 b | | a b | | a
## 197 M702 50.63 a | | b a | | a
## 196 M697 51.08 b | | a a | | a
## 194 M694 51.29 a | | b a | | b
## 189 M683 51.96 a | | b a | | b
## 192 M688 51.98 a | | b a | | a
## 188 M682 52.04 a | | b b | | a
## 187 M678 52.29 b | | a a | | a
## 190 M684 52.29 a | | b a | | a
## 186 M677 52.83 b | | a a | | a
## 185 M674 52.83 a | | b a | | a
## 182 M654 54.01 a | | b a | | a
## 183 M657 54.01 a | | b b | | a
## 181 M640 54.43 a | | b a | | a
## 180 M641 54.43 b | | a a | | a
## 177 M616 55.59 a | | b a | | a
## 178 M622 55.59 a | | b a | | b
## 174 M610 56.02 b | | a a | | a
## 172 M606 56.16 a | | b a | | b
## 173 M608 56.16 b | | a b | | a
## 175 M612 56.28 a | | b b | | a
## 168 M592 57.39 b | | a a | | a
## 169 M593 57.39 a | | b a | | a
## 163 M565 58.74 a | | b b | | a
## 165 M576 59.51 b | | a b | | a
## 164 M566 60.45 b | | a a | | b
## 158 M530 62.58 a | | b b | | a
## 157 M528 62.66 b | | a a | | b
## 155 M525 62.99 b | | a a | | a
## 154 M522 63.24 a | | b b | | a
## 153 M521 63.24 b | | a b | | a
## 152 M520 63.24 b | | a a | | a
## 151 M518 63.33 a | | b b | | a
## 149 M515 64.42 b | | a a | | a
## 147 M512 64.77 b | | a a | | b
## 146 M510 65.00 a | | b a | | a
## 145 M501 65.68 b | | a a | | b
## 143 M494 66.08 a | | b b | | a
## 144 M497 66.23 a | | b a | | b
## 141 M483 67.41 a | | b a | | a
## 140 M477 67.60 b | | a b | | a
## 138 M462 68.31 a | | b a | | b
## 139 M464 68.43 a | | b b | | a
## 136 M458 68.43 b | | a a | | a
## 134 M450 69.05 b | | a b | | a
## 129 M419 70.85 b | | a a | | a
## 131 M426 70.85 b | | a b | | a
## 128 M416 71.09 a | | b a | | b
## 133 M447 72.40 b | | a a | | b
## 109 M321 79.39 a | | b a | | a
## 110 M322 79.39 a | | b b | | a
## 111 M323 79.39 b | | a a | | b
## 113 M329 79.64 a | | b b | | a
## 114 M331 79.64 b | | a a | | b
## 106 M310 80.58 b | | a a | | a
## 103 M301 81.78 a | | b a | | a
## 104 M302 81.89 a | | b a | | a
## 100 M288 82.96 a | | b a | | b
## 102 M297 82.96 a | | b a | | a
## 99 M287 83.13 b | | a b | | a
## 98 M282 83.32 b | | a b | | a
## 96 M277 83.53 a | | b a | | b
## 97 M279 83.53 b | | a b | | a
## 95 M276 83.97 b | | a b | | a
## 92 M263 84.15 b | | a b | | a
## 94 M271 84.39 a | | b b | | a
## 93 M270 84.39 b | | a a | | b
## 90 M256 85.26 a | | b a | | b
## 91 M260 85.31 a | | b a | | a
## 89 M254 85.47 b | | a a | | a
## 87 M250 85.88 a | | b a | | b
## 86 M249 85.88 b | | a a | | a
## 80 M235 87.01 b | | a b | | a
## 82 M238 87.01 a | | b a | | a
## 83 M240 87.01 a | | b a | | a
## 78 M231 87.32 a | | b a | | a
## 77 M229 87.53 b | | a a | | a
## 76 M222 87.53 a | | b a | | a
## 74 M212 87.82 a | | b a | | a
## 75 M214 87.82 b | | a a | | a
## 71 M202 88.27 a | | b a | | a
## 72 M203 88.31 a | | b a | | a
## 73 M210 88.31 b | | a a | | a
## 70 M201 88.52 b | | a a | | b
## 67 M186 89.70 b | | a a | | a
## 66 M180 90.53 a | | b a | | a
## 65 M171 90.93 b | | a a | | a
## 63 M159 91.68 a | | b a | | a
## 61 M153 92.19 a | | b b | | a
## 62 M155 92.30 a | | b a | | a
## 59 M151 92.34 a | | b b | | a
## 57 M144 92.79 a | | b a | | b
## 53 M136 93.74 b | | a a | | b
## 50 M130 93.91 b | | a a | | a
## 51 M132 93.98 b | | a a | | a
## 48 M124 94.40 a | | b a | | a
## 46 M106 96.10 a | | b b | | a
## 45 M101 96.42 b | | a a | | a
## 40 M89 97.17 b | | a b | | a
## 41 M90 97.24 b | | a a | | a
## 42 M92 97.29 b | | a a | | b
## 43 M94 97.46 b | | a a | | a
## 33 M67 98.91 b | | a a | | b
## 35 M70 98.91 b | | a b | | a
## 34 M68 98.91 a | | b b | | a
## 38 M80 99.13 a | | b a | | a
## 36 M72 99.27 a | | b a | | b
## 31 M58 99.88 a | | b a | | a
## 25 M50 100.15 a | | b a | | b
## 28 M54 100.21 b | | a b | | a
## 27 M52 100.21 a | | b a | | b
## 32 M61 100.21 b | | a a | | a
## 29 M55 100.21 a | | b a | | b
## 30 M57 100.21 a | | b a | | a
## 24 M46 100.60 b | | a b | | a
## 22 M42 100.75 a | | b a | | b
## 21 M40 100.91 b | | a a | | b
## 20 M39 101.13 b | | a a | | a
## 16 M32 101.46 a | | b a | | a
## 15 M31 101.46 b | | a a | | a
## 19 M37 101.60 a | | b a | | b
## 17 M34 101.61 b | | a b | | a
## 18 M36 101.61 a | | b a | | b
## 13 M25 102.38 b | | a a | | a
## 14 M27 102.38 b | | a a | | b
## 12 M23 102.62 b | | a a | | a
## 11 M21 102.70 b | | a a | | b
## 10 M19 102.70 a | | b a | | a
## 9 M18 102.97 a | | b b | | a
## 7 M15 103.26 a | | b a | | a
## 6 M13 103.26 b | | a a | | a
## 4 M8 103.41 b | | a b | | a
## 5 M9 103.41 a | | b a | | a
##
## 277 markers log-likelihood: -10128.59
The maps were simulated to be 100cM, which we come very close to. However, the markers in the simulated map are also ordered by their name, so M1 -> M2 -> M3 et cetera. We can spot some errors in the results, which can be improved in the next section.
As we saw at the end of the previous section, the markers still have some order error. While we can probably never recover the true map, we can expend resources (CPU time) to improve the current order. To do this, we can supply an ordering function to map.overlapping.batches()
using the fun.ord
argument. Currently there exists an umbrella function called ripple.ord()
that should be supplied to this argument. This function will go through sliding windows within each batch and test alternative orders according to a given rule set. If an order improves the map likelihood, it is kept. The default and recommended ruleset is called “one”, and will test each pairwise marker swap within a window. Further, a number of alternative orders can be considered in parallel. This is controlled by the ripple.cores
argument. Note that the total number of threads used, will be ripple.cores
* phase.cores
.
How many cores will I need?
Depending on the rule set and window size that ripple.ord()
uses, the number of comparisons can be calculated. Let the \(w\) be the window size:
The rule set “random” can be supplied with the number of desired alternative orders. Let’s consider a window size of 4 for our dataset. We will need to test \(\frac{3 * 2}{2} = 6\) alternative order per window. On a machine with 16 threads available, a good combination would be phase.cores=3
and ripple.cores=6
. This comes to a maximum of 18 threads, but on average less are going to be used as often no more than two phases are plausible and even considered in the model. I am writing this vignette on a laptop with four cores available, which I will all use for ripple.cores
, setting phase.cores
to one. The rule set used by ripple.ord()
is controlled by the method
argument, the window size by the ws
argument. Even with only about 100 markers, this function can take some time, it is advised you don’t run it here.:
# The result of this function is cached
rip_LG1_d1.10 <- map.overlapping.batches(input.seq = LG1_d1.10,
size = batch_size_LG1_d1.10,
phase.cores = 4,
overlap = 30,
fun.order = ripple.ord,
ripple.cores = 10,
method = "one",
min.tries = 1,
ws = 4)
We can evaluate the number of mistakes in the order, because the true order is known in the simulated dataset:
err_rate <- function(seq)
{
# Get the marker position
s_num <- seq$seq.num
# If the sequence is reverse, turn it around
if(cor(s_num, 1:length(s_num)) < 0)
s_num <- rev(s_num)
# Get the number of misorders and divide by the total length
sum(order(s_num) - 1:length(s_num) != 0) / length(s_num)
}
c("BatchMap" = err_rate(map_LG1_d1.10$Map),
"RippleBatchMap" = err_rate(rip_LG1_d1.10$Map))
## BatchMap RippleBatchMap
## 0.4765343 0.3898917