The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The BinaryDosage package provides three functions for reading individual SNPs from Format 5 binary dosage files. They return the same data but differ in how they manage memory and file connections, which matters when reading many SNPs in a loop.
| Function | Allocates output? | Opens file per call? |
|---|---|---|
getbd5snp |
Yes — returns a new list | Yes |
getbd5snp_buf |
No — writes into caller vectors | Yes |
getbd5snp_con |
No — writes into caller vectors | No — reuses open connection |
All three functions require the "genetic-info" list
returned by getbdinfo.
The examples below use the small bgzipped VCF file included with the package. First, convert it to a Format 5 file pair and load the metadata.
bdose_file <- file.path(tempdir(), "set1a.bdose")
if (requireNamespace("vcfppR", quietly = TRUE)) {
vcftobd(vcffile = system.file("extdata", "set1a.vcf.gz", package = "BinaryDosage"),
bdose_file = bdose_file)
} else {
updatebd(bdfiles = system.file("extdata", "vcf1a.bdose", package = "BinaryDosage"),
bdose_file = bdose_file)
}
bd5 <- getbdinfo(bdose_file)
n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
cat("SNPs:", n_snps, " Samples:", n_samp, "\n")
#> SNPs: 10 Samples: 60getbd5snp is the simplest interface. It opens the
.bdose file, seeks to the requested SNP’s compressed block,
decompresses it, decodes the values, and returns them as a new list. The
file is opened and closed on every call.
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDA list with four numeric vectors of length n_samples:
dosage — DS values in \[0,
2\]; NA = missingp0 — Pr(g=0) in \[0, 1\]; NA = missingp1 — Pr(g=1) in \[0, 1\]; NA = missingp2 — Pr(g=2) in \[0, 1\]; NA = missingsnp1 <- getbd5snp(bd5info = bd5, snp = 1L)
knitr::kable(
data.frame(
SampleID = head(bd5$samples$sid, 10),
Dosage = round(head(snp1$dosage, 10), 4),
P_00 = round(head(snp1$p0, 10), 4),
P_01 = round(head(snp1$p1, 10), 4),
P_11 = round(head(snp1$p2, 10), 4)
),
caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects")
)| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 2.000 | 0.000 | 0.000 | 1.000 |
| I2 | 1.002 | 0.000 | 0.998 | 0.002 |
| I3 | 1.000 | 0.000 | 1.000 | 0.000 |
| I4 | 0.122 | 0.878 | 0.122 | 0.000 |
| I5 | 1.000 | 0.000 | 1.000 | 0.000 |
| I6 | 0.000 | 1.000 | 0.000 | 0.000 |
| I7 | 1.004 | 0.000 | 0.996 | 0.004 |
| I8 | 2.000 | 0.000 | 0.000 | 1.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.918 | 0.082 | 0.918 | 0.000 |
snp3 <- getbd5snp(bd5info = bd5, snp = "1:12000:T:C")
knitr::kable(
data.frame(
SampleID = head(bd5$samples$sid, 10),
Dosage = round(head(snp3$dosage, 10), 4),
P_00 = round(head(snp3$p0, 10), 4),
P_01 = round(head(snp3$p1, 10), 4),
P_11 = round(head(snp3$p2, 10), 4)
),
caption = "SNP 1:12000:T:C — first 10 subjects"
)| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 1.000 | 0.000 | 1.000 | 0.000 |
| I2 | 0.000 | 1.000 | 0.000 | 0.000 |
| I3 | 0.000 | 1.000 | 0.000 | 0.000 |
| I4 | 0.000 | 1.000 | 0.000 | 0.000 |
| I5 | 0.000 | 1.000 | 0.000 | 0.000 |
| I6 | 1.016 | 0.012 | 0.959 | 0.029 |
| I7 | 0.000 | 1.000 | 0.000 | 0.000 |
| I8 | 0.000 | 1.000 | 0.000 | 0.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.000 | 1.000 | 0.000 | 0.000 |
getbd5snp is convenient but allocates a new list and
opens the file on every call. For a small number of SNPs this is fine.
The following calculates the alternate allele frequency for every
SNP.
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
aaf[i] <- mean(getbd5snp(bd5, i)$dosage, na.rm = TRUE) / 2
}
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
caption = "Alternate allele frequencies via getbd5snp"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp_buf eliminates the per-call list allocation by
writing results directly into four numeric vectors that the caller
pre-allocates once before the loop. This avoids repeated garbage
collection pressure when reading many SNPs. The file is still opened and
closed on every call.
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDdosage, p0, p1,
p2 — pre-allocated numeric(n_samples)
vectorsNULL invisibly. The four output vectors are updated in
place.
The output vectors must not have more than one R binding at the call site. If another variable also points to the same vector, R’s copy-on-modify rule will copy the vector before writing, so the caller’s variable will not be updated. Always use plain, freshly allocated vectors:
dosage <- numeric(n_samp)
p0 <- numeric(n_samp)
p1 <- numeric(n_samp)
p2 <- numeric(n_samp)
aaf_buf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
getbd5snp_buf(bd5info = bd5, snp = i, dosage = dosage,
p0 = p0, p1 = p1, p2 = p2)
aaf_buf[i] <- mean(dosage, na.rm = TRUE) / 2
}
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_buf, 4)),
caption = "Alternate allele frequencies via getbd5snp_buf"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp_con is the highest-performance option. It
combines the pre-allocated vector approach of getbd5snp_buf
with a persistent open file connection, so the .bdose file
is opened once before the loop and kept open for all reads. The C-level
read uses zlib directly rather than R’s memDecompress.
Use this when reading a large number of SNPs sequentially and minimizing elapsed time matters.
openbd5con(bd5info) to open the file and get a
"bd5con" object.getbd5snp_con inside the loop, passing the
"bd5con" object.closebd5con when finished to release the file
handle promptly. (The connection is also closed automatically when the
"bd5con" object is garbage-collected or when R exits, so
the explicit close is optional but recommended.)Opens a persistent connection to the .bdose file.
Parameters
bd5info — object returned by
getbdinfoReturn value
An object of class "bd5con" to be passed to
getbd5snp_con and closebd5con.
Explicitly closes the connection.
Parameters
bd5con — object returned by
openbd5conReturn value
NULL invisibly.
Parameters
bd5info — object returned by
getbdinfosnp — 1-based integer index or character SNP IDdosage, p0, p1,
p2 — pre-allocated numeric(n_samples)
vectorsbd5con — object returned by
openbd5conReturn value
NULL invisibly. The four output vectors are updated in
place.
The same copy-on-modify constraint as getbd5snp_buf
applies.
dosage <- numeric(n_samp)
p0 <- numeric(n_samp)
p1 <- numeric(n_samp)
p2 <- numeric(n_samp)
con <- openbd5con(bd5)
aaf_con <- numeric(n_snps)
for (i in seq_len(n_snps)) {
getbd5snp_con(bd5info = bd5, snp = i, dosage = dosage,
p0 = p0, p1 = p1, p2 = p2, bd5con = con)
aaf_con[i] <- mean(dosage, na.rm = TRUE) / 2
}
closebd5con(con)
knitr::kable(
data.frame(snpid = bd5$snps$snpid, aaf = round(aaf_con, 4)),
caption = "Alternate allele frequencies via getbd5snp_con"
)| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
getbd5snp — use when reading a small
number of SNPs, or when simplicity matters more than speed.getbd5snp_buf — use in loops where
allocation overhead is measurable but the file seek cost is
acceptable.getbd5snp_con — use when reading all
or most SNPs sequentially and elapsed time is important. The persistent
connection avoids both the allocation and the file open/close on every
iteration.All three functions produce identical results, as confirmed below.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.