The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Format 5 is a binary dosage file format designed for reading imputed genotype data from bgzipped, tabix-indexed VCF files — such as those returned by the Michigan Imputation Server using minimac.
Format 5 differs from earlier formats (1–4) in two important ways.
First, it uses per-SNP gzip compression. Each SNP’s dosage and genotype probability values are compressed independently and written as a contiguous block in the data file. This means any SNP can be decompressed in isolation without reading surrounding data.
Second, the metadata — sample IDs, the SNP table, byte offsets, and file information — is stored in a companion RDS file rather than embedded in a binary header. This makes the metadata immediately accessible from R without reading the data file.
A Format 5 dataset consists of two files.
.bdose — the data file. Begins with a
4-byte magic number followed by one gzip-compressed block per SNP..bdose.bdi — an RDS file containing a
list of class "genetic-info" with the sample table, SNP
table, byte offsets, and format metadata. The name is always the
.bdose filename with .bdi appended (e.g.
set1a.bdose.bdi). This structure is identical to the one
returned by getbdinfo, getvcfinfo, and
getgeninfo for other file types.The BinaryDosage package includes a small bgzipped VCF file, set1a.vcf.gz, which contains dosage (DS) and genotype probability (GP) data for 60 subjects and 10 SNPs on chromosome 1. This file is used in all examples below.
The output files in each example are written to a temporary directory so that nothing is left behind after the vignette runs.
The vcftobd function converts a bgzipped VCF file to a
Format 5 file pair. The VCF file must contain the FORMAT fields
DS (dosage) and GP (genotype probabilities),
as produced by minimac and the Michigan Imputation Server.
The function takes the following parameters.
vcffile — path to the bgzipped VCF file.bdose_file — path for the output .bdose
file. The companion .bdi metadata file is written
automatically to paste0(bdose_file, ".bdi").region — optional genomic region string in bcftools
format (e.g. "chr21" or "chr21:1-5000000").
Requires a tabix index. Default NULL processes the entire
file.The following code converts set1a.vcf.gz to a Format 5 file pair.
if (requireNamespace("vcfppR", quietly = TRUE)) {
vcftobd(vcffile = vcf_file, bdose_file = bdose_file)
} else {
updatebd(bdfiles = bd4_file, bdose_file = bdose_file)
}To convert only a specific region of the file, supply the
region parameter. The VCF file must have a tabix index (a
.tbi file alongside it).
The getbdinfo function reads and validates a Format 5
file pair, returning a list of class "genetic-info" that is
used by getbd5snp to retrieve individual SNPs. It
automatically detects the Format 5 magic header and delegates to the
internal Format 5 reader.
The function takes the following parameters.
bdose_file — path to the .bdose file. The
companion .bdi file is read automatically from
paste0(bdose_file, ".bdi").The object returned has the same structure as the list returned by
getbdinfo for older formats. For a full description of all
fields see Genetic File Information.
The key fields are described below.
cat("Class: ", class(bd5), "\n")
#> Class: genetic-info
cat("Uses family ID: ", bd5$usesfid, "\n")
#> Uses family ID: FALSE
cat("One chromosome: ", bd5$onechr, "\n")
#> One chromosome: TRUE
cat("SNP ID format: ", bd5$snpidformat, "\n")
#> SNP ID format: 2
cat("Format: ", bd5$additionalinfo$format, "\n")
#> Format: 5
cat("Header size: ", bd5$additionalinfo$headersize, "bytes\n")
#> Header size: 4 bytesThe samples element is a data frame with one row per
subject. The fid column holds the family ID (empty for VCF
files, which carry no family information) and the sid
column holds the subject ID.
cat("Number of samples:", nrow(bd5$samples), "\n")
#> Number of samples: 60
knitr::kable(head(bd5$samples, 5), caption = "First 5 samples")| fid | sid |
|---|---|
| I1 | |
| I2 | |
| I3 | |
| I4 | |
| I5 |
The snps element is a data frame with one row per
SNP.
cat("Number of SNPs:", nrow(bd5$snps), "\n")
#> Number of SNPs: 10
knitr::kable(bd5$snps, caption = "SNP table")| chromosome | location | snpid | reference | alternate |
|---|---|---|---|---|
| 1 | 10000 | 1:10000:C:A | C | A |
| 1 | 11000 | 1:11000:T:C | T | C |
| 1 | 12000 | 1:12000:T:C | T | C |
| 1 | 13000 | 1:13000:T:C | T | C |
| 1 | 14000 | 1:14000:G:C | G | C |
| 1 | 15000 | 1:15000:A:C | A | C |
| 1 | 16000 | 1:16000:G:A | G | A |
| 1 | 17000 | 1:17000:C:A | C | A |
| 1 | 18000 | 1:18000:C:G | C | G |
| 1 | 19000 | 1:19000:T:G | T | G |
The getbd5snp function decompresses and returns the
dosage and genotype probability data for a single SNP. The
general-purpose getsnp function also works with Format 5
files and can be used as an alternative.
For details on all three Format 5 SNP reader functions — including
getbd5snp_buf (pre-allocated output vectors) and
getbd5snp_con (persistent file connection for
high-throughput loops) — see the Reading
SNPs from Format 5 Files vignette.
The function takes the following parameters.
bd5info — object returned by
getbdinfo.snp — the SNP to retrieve, either as a 1-based integer
index or as a character SNP ID matching a value in
bd5info$snps$snpid.The function returns a list with four numeric vectors, each of length equal to the number of samples.
dosage — DS values in \[0,
2\]; NA = missing.p0 — \(\Pr(g=0)\)
values in \[0, 1\]; NA =
missing.p1 — \(\Pr(g=1)\)
values in \[0, 1\]; NA =
missing.p2 — \(\Pr(g=2)\)
values in \[0, 1\]; NA =
missing.The following code retrieves the first SNP by its 1-based index.
snp1 <- getbd5snp(bd5info = bd5, snp = 1L)
snp1_df <- data.frame(
SampleID = bd5$samples$sid,
Dosage = round(snp1$dosage, 4),
P_00 = round(snp1$p0, 4),
P_01 = round(snp1$p1, 4),
P_11 = round(snp1$p2, 4)
)
knitr::kable(head(snp1_df, 10),
caption = paste("SNP", bd5$snps$snpid[1], "— first 10 subjects"))| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 2.000 | 0.000 | 0.000 | 1.000 |
| I2 | 1.002 | 0.000 | 0.998 | 0.002 |
| I3 | 1.000 | 0.000 | 1.000 | 0.000 |
| I4 | 0.122 | 0.878 | 0.122 | 0.000 |
| I5 | 1.000 | 0.000 | 1.000 | 0.000 |
| I6 | 0.000 | 1.000 | 0.000 | 0.000 |
| I7 | 1.004 | 0.000 | 0.996 | 0.004 |
| I8 | 2.000 | 0.000 | 0.000 | 1.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.918 | 0.082 | 0.918 | 0.000 |
A SNP can also be retrieved by passing its SNP ID as a character string.
snp_id <- "1:12000:T:C"
snp3 <- getbd5snp(bd5info = bd5, snp = snp_id)
snp3_df <- data.frame(
SampleID = bd5$samples$sid,
Dosage = round(snp3$dosage, 4),
P_00 = round(snp3$p0, 4),
P_01 = round(snp3$p1, 4),
P_11 = round(snp3$p2, 4)
)
knitr::kable(head(snp3_df, 10),
caption = paste("SNP", snp_id, "— first 10 subjects"))| SampleID | Dosage | P_00 | P_01 | P_11 |
|---|---|---|---|---|
| I1 | 1.000 | 0.000 | 1.000 | 0.000 |
| I2 | 0.000 | 1.000 | 0.000 | 0.000 |
| I3 | 0.000 | 1.000 | 0.000 | 0.000 |
| I4 | 0.000 | 1.000 | 0.000 | 0.000 |
| I5 | 0.000 | 1.000 | 0.000 | 0.000 |
| I6 | 1.016 | 0.012 | 0.959 | 0.029 |
| I7 | 0.000 | 1.000 | 0.000 | 0.000 |
| I8 | 0.000 | 1.000 | 0.000 | 0.000 |
| I9 | 0.000 | 1.000 | 0.000 | 0.000 |
| I10 | 0.000 | 1.000 | 0.000 | 0.000 |
The simplest option is bdapply, which handles the loop
internally and uses getbd5snp_con automatically for Format
5 files.
getaaf <- function(dosage, p0, p1, p2) mean(dosage, na.rm = TRUE) / 2
aaf_list <- bdapply(bdinfo = bd5, func = getaaf)
aaf_table <- data.frame(snpid = bd5$snps$snpid,
aaf = round(unlist(aaf_list), 4))
knitr::kable(aaf_table, caption = "Alternate allele frequencies via bdapply")| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
For manual loops, getbd5snp_buf and
getbd5snp_con are faster than getbd5snp
because they avoid allocating a new list on every call.
getbd5snp_con is the fastest option as it also keeps the
file open across all iterations. See the Reading SNPs from Format 5 Files vignette
for details and a performance comparison.
n_snps <- nrow(bd5$snps)
n_samp <- nrow(bd5$samples)
dosage <- numeric(n_samp)
p0 <- numeric(n_samp)
p1 <- numeric(n_samp)
p2 <- numeric(n_samp)
con <- openbd5con(bd5)
aaf <- numeric(n_snps)
for (i in seq_len(n_snps)) {
getbd5snp_con(bd5info = bd5, snp = i,
dosage = dosage, p0 = p0, p1 = p1, p2 = p2,
bd5con = con)
aaf[i] <- mean(dosage, na.rm = TRUE) / 2
}
closebd5con(con)
knitr::kable(data.frame(snpid = bd5$snps$snpid, aaf = round(aaf, 4)),
caption = "Alternate allele frequencies via getbd5snp_con")| snpid | aaf |
|---|---|
| 1:10000:C:A | 0.3468 |
| 1:11000:T:C | 0.0159 |
| 1:12000:T:C | 0.2651 |
| 1:13000:T:C | 0.2966 |
| 1:14000:G:C | 0.2040 |
| 1:15000:A:C | 0.5236 |
| 1:16000:G:A | 0.4246 |
| 1:17000:C:A | 0.4434 |
| 1:18000:C:G | 0.2655 |
| 1:19000:T:G | 0.2446 |
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.