This vignette gives an introduction to the R package gsrc. It explains the overall workflow and provides details about important steps in the pipeline. The goal is to obtain genotypes, copy number variations (CNVs) and translocations. These can be used for association studies later on.
We demonstrate the process with our own data (Brassica napus) from the package brassicaData. The raw data files are too large to be included for all samples, so we add the raw data of two samples for demonstration purposes. The remainder of our data set is included as processed R data.
library(gsrc)
require(devtools)
## Loading required package: devtools
devtools::install_github("grafab/brassicaData")
## Downloading GitHub repo grafab/brassicaData@master
## Installing brassicaData
## '/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore \
## CMD INSTALL \
## '/tmp/RtmpjLl27o/devtools170c54a454db/grafab-brassicaData-968c6a4' \
## --library='/tmp/RtmpdwUohh/Rinst16f91eef0744' --install-tests
##
One data source for this package is idat files. The user might want to use list.files to read in all files from a directory. The red and green signal files should then appear in alternating order, because both files of a sample share the same prefix and list.files sorts its result alphabetically.
files <- list.files("/YOUR/DATA/REPOSITORY/",
pattern = "idat",full.names = TRUE)
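Before reading the files it can be worth checking that they really alternate between green and red. The following is a minimal sketch with made-up file names; it assumes the `_Grn.idat` / `_Red.idat` naming seen in our example data, and the variable names are hypothetical rather than part of gsrc.

```r
# Hypothetical file names, deliberately out of order
files_demo <- c("3999858026_R06C02_Red.idat", "3999858026_R10C02_Grn.idat",
                "3999858026_R06C02_Grn.idat", "3999858026_R10C02_Red.idat")

# Sorting places "Grn" before "Red" for each shared prefix
files_demo <- sort(files_demo)

# Strip the channel suffix to recover the shared prefix
prefix <- sub("_(Grn|Red)\\.idat$", "", files_demo)
is_grn <- grepl("_Grn\\.idat$", files_demo)
is_red <- grepl("_Red\\.idat$", files_demo)

odd  <- seq(1, length(files_demo), by = 2)
even <- seq(2, length(files_demo), by = 2)

# Every odd entry should be green, every even entry red, with matching prefixes
paired_ok <- length(files_demo) %% 2 == 0 &&
  all(is_grn[odd]) && all(is_red[even]) &&
  all(prefix[odd] == prefix[even])
paired_ok
```

If `paired_ok` is `FALSE`, a channel file is probably missing or misnamed.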
We load our example data:
files <- list.files(system.file("extdata",
package = "brassicaData"),
full.names = TRUE,
pattern = "idat")
idat files usually have cryptic names. To get sample names that are easier to interpret, we read in the sample sheets with read_sample_sheets.
samples <- read_sample_sheets(files =
list.files(system.file("extdata",package = "brassicaData"),
full.names = TRUE,
pattern = "csv"))
Users might want to remove all control samples (e.g. H2O) and update files accordingly. For instance:
controls <- grep("H2O", samples$Names)
if(length(controls) > 0) samples <- samples[-controls, ]
files <- grep(paste(samples$ID, collapse = "|"), files, value = TRUE)
files contains the full path names of the idat files. We trim them to the actual file names and use these as column names for our raw data. On Unix file systems this can be done like this:
column_names <- sapply(strsplit(files, split = "/"), FUN=function(x) x[length(x)])
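Base R's `basename` gives the same result without hard-coding the path separator, so it may be the more portable choice. The paths below are hypothetical and serve only as an illustration.

```r
# Hypothetical paths, for illustration only
files_demo <- c("/data/chips/3999858026_R06C02_Grn.idat",
                "/data/chips/3999858026_R06C02_Red.idat")

# basename() drops everything up to and including the last path separator
column_names_demo <- basename(files_demo)
column_names_demo
```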
Next we load dictionary and chrPos. dictionary is an R object that translates the cryptic SNP identifiers in the idat files into meaningful SNP names. chrPos provides chromosome and position information for the SNPs. We provide multiple files, because there are different ways to locate the SNPs on the genomes.
data(dictionary, package = "brassicaData", envir = environment())
data(chrPos, package = "brassicaData", envir = environment())
It is advantageous to load the positions before the data. SNPs with unknown positions are usually not of interest and can be excluded from the analysis. The earlier they are removed, the more computation time and memory are saved.
One indicator for data quality is the number of beads. The number of beads is included in the idat file and describes how many beads per signal have been used for each sample. In our data sets we see that the number of beads follows a bell-shaped distribution. Signals with a low number of beads (e.g. < 5) can be filtered out to increase the confidence of the value.
Similar to the filtering of the bead number, we can filter out signals based on the standard deviation. A high standard deviation indicates doubtful results.
If a signal fails one of these filters, it should be set to NA. If a SNP fails in multiple samples, it should be filtered out entirely.
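The two filtering rules can be sketched on toy matrices. All values, the bead threshold `min_beads` and the missingness cut-off `max_missing` are hypothetical; this is not how gsrc implements the filter internally.

```r
# Toy data: 4 SNPs (rows) x 3 signals (columns); all values are made up
signal <- matrix(c(12000, 300, 9000, 15000,
                   11000, 250, 8700, 400,
                   11800, 280, NA,   14900), nrow = 4)
beads  <- matrix(c(14, 3, 16, 12,
                   15, 4, 13, 2,
                   13, 2, 17, 11), nrow = 4)

min_beads   <- 5    # hypothetical bead-count threshold
max_missing <- 0.5  # hypothetical maximum fraction of missing signals per SNP

# Mask individual signals that rest on too few beads
filtered <- signal
filtered[beads < min_beads] <- NA

# Drop SNPs that are missing in more than max_missing of the signals
keep <- rowMeans(is.na(filtered)) <= max_missing
filtered <- filtered[keep, , drop = FALSE]
dim(filtered)
```

A filter on the standard deviation would follow the same pattern, with the comparison reversed.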
read_intensities is a wrapper around the readIDAT function from illuminaio.
raw_data <- read_intensities(files = files,
dict = dictionary,
cnames = column_names,
pos = chrPos)
We read in the idat files and obtained a new object, raw_data. Inspection shows that it is a list containing the information we provided (e.g. positions and chromosomes) and the raw data values from the idat files. It also shows the number of SNPs and samples.
str(raw_data)
## List of 7
## $ chr : chr [1:47805] "A01" "A01" "A01" "A01" ...
## $ pos : int [1:47805] 5435 6089 11833 70979 80347 88548 89913 95658 97603 97691 ...
## $ raw : int [1:47805, 1:4] 32497 29445 29263 17858 6587 1219 1275 25080 26646 25020 ...
## $ snps : chr [1:47805] "Bn-A01-p10000230" "Bn-A01-p1001022" "Bn-A01-p10013185" "Bn-A01-p10026020" ...
## $ beads : logi [1, 1] NA
## $ sd : logi [1, 1] NA
## $ samples: chr [1:4] "3999858026_R06C02_Grn.idat" "3999858026_R06C02_Red.idat" "3999858026_R10C02_Grn.idat" "3999858026_R10C02_Red.idat"
## - attr(*, "class")= chr "raw_data"
We rename the samples to get meaningful names and improve interpretability of the data.
raw_data <- rename_samples(raw_data,
samples = samples[,2:1],
suffix = c("_Grn", "_Red"))
These steps have been applied to our full data set. To load the full data set, use data:
data(raw_napus, package = "brassicaData", envir = environment())
We have a look at the raw data values. The histogram shows the combined red and green values for each sample. Outliers on the left side should be inspected and probably filtered out. check_raw returns the indices of the green and red columns of all samples that fall below the threshold.
check_raw(raw_napus, thresh = 28000, breaks = 20)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 21 22 25 26 29
## [24] 30
To the right of our threshold we see “normal” samples. The samples to the left of it have a reduced mean signal. They could carry many deletions (e.g. resynthesized samples), or the sample preparation went wrong. In either case we want to filter them out. To remove them, we use filt_samp:
length(raw_napus$samples)
## [1] 304
raw_napus <- filt_samp(raw_napus, check_raw(raw = raw_napus, plot = FALSE, thresh = 28000))
length(raw_napus$samples)
## [1] 280
Now that we have read in the raw data, it is time to combine the green and red signals for each sample.
The intensities are quite different for the two channels.
boxplot(as.vector(raw_napus$raw[, seq(1, length(raw_napus$samples), 2)]),
as.vector(raw_napus$raw[, seq(2, length(raw_napus$samples), 2)]),
names = c("Green", "Red"))
Boxplot comparing green and red signal distributions. Green values are generally lower because the chemical reagents behave differently.
In the check_raw plot we saw that there is also a difference between the samples. We account for both effects by normalization. We provide four strategies:
The latter one performs a quantile normalization between the red and green signals within each sample and then a mean normalization between all samples. This is recommended if you have “strange” samples (e.g. resynthesized samples) where you expect many deletions. They often have a different signal distribution and should not be quantile normalized together with “normal” samples. We recommend using as many samples as possible for the normalization. The best choice is a diversity set, because crossing populations are biased.
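The two-step strategy can be sketched in base R on toy values. This is an illustration under assumptions, not the gsrc implementation: the helper `quantile_pair` is hypothetical, and the mean normalization is assumed to be an additive shift that gives every sample the same mean.

```r
# Toy green/red raw values for two samples (4 SNPs each); values are made up
grn <- cbind(s1 = c(100, 400, 200, 300), s2 = c(150, 600, 300, 450))
red <- cbind(s1 = c(900, 500, 700, 1100), s2 = c(1000, 600, 800, 1200))

# Quantile-normalize two vectors: both receive the mean of the sorted values,
# assigned back according to each vector's own ranks
quantile_pair <- function(a, b) {
  target <- (sort(a) + sort(b)) / 2
  list(a = target[rank(a, ties.method = "first")],
       b = target[rank(b, ties.method = "first")])
}

# Step 1: quantile normalization between green and red within each sample
for (j in seq_len(ncol(grn))) {
  qn <- quantile_pair(grn[, j], red[, j])
  grn[, j] <- qn$a
  red[, j] <- qn$b
}

# Step 2: mean normalization between samples (assumed to be a shift so that
# every sample ends up with the same mean)
sample_means <- colMeans(rbind(grn, red))
shift <- mean(sample_means) - sample_means
grn <- sweep(grn, 2, shift, "+")
red <- sweep(red, 2, shift, "+")

colMeans(rbind(grn, red))
```

After step 1 the green and red values of a sample share the same distribution; after step 2 all samples share the same mean, while each sample keeps its own distribution shape.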
The raw signals are heteroscedastic and a transformation is recommended. Again, multiple ways are implemented:
Gidskehaug et al. provide an illustrative comparison.
Each SNP behaves differently on a chip and we recommend scaling each SNP. Here we provide three ways:
The latter one subtracts from each signal the difference between the SNP mean \({\mu}_{i}\) and the mean of all signals \(\overline{\mu}\): \[{S}_{i,j} = {R}_{i,j} - \left({\mu}_{i} - \overline{\mu}\right)\]
where \({R}_{i,j}\) and \({S}_{i,j}\) are the raw and scaled values, respectively.
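A short numeric check of this scaling on a toy matrix (rows are SNPs, columns are samples; all values are made up) shows that every SNP ends up with the overall mean, while the differences between samples within a SNP are preserved.

```r
# Toy matrix of signals: rows are SNPs (i), columns are samples (j)
R <- matrix(c(10, 12, 14,
              20, 22, 24), nrow = 2, byrow = TRUE)

mu_i   <- rowMeans(R)  # mean of each SNP
mu_bar <- mean(R)      # mean of all signals

# S[i, j] = R[i, j] - (mu_i[i] - mu_bar); the vector recycles down the columns,
# so row i is shifted by the i-th element
S <- R - (mu_i - mu_bar)
rowMeans(S)
```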
Green and red signals each measure one of two alleles (e.g. A or T). To obtain genotypes and CNVs, we need to combine both signals.
Genotype information is described by the difference between the signals (\(\theta\)). High red and low green signal would indicate a homozygous “red genotype” and vice versa. Similar signals indicate heterozygous genotypes. There are different ways to calculate this \(\theta\). We use the atan2 method and divide by \(\frac{\pi}{2}\).
The signal intensity provides information about the signal strength for a SNP. Low and high values indicate deletions and duplications, respectively. We use a Minkowski distance to calculate intensity values from the two signals.
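Both quantities can be sketched for a handful of toy signals. Which channel is placed on which axis and the order `p` of the Minkowski distance are assumptions of this sketch; `minkowski` is a hypothetical helper, not a gsrc function.

```r
# Toy green/red signals for three SNPs in one sample; values are made up
grn <- c(12, 6, 1)
red <- c(1, 6, 12)

# Angle of the point (grn, red), rescaled from [0, pi/2] to [0, 1].
# The axis assignment is an assumption of this sketch.
theta <- atan2(red, grn) / (pi / 2)

# Minkowski distance of (grn, red) from the origin; the order p is a
# hypothetical parameter (p = 2 gives the Euclidean distance)
minkowski <- function(x, y, p = 2) (abs(x)^p + abs(y)^p)^(1 / p)
intensity <- minkowski(grn, red)

round(theta, 2)
```

With this axis assignment, equal signals give a theta of 0.5, a dominant green signal pushes theta towards 0 and a dominant red signal pushes it towards 1.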
norm_dat <- intens_theta(raw_napus, norm = "both", scaling = "mean", transf = "log")
str(norm_dat)
## List of 6
## $ samples : chr [1:140] "Sample_332_Grn" "Sample_328_Grn" "Sample_333_Grn" "Sample_329_Grn" ...
## $ snps : chr [1:40491] "Bn-A01-p10000230" "Bn-A01-p1001022" "Bn-A01-p10013185" "Bn-A01-p10026020" ...
## $ chr : chr [1:40491] "A01" "A01" "A01" "A01" ...
## $ pos : int [1:40491] 5435 6089 11833 80347 88548 89913 95658 97603 100381 100635 ...
## $ intensity: num [1:40491, 1:140] 13.6 13 13.6 12 12 ...
## $ theta : num [1:40491, 1:140] 0.501 0.58 0.496 0.389 0.578 ...
## - attr(*, "class")= chr "norm_data"
norm_dat contains the same sample, SNP and location information as the raw data object. In addition, two matrices, intensity and theta, have been added. By default the sample names have a suffix, which we can remove because it is not informative.
head(norm_dat$samples)
## [1] "Sample_332_Grn" "Sample_328_Grn" "Sample_333_Grn" "Sample_329_Grn"
## [5] "Sample_317_Grn" "Sample_334_Grn"
norm_dat <- remove_suffix(norm_dat, "_Grn")
head(norm_dat$samples)
## [1] "Sample_332" "Sample_328" "Sample_333" "Sample_329" "Sample_317"
## [6] "Sample_334"
We want to have a look at the data to make sure the transformation went well.
hist(norm_dat$intensity, breaks = 1000)
hist(norm_dat$theta, breaks = 1000)
We expect to see three peaks in the theta plot, one for the heterozygous and two for the homozygous SNPs. The distribution in the intensity plot depends on the population. Usually, we see one large peak representing the “normal” signal intensity. Values or even peaks on the left indicate deletions. A minimum between two peaks indicates a reasonable threshold for deletions.
We are satisfied with our data and can move on with processing. The raw data is no longer required and we can free some memory:
rm(raw_napus)
Theta and intensity values give a rough idea about genotypes and copy numbers. We can refine this by calling genotypes. Afterwards we are able to calculate B-Allele frequencies and Log R ratios.
We use a one dimensional k-means clustering from Ckmeans.1d.dp for the genotype calling. We treat every SNP as diploid and use a maximum of three clusters.
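The idea can be illustrated for a single SNP with base R's `kmeans` as a stand-in for Ckmeans.1d.dp, so that the sketch needs no extra package. Unlike the approach described above, it fixes the number of clusters at three instead of allowing up to three. The theta values and the mapping of clusters to genotype codes are assumptions.

```r
# Toy theta values for one SNP across nine samples; values are made up
theta_snp <- c(0.03, 0.05, 0.08, 0.47, 0.52, 0.55, 0.93, 0.96, 0.98)

# One-dimensional k-means with three clusters. Starting centres are placed at
# the expected genotype positions so that the result is reproducible.
fit <- kmeans(theta_snp, centers = c(0, 0.5, 1))

# Clusters keep the order of the starting centres, so 1, 2, 3 map to 0, 1, 2
geno <- fit$cluster - 1
geno
```

SNPs with fewer than three populated genotype classes need a variable number of clusters, which this stand-in does not handle.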
Based on the genotypes, we calculate B-Allele frequency and Log R ratio as described by Pfeiffer et al.
norm_dat <- geno_baf_rratio(norm_dat, delthresh = 11)
str(norm_dat)
## List of 9
## $ samples : chr [1:140] "Sample_332" "Sample_328" "Sample_333" "Sample_329" ...
## $ snps : chr [1:40491] "Bn-A01-p10000230" "Bn-A01-p1001022" "Bn-A01-p10013185" "Bn-A01-p10026020" ...
## $ chr : chr [1:40491] "A01" "A01" "A01" "A01" ...
## $ pos : int [1:40491] 5435 6089 11833 80347 88548 89913 95658 97603 100381 100635 ...
## $ intensity: num [1:40491, 1:140] 13.6 13 13.6 12 12 ...
## $ theta : num [1:40491, 1:140] 0.501 0.58 0.496 0.389 0.578 ...
## $ baf : num [1:40491, 1:140] NA 1 NA NA NA ...
## $ geno : num [1:40491, 1:140] NA 2 NA NA NA 1 2 2 NA 2 ...
## $ rratio : num [1:40491, 1:140] NA -0.00209 NA NA NA ...
## - attr(*, "class")= chr "norm_data"
We see three new matrices in norm_dat: baf, geno and rratio.
Again we have a look at the data:
hist(norm_dat$baf, breaks = 1000)
tmp <- table(norm_dat$geno, useNA = "ifany")
barplot(tmp, names.arg = c(names(tmp)[1:4], "NA"))
hist(norm_dat$rratio, breaks = 1000)
The large peaks on the left and right side of the BAF plot indicate that most values are homozygous. The little bump at 0.5 indicates a small proportion of heterozygous SNPs. The right bar in the barplot shows missing values (genotypes that could not be called). The large peak in the R ratio plot indicates that most SNPs are neither deleted nor duplicated.
We are satisfied with our B-Allele frequencies and Log R ratios. Hence, we remove the theta and intensity values to free some memory.
norm_dat$theta <- norm_dat$intensity <- NULL
We filter out SNPs that could not be genotyped properly.
length(norm_dat$snps)
## [1] 40491
norm_dat <- filt_snps(norm_dat, norm_dat$snps[is.na(rowMeans(norm_dat$baf, na.rm = TRUE))])
length(norm_dat$snps)
## [1] 32184
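The filter expression relies on a small base R idiom: the mean of a row that contains only NA is NaN even with `na.rm = TRUE`, and `is.na(NaN)` is `TRUE`. A toy matrix with hypothetical SNP names illustrates which rows are selected.

```r
# Toy BAF matrix: 3 SNPs x 3 samples; the second SNP was never called
baf_demo <- matrix(c(0,   NA, 1,
                     0.5, NA, NA,
                     1,   NA, 0.9), nrow = 3)
rownames(baf_demo) <- c("snp_a", "snp_b", "snp_c")  # hypothetical names

# Rows with at least one value get a finite mean; all-NA rows yield NaN
never_called <- is.na(rowMeans(baf_demo, na.rm = TRUE))
rownames(baf_demo)[never_called]
```

Only SNPs without a single usable value are removed; SNPs that are missing in some samples are kept.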
In order to call CNVs we first separate each chromosome into blocks. We provide a wrapper to methods from the R package DNAcopy. segm segments the data into continuous blocks of similar Log R ratio. We separate this step from the CNV calling because it is computationally expensive. That way we can call CNVs with varying thresholds without repeating the segmentation step.
norm_dat <- segm(norm_dat)
## Warning in DNAcopy::CNA(genomdat = dat$rratio, chrom = dat$chr, maploc = dat$pos, : array has repeated maploc positions
str(norm_dat)
## List of 9
## $ samples : chr [1:140] "Sample_332" "Sample_328" "Sample_333" "Sample_329" ...
## $ snps : chr [1:32184] "Bn-A01-p1001022" "Bn-A01-p10031169" "Bn-A01-p10036602" "Bn-A01-p100441" ...
## $ chr : chr [1:32184] "A01" "A01" "A01" "A01" ...
## $ pos : int [1:32184] 6089 89913 95658 97603 100635 100921 101034 101391 101819 102098 ...
## $ intensity: num [1:32184, 1:140] 13 13.8 13 13 13.2 ...
## $ baf : num [1:32184, 1:140] 1 0.518 1 0.986 0.962 ...
## $ geno : num [1:32184, 1:140] 2 1 2 2 2 2 2 0 0 2 ...
## $ rratio : num [1:32184, 1:140] -0.00209 0.00546 -0.00571 -0.00607 -0.0044 ...
## $ cna :'data.frame': 32945 obs. of 6 variables:
## ..$ ID : chr [1:32945] "Sample_332" "Sample_332" "Sample_332" "Sample_332" ...
## ..$ chrom : Factor w/ 19 levels "A01","A02","A03",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ loc.start: int [1:32945] 6089 14102936 14143364 14176341 18472417 18490451 24390933 24495043 26567822 26649357 ...
## ..$ loc.end : int [1:32945] 14088047 14142570 14174789 18471082 18487395 24387698 24471027 26551060 26648005 28480338 ...
## ..$ num.mark : num [1:32945] 982 5 7 228 5 344 5 120 8 118 ...
## ..$ seg.mean : num [1:32945] 0.0015 -0.0457 -0.1473 0.0027 -0.0286 ...
## - attr(*, "class")= chr "norm_data"
Now, norm_dat has a cna object. It contains all segments for all samples. To call CNVs we use cnv:
norm_dat <- cnv(norm_dat, dup = 0.03, del = -0.06)
str(norm_dat)
## List of 10
## $ samples : chr [1:140] "Sample_332" "Sample_328" "Sample_333" "Sample_329" ...
## $ snps : chr [1:32184] "Bn-A01-p1001022" "Bn-A01-p10031169" "Bn-A01-p10036602" "Bn-A01-p100441" ...
## $ chr : chr [1:32184] "A01" "A01" "A01" "A01" ...
## $ pos : int [1:32184] 6089 89913 95658 97603 100635 100921 101034 101391 101819 102098 ...
## $ intensity: num [1:32184, 1:140] 13 13.8 13 13 13.2 ...
## $ baf : num [1:32184, 1:140] 1 0.518 1 0.986 0.962 ...
## $ geno : num [1:32184, 1:140] 2 1 2 2 2 2 2 0 0 2 ...
## $ rratio : num [1:32184, 1:140] -0.00209 0.00546 -0.00571 -0.00607 -0.0044 ...
## $ cna :'data.frame': 32945 obs. of 6 variables:
## ..$ ID : chr [1:32945] "Sample_332" "Sample_332" "Sample_332" "Sample_332" ...
## ..$ chrom : Factor w/ 19 levels "A01","A02","A03",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..$ loc.start: int [1:32945] 6089 14102936 14143364 14176341 18472417 18490451 24390933 24495043 26567822 26649357 ...
## ..$ loc.end : int [1:32945] 14088047 14142570 14174789 18471082 18487395 24387698 24471027 26551060 26648005 28480338 ...
## ..$ num.mark : num [1:32945] 982 5 7 228 5 344 5 120 8 118 ...
## ..$ seg.mean : num [1:32945] 0.0015 -0.0457 -0.1473 0.0027 -0.0286 ...
## $ cnv : num [1:32184, 1:140] 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "class")= chr "norm_data"
We added a cnv object to norm_dat. It contains the CNV calls for all SNPs and samples.
barplot(table(norm_dat$cnv))
-1, 0 and 1 are deletions, normal calls and duplications, respectively.
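How the two thresholds translate segment means into calls can be sketched as follows. The rule shown here (strictly above `dup` is a duplication, strictly below `del` is a deletion) is an assumption about the behaviour of cnv, not taken from its source code.

```r
# Segment means taken from the first rows of the cna table shown above
seg_mean <- c(0.0015, -0.0457, -0.1473, 0.0027, -0.0286)

dup <- 0.03   # same thresholds as in the cnv() call above
del <- -0.06

# Assumed rule: above dup is a duplication, below del is a deletion
cnv_call <- ifelse(seg_mean > dup, 1, ifelse(seg_mean < del, -1, 0))
cnv_call
```

Because the segmentation is stored in cna, different values for `dup` and `del` can be tried without repeating segm.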
We can call translocations from the CNV data. We require at least 5 SNPs to be duplicated/deleted to increase the quality of our prediction. We create a synteny block object, as explained in the synteny block vignette.
data(synteny_blocks, package = "brassicaData", envir = environment())
norm_dat <- trans_location(norm_dat, synteny_blocks, min = 5)
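The minimum-size requirement can be pictured with a run-length encoding of CNV calls along one chromosome. This sketch does not reproduce trans_location, which additionally uses the synteny blocks; it only illustrates the filter, under the assumption that the 5 SNPs must be consecutive.

```r
# Toy CNV calls along one chromosome (-1 deletion, 0 normal, 1 duplication)
cnv_calls <- c(0, 0, 1, 1, 0, -1, -1, -1, -1, -1, -1, 0, 1, 1, 1, 1, 1, 0)
min_snps  <- 5   # same value as the min argument above

# Collapse the calls into runs of identical values
runs   <- rle(cnv_calls)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only non-normal runs that span at least min_snps SNPs
keep <- runs$values != 0 & runs$lengths >= min_snps
candidates <- data.frame(start = starts[keep],
                         end   = ends[keep],
                         call  = runs$values[keep])
candidates
```

The short duplication at positions 3 to 4 is discarded, while the longer deletion and duplication remain as candidates.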
We completed all necessary data processing steps.
Now, we look at our results:
plot_gsr(norm_dat, sb = synteny_blocks, samp = 1)
Log R ratios of A and C chromosomes are plotted on top and bottom, respectively. Grey, red and green indicate normal, deleted and duplicated SNPs, respectively. In between are synteny blocks indicating homeology between the two subgenomes. The colors correspond to the A genome.
We can add the B-Allele frequency and translocations:
plot_gsr(norm_dat, sb = synteny_blocks, samp = 1, baf = TRUE, tl = TRUE)
Same plot as before, but with B-Allele frequencies and translocations.
In addition to individual samples, we can plot the mean values for the whole data set. This allows us to find deletion and duplication hotspots.
plot_global(norm_dat, sb = synteny_blocks)
Global plot with mean values of all samples.