This vignette presents additional information on the R package rehh by describing how to use it to perform whole-genome scans for footprints of selection using statistics related to the Extended Haplotype Homozygosity (EHH) (Sabeti et al. 2002). Importantly, the current implementation of the tests assumes that markers are bi-allelic.
The package is currently available for most platforms (Linux, MS Windows and MacOSX) from the CRAN repository and may be installed using the standard procedure. Once the package has been successfully installed on your system, it can be loaded using the following command:
library(rehh)
The package basically requires two input files: a haplotype data file and a SNP information file.
Important Note: For a given chromosome, SNPs are assumed to be ordered in the same way in both the haplotype file (columns) and the SNP information file.
For illustration purposes, example files that originate from a previously published study on the Creole cattle breed from Guadeloupe (CGU) (Gautier and Naves 2011) are provided in the package and can be copied in the working directory with the following command:
make.example.files()
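In the following examples, the command make.example.files() is assumed to have been run (see above) so that the example files are in the working directory.
Three haplotype input file formats are supported: the standard format (haplotypes in rows), the transposed format (haplotypes in columns) and the fastPHASE output format.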
By default, alleles are assumed to be coded as 0 (missing data), 1 (ancestral allele) or 2 (derived allele). Recoding of the alleles into this format, according to the SNP information data file, can be performed with the recode.allele option of the data2haplohh() function.
This data file should contain SNP information as in the example file map.inp created when running the function make.example.files(). Each line contains five columns corresponding to the SNP name, the chromosome, the position, the ancestral allele and the derived allele.
The fourth and fifth columns (allele coding) should be filled in, but the corresponding information is only used when activating the recode.allele option of the data2haplohh() function. In that case, for each SNP, the allele specified in the fourth (respectively fifth) column is recoded as 1 (respectively 2), and any other allele name is recoded as 0 (i.e., missing data). More importantly, the ancestral or derived status associated with this coding is only relevant for within-population tests (based on iHS). In other words, if one is only interested in across-population tests (based on Rsb or XP-EHH), the two SNP alleles may be assigned to the fourth and fifth columns at random.
As an illustration, the following R command displays the first six rows of the example file map.inp created when running the function make.example.files():
head(read.table("map.inp"))
> V1 V2 V3 V4 V5
> 1 F0100190 1 113642 T A
> 2 F0100220 1 244699 C G
> 3 F0100250 1 369419 G C
> 4 F0100270 1 447278 A T
> 5 F0100280 1 487654 T A
> 6 F0100290 1 524507 C G
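The recoding rule described above can be sketched in a few lines of base R. This is only an illustration on a made-up two-SNP data set, not the package's internal code; the names map, hap_raw and recode_alleles() are hypothetical.

```r
# Toy SNP information table (same five columns as map.inp; values are made up)
map <- data.frame(name = c("snpA", "snpB"), chr = c(1, 1),
                  pos = c(1000, 2000),
                  anc = c("T", "C"), der = c("A", "G"),
                  stringsAsFactors = FALSE)

# Toy haplotypes: one row per haplotype, one column per SNP
hap_raw <- matrix(c("T", "C",
                    "A", "G",
                    "N", "C"), ncol = 2, byrow = TRUE)

# Hypothetical helper: column-4 allele -> 1, column-5 allele -> 2, anything else -> 0
recode_alleles <- function(hap, anc, der) {
  out <- matrix(0L, nrow(hap), ncol(hap))
  for (j in seq_len(ncol(hap))) {
    out[hap[, j] == anc[j], j] <- 1L
    out[hap[, j] == der[j], j] <- 2L
  }
  out
}

hap_coded <- recode_alleles(hap_raw, map$anc, map$der)
print(hap_coded)
```

The unknown allele "N" in the third haplotype matches neither column and is therefore coded as missing (0).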
The data2haplohh() function converts data files into an R object of class haplohh that is subsequently used by the other functions of the package. Options are available to recode alleles or to select SNPs (based on Minor Allele Frequency or percentage of missing data) and haplotypes (based on percentage of missing data).
More details about the different arguments of the function are available in the documentation accessible using the command:
?data2haplohh
In the following, we detail three different examples based on the example data files provided with the package.
In this example, the example haplotype input file (standard format) and SNP information input file are converted into a haplohh object named hap. Because the map file contains information for SNPs mapping to chromosomes other than the one of interest (BTA12), we use the option chr.name=12. Allele recoding is activated (recode.allele=TRUE) to recode alleles into the (0, 1 or 2) format.
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the chr.name argument and more than one chromosome is detected in the map file, the function asks interactively which chromosome to choose:
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE)
> More than one chromosome name in Map file: map.inp
> Which chromosome should be considered among:
> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
> 1:
12
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Standard rehh input file assumed
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
Finally, as an example of an error message, the following message is displayed if the number of SNPs declared for the chromosome (for instance when a wrong chromosome is declared) does not match the number of SNPs in the haplotype file:
hap<-data2haplohh(hap_file="bta12_cgu.hap",map_file="map.inp",
recode.allele=TRUE,chr.name=18)
> Map file seems OK: 1123 SNPs declared for chromosome 18
> Standard rehh input file assumed
> The number of snp in the haplotypes 1424 is not equal
> to the number of snps declared in the map file 1123
> Error in data2haplohh(hap_file = "bta12_cgu.hap", map_file = "map.inp", : Conversion stopped
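One way to anticipate this kind of mismatch is to count the SNPs declared per chromosome in the SNP information file before conversion. The sketch below uses a small in-memory table with made-up values; with the real data, the table would come from read.table("map.inp") as shown earlier. The object names are illustrative only.

```r
# Toy SNP information table: name, chromosome, position, ancestral, derived
map <- data.frame(V1 = paste0("snp", 1:6),
                  V2 = c(12, 12, 12, 18, 18, 1),
                  V3 = c(100, 200, 300, 150, 250, 50),
                  V4 = "A", V5 = "G")

# Number of SNPs declared for each chromosome in the map file
snp_per_chr <- table(map$V2)
print(snp_per_chr)

# Number of SNPs in the haplotype file (here a made-up value)
n_snp_hap <- 3

# Chromosomes whose SNP count matches the haplotype file
candidates <- names(snp_per_chr)[as.vector(snp_per_chr) == n_snp_hap]
print(candidates)
```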
In this example, the example haplotype input file (transposed format) and SNP information input file are converted into a haplohh object named hap. Setting haplotype.in.columns=TRUE informs the function that the haplotype file is in transposed format:
hap<-data2haplohh(hap_file="bta12_cgu.thap",map_file="map.inp",haplotype.in.columns=TRUE,
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Haplotype are in columns with no header
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
In this example, the example fastPHASE output file and SNP information input file are converted into a haplohh object named hap. As explained above, we use the option chr.name=12. Because haplotypes originate from several populations (the fastPHASE -u option was used), we specify the population of interest (in our example the 280 haplotypes from the CGU population, see above) using the option popsel=7 (7 corresponding to the code of CGU in the example input files).
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,popsel=7,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
If no value is specified for the popsel argument and more than one population is detected in the fastPHASE output file, the function asks interactively which population to choose:
hap<-data2haplohh(hap_file="bta12_hapguess_switch.out",map_file="map.inp",
recode.allele=TRUE,chr.name=12)
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Chosen pop. is not in the list of pop. number: 1 2 3 4 5 6 7 8
> Which population should be considered among: 1 2 3 4 5 6 7 8
> 1:
7
> Map file seems OK: 1424 SNPs declared for chromosome 12
> Looks like a FastPHASE haplotype file
> Haplotypes originate from 8 different populations in the fastPhase output file
> Alleles are being recoded according to map file as:
> 0 (missing data), 1 (ancestral allele) or 2 (derived allele)
> Discard Haplotype with less than 100 % of genotyped SNPs
> No haplotype discarded
> Discard SNPs genotyped on less than 100 % of haplotypes
> No SNP discarded
> Data consists of 280 haplotypes and 1424 SNPs
where \(n_{a_s}\) represents the number of haplotypes carrying the core allele \(a_s\), \(K_{a_s,t}\) represents the number of distinct extended haplotypes (from SNP \(s\) to SNP \(t\)) carrying \(a_s\), and \(n_k\) is the number of copies of the extended haplotype \(k\) (i.e., \(n_{a_s}=\sum\limits_{k=1}^{K_{a_s,t}}n_k\)).
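Assuming the standard definition of Sabeti et al. (2002), these quantities enter the EHH estimator as \[\mathrm{EHH}_{a_s,t}=\frac{1}{n_{a_s}(n_{a_s}-1)}\sum\limits_{k=1}^{K_{a_s,t}}n_k(n_k-1)\] (the subscript notation on the left-hand side is an assumption made here). A minimal base R sketch of this computation on a made-up haplotype matrix follows; ehh_toy() is a hypothetical helper, not a function of the package.

```r
# Toy haplotype matrix: rows = haplotypes, columns = SNPs (1 = ancestral, 2 = derived)
hap <- matrix(c(1, 1, 1,
                1, 1, 1,
                1, 1, 2,
                1, 2, 2,
                2, 1, 1,
                2, 1, 1), ncol = 3, byrow = TRUE)

# Hypothetical helper: EHH of core allele `a` at focal SNP `s`, extended to SNP `t`
ehh_toy <- function(hap, s, t, a) {
  carriers <- hap[hap[, s] == a, , drop = FALSE]   # haplotypes carrying the core allele
  n_as <- nrow(carriers)
  if (n_as < 2) return(NA_real_)
  cols <- seq(min(s, t), max(s, t))
  # Label each extended haplotype by pasting its alleles from SNP s to SNP t
  labels <- apply(carriers[, cols, drop = FALSE], 1, paste, collapse = "")
  n_k <- table(labels)                             # copies of each distinct extended haplotype
  sum(n_k * (n_k - 1)) / (n_as * (n_as - 1))
}

# EHH of the ancestral allele at SNP 1, extended to SNPs 1, 2 and 3
ehh_anc <- sapply(1:3, function(t) ehh_toy(hap, s = 1, t = t, a = 1))
print(ehh_anc)
```

In this toy example the value is 1 at the focal SNP itself and decreases as the four carrier haplotypes split into distinct extended haplotypes.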
By definition, irrespective of the allele considered, EHH starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given core allele, the integrated EHH (iHH) is defined as the area under the EHH curve with respect to map position (Voight et al. 2006). In rehh, iHH is computed using the trapezoid method. In practice, the integral might only be computed for the regions of the curve above an arbitrarily small EHH value (e.g., EHH>0.05).
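The trapezoid integration with a lower cut-off can be illustrated as follows. This is a simplified sketch on made-up values for one side of a focal SNP; it assumes a monotone decay and simply drops the points at or below the cut-off, so the package's exact handling of the truncation boundary may differ. The helper name trapz_ehh() is hypothetical.

```r
# Toy EHH decay on one side of a focal SNP (values are made up)
pos <- c(0, 1000, 2500, 4000, 6000)      # map positions (bp); focal SNP at 0
ehh <- c(1, 0.8, 0.4, 0.1, 0.02)         # EHH at each position

# Hypothetical helper: trapezoid area under EHH, keeping only points with EHH > limit
trapz_ehh <- function(pos, ehh, limit = 0.05) {
  keep <- ehh > limit
  x <- pos[keep]
  y <- ehh[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

ihh_toy <- trapz_ehh(pos, ehh)
print(ihh_toy)
```

Because the integration is over map position, the result is expressed in the same units as the positions (here base pairs).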
where:
As for the EHH, EHHS starts at 1 and decays monotonically to 0 with increasing distance from the focal SNP. For a given focal SNP, and in a similar fashion as the iHH, iES is defined as the integrated EHHS (Tang et al. 2007). Depending on the EHHS estimator considered (respectively, \(\mathrm{EHHS}^{\text{Sab}}\) and \(\mathrm{EHHS}^{\text{Tang}}\)), two different iES estimators, further denoted \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) respectively, can be computed. As for iHH, the integral is computed using the trapezoid method and might only be computed for the regions of the curve above an arbitrarily small EHHS value (e.g., EHHS>0.05).
In the computation of both EHH and EHHS from a focal SNP \(s\) to a SNP \(t\), only extended haplotypes with no missing data are considered. As a consequence, the number of extended haplotypes retained to compute these two statistics might decrease with distance from the focal SNP. If the number of available extended haplotypes falls below a threshold (e.g., limhaplo=5), EHH and EHHS are not computed further. Note however that most phasing programs (such as fastPHASE (Scheet and Stephens 2006) or SHAPEIT (O’Connell et al. 2014)) can impute missing genotypes, resulting in phased haplotypes with no missing data.
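How missing data reduce the number of usable extended haplotypes can be seen on a small example. The sketch below counts, for increasing distance from a focal SNP, the haplotypes that are fully genotyped over the whole interval; the data are made up and n_complete() is a hypothetical helper, not part of the package.

```r
# Toy haplotypes coded 0 (missing), 1 (ancestral) or 2 (derived); values are made up
hap <- matrix(c(1, 1, 2, 1,
                1, 0, 2, 1,
                2, 1, 1, 0,
                2, 1, 1, 1,
                1, 2, 0, 2), ncol = 4, byrow = TRUE)

# Hypothetical helper: number of haplotypes fully genotyped from SNP s to SNP t
n_complete <- function(hap, s, t) {
  cols <- seq(min(s, t), max(s, t))
  sum(rowSums(hap[, cols, drop = FALSE] == 0) == 0)
}

# Usable extended haplotypes from focal SNP 1 to SNPs 1, 2, 3 and 4
counts <- sapply(1:4, function(t) n_complete(hap, s = 1, t = t))
print(counts)
```

With a threshold of three haplotypes, for instance, the computation in this toy example would not be carried on to the fourth SNP.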
The calc_ehh() function computes EHH for both the ancestral (\(a_s=1\)) and derived (\(a_s=2\)) alleles at the \(s^\text{th}\) SNP relative to each SNP (\(t\)) upstream and downstream, together with the corresponding iHH. The two options limehh and limhaplo specify the conditions to stop computing EHH. By default limehh=0.05 and limhaplo=2. Finally, if plotehh=TRUE, the decay of EHH for both the ancestral and derived alleles is plotted against SNP map position (main_leg allows changing the plot legend). More details are available in the R documentation by using the command:
?calc_ehh
In the following example, the EHH statistics are computed for both the ancestral and derived alleles of the \(456^{\text{th}}\) focal SNP. Note that the haplohh_cgu_bta12 object was generated using the data2haplohh() function with the example input files. For convenience, it is stored as an example object (accessible with the R function data()) as shown below:
#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHH statistics for the focal SNP (456th marker)
#which displays a strong signal of selection
res.ehh<-calc_ehh(haplohh_cgu_bta12,mrk=456)
The elements of the resulting object are as follows:
res.ehh$ehh[1:2,454:458]
> F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele 0.2764706 0.5529412 1 0.8879552 0.6422969
> Derived allele 1.0000000 1.0000000 1 1.0000000 1.0000000
res.ehh$nhaplo_eval[1:2,454:458]
> F1205380 F1205390 F1205400 F1205420 F1205440
> Ancestral allele 85 85 85 85 85
> Derived allele 195 195 195 195 195
res.ehh$freq_all1
> [1] 0.3035714
res.ehh$ihh
> Ancestral allele Derived allele
> 284633 2057152
In addition, as plotehh=TRUE by default, we obtain the following plot:
Graphical output for the calc_ehh() function
The calc_ehhs() function computes the EHHS (both the \(\mathrm{EHHS}^{\text{Sab}}\) and \(\mathrm{EHHS}^{\text{Tang}}\) estimators) at the \(s^\text{th}\) SNP relative to each SNP (\(t\)) upstream and downstream. This function also computes the corresponding iES (\(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) estimators respectively). The two options limehhs and limhaplo specify the conditions to stop computing EHHS. By default limehhs=0.05 and limhaplo=2. Finally, if plotehhs=TRUE, the decay of EHHS is plotted against SNP map position (main_leg allows changing the plot legend). More details are available in the R documentation by using the command:
?calc_ehhs
In the following example, the EHHS statistics are computed for the \(456^{\text{th}}\) focal SNP on the haplohh_cgu_bta12 object, which was generated using the data2haplohh() function with the example input files described above. For convenience, it is stored as an example object (accessible with the R function data()).
#example haplohh object (280 haplotypes, 1424 SNPs) see ?haplohh_cgu_bta12 for details
data(haplohh_cgu_bta12)
#computing EHHS statistics for the focal SNP (456th marker)
#which displays a strong signal of selection
res.ehhs<-calc_ehhs(haplohh_cgu_bta12,mrk=456)
The five different elements of the resulting object are as follows:
res.ehhs$EHHS_Sabeti_et_al_2007[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 0.5017153 0.5095238 0.5347926 1.0000000 0.5654122 0.5429595 0.5386841
res.ehhs$EHHS_Tang_et_al_2007[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 0.8715588 0.8851234 0.9290193 1.0000000 0.9822104 0.9432066 0.9357794
res.ehhs$nhaplo_eval[453:459]
> F1205370 F1205380 F1205390 F1205400 F1205420 F1205440 F1205450
> 280 280 280 280 280 280 280
res.ehhs$IES_Tang_et_al_2007
> [1] 1760565
res.ehhs$IES_Sabeti_et_al_2007
> [1] 964698
In addition, as plotehhs=TRUE by default, we obtain the following plot:
Graphical output for the calc_ehhs() function
The scan_hh() function efficiently computes iHH (for both the ancestral and derived alleles) and iES (both the \(\mathrm{iES}^{\text{Sab}}\) and \(\mathrm{iES}^{\text{Tang}}\) estimators) for all the SNPs in the haplohh object considered. The options limhaplo, limehh and limehhs specify the conditions to stop computing EHH and EHHS. By default limehh=limehhs=0.05 and limhaplo=2. Finally, the option threads, set by default to threads=1, specifies the number of threads available to parallelize computation (parallelization being carried out over SNPs). For instance, to scan the haplohh_cgu_bta12 object (containing data on 1424 SNPs for 280 haplotypes), one may use the following command:
data(haplohh_cgu_bta12)
res.scan<-scan_hh(haplohh_cgu_bta12)
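The resulting object is a data frame with nsnp rows (the number of SNPs declared in the haplohh object) and seven columns giving, for each SNP, its chromosome (CHR), its position (POSITION), the frequency of the ancestral allele (freq_A), the iHH of the ancestral (iHH_A) and derived (iHH_D) alleles, and the two iES estimators (iES_Tang_et_al_2007 and iES_Sabeti_et_al_2007).
As an example, the following R code provides the dimensions and the first six rows of the data frame obtained above: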
dim(res.scan)
> [1] 1424 7
head(res.scan)
> CHR POSITION freq_A iHH_A iHH_D iES_Tang_et_al_2007
> F1200140 12 79823 0.1500000 135102.2 68522.91 69776.85
> F1200150 12 125974 0.4071429 161680.3 107183.15 123607.13
> F1200170 12 175087 0.3571429 157333.1 155777.56 156021.90
> F1200180 12 219152 0.2214286 250037.4 159839.73 166214.75
> F1200190 12 256896 0.1750000 466071.8 173269.33 184453.42
> F1200210 12 316254 0.3892857 292077.5 228681.21 246572.65
> iES_Sabeti_et_al_2007
> F1200140 53669.39
> F1200150 76287.51
> F1200170 92770.96
> F1200180 110712.37
> F1200190 134092.34
> F1200210 130156.22
Note that running scan_hh() is more efficient than running calc_ehh() and calc_ehhs() in turn for each SNP, as shown in the example below:
system.time(res.scan<-scan_hh(haplohh_cgu_bta12))
> user system elapsed
> 0.260 0.000 0.257
#naive alternative: call calc_ehh() and calc_ehhs() for each SNP in turn
foo<-function(haplo){
  res.ihh=res.ies=matrix(0,haplo@nsnp,2)
  for(i in seq_along(haplo@position)){
    res.ihh[i,]=calc_ehh(haplo,mrk=i,plotehh=FALSE)$ihh
    tmp=calc_ehhs(haplo,mrk=i,plotehhs=FALSE)
    res.ies[i,1]=tmp$IES_Tang_et_al_2007
    res.ies[i,2]=tmp$IES_Sabeti_et_al_2007
  }
  list(res.ies=res.ies,res.ihh=res.ihh)
}
system.time(res.scan2<-foo(haplohh_cgu_bta12))
> user system elapsed
> 13.280 0.036 13.337
Note however that the same results are obtained (since the same options were used) as illustrated by the following R code:
sum(res.scan2$res.ihh[,1]!=res.scan[,4]) + sum(res.scan2$res.ihh[,2]!=res.scan[,5]) +
sum(res.scan2$res.ies[,1]!=res.scan[,6]) + sum(res.scan2$res.ies[,2]!=res.scan[,7])
> [1] 0
Let \(\mathrm{UniHS}\) represent the log-ratio of the iHH of a given SNP for its ancestral (\(\mathrm{iHH}_a\)) and derived (\(\mathrm{iHH}_d\)) alleles: \[\mathrm{UniHS}=\log\left(\frac{\mathrm{iHH}_a}{\mathrm{iHH}_d}\right)\] The iHS of a given focal SNP \(s\) (\(\mathrm{iHS}(s)\)) is then defined as its standardized \(\mathrm{UniHS}\) (\(\mathrm{UniHS}(s)\)) following (Voight et al. 2006): \[\mathrm{iHS}(s)=\frac{\mathrm{UniHS}(s) - \mu^{p_s}_\mathrm{UniHS}}{\sigma^{p_s}_\mathrm{UniHS}}\] where \(\mu^{p_s}_\mathrm{UniHS}\) and \(\sigma^{p_s}_\mathrm{UniHS}\) represent the average and standard deviation of the \(\mathrm{UniHS}\) computed over all the SNPs with a derived allele frequency \(p_s\) similar to that of the core SNP \(s\). In practice, the derived allele frequencies are generally binned so that each bin is large enough (e.g., >10 SNPs) to obtain reliable estimates of \(\mu^{p_s}_\mathrm{UniHS}\) and \(\sigma^{p_s}_\mathrm{UniHS}\).
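The per-bin standardization can be sketched in base R on simulated data. Everything below is illustrative: the simulated frequencies and UniHS values, the bin width of 0.05 and the object names are assumptions made for the example, and the code is not the package's implementation.

```r
set.seed(1)
n <- 2000
freq_d <- runif(n, 0.05, 0.95)                      # simulated derived allele frequencies
unihs  <- rnorm(n, mean = 1 - 2 * freq_d, sd = 0.5) # simulated UniHS with a frequency trend

# Bin SNPs by derived allele frequency (the bin width is an arbitrary choice here)
bins <- cut(freq_d, breaks = seq(0.05, 0.95, by = 0.05), include.lowest = TRUE)

# Standardize UniHS within each frequency bin
ihs <- (unihs - ave(unihs, bins, FUN = mean)) / ave(unihs, bins, FUN = sd)

# After standardization, every bin has mean 0 and standard deviation 1
print(round(tapply(ihs, bins, mean), 10))
print(round(tapply(ihs, bins, sd), 10))
```

Standardizing within bins removes the dependence of the raw log-ratio on allele frequency, which is what makes the resulting scores comparable across SNPs.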
Note that the iHS is constructed to have an approximately standard Gaussian distribution and to be comparable across SNPs regardless of their underlying allele frequencies. Hence, one may further transform iHS into \(p_\mathrm{iHS}\) (Gautier and Naves 2011): \[p_\mathrm{iHS}=-\log_{10}\left(1-2|\Phi\left(\mathrm{iHS}\right)-0.5|\right)\] where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical distribution is a fair approximation of the neutral distribution), \(p_\mathrm{iHS}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection.
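The transformation above maps directly onto the base R function pnorm(). The short sketch below applies it to three iHS values copied from the example output shown later in this vignette; the helper name p_from_z() is hypothetical.

```r
# Hypothetical helper implementing the transformation written above
p_from_z <- function(z) -log10(1 - 2 * abs(pnorm(z) - 0.5))

# iHS values taken from the example output shown later in this vignette
ihs <- c(-0.5582992, 0.2723337, 1.0618710)
p_ihs <- p_from_z(ihs)
print(round(p_ihs, 4))
```

The values obtained should closely match the -log10(p-value) column reported for the same SNPs in that output.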
The ihh2ihs() function computes iHS using a data frame of iHH statistics (for both the ancestral and derived alleles) in the same format as obtained from the scan_hh() function. The argument minmaf removes SNPs according to their MAF (by default, SNPs with a MAF<=0.05 are discarded from the standardization). The argument freqbin controls the size of the allele frequency bins used to perform standardization. More precisely, allele frequency bins vary from minmaf to 1-minmaf per step of size freqbin (by default freqbin=0.025). Note that if freqbin is set to 0 (e.g., with a large number of SNPs and few haplotypes), standardization is performed considering each observed frequency as a frequency class.
For instance, to perform a whole-genome scan, one might run scan_hh() in turn on haplotype data from each chromosome and concatenate the resulting data frames before standardization. In the following example, we assume that the haplotype files are named hap_chr_i.pop1, where the chromosome number \(i\) goes from 1 to 29, and that the SNP information file is named snp.info. The R code below then generates a data frame (wg.res) with iHH and iES estimates for all SNPs in an appropriate format to perform standardization with the ihh2ihs() function:
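To make the binning concrete, the default values quoted above (a MAF threshold of 0.05 and a bin width of 0.025) give the bin boundaries computed below. The variable names minmaf and freqbin are used here for illustration, and whether the package closes intervals on the left or on the right is not specified here, so the use of cut() is only an assumption.

```r
minmaf  <- 0.05    # default MAF threshold quoted in the text
freqbin <- 0.025   # default bin width quoted in the text

# Bin boundaries running from minmaf to 1 - minmaf in steps of freqbin
breaks <- seq(minmaf, 1 - minmaf, by = freqbin)
n_bins <- length(breaks) - 1
print(n_bins)

# Assign a few made-up derived allele frequencies to their bins
freq_d <- c(0.06, 0.30, 0.51, 0.94)
print(cut(freq_d, breaks = breaks, include.lowest = TRUE))
```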
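for(i in 1:29){
  #name of the haplotype file for chromosome i
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  #append the results of chromosome i to the whole-genome data frame
  if(i==1){wg.res<-res}else{wg.res<-rbind(wg.res,res)}
}
wg.ihs<-ihh2ihs(wg.res)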
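An alternative to growing the data frame inside the loop is to collect the per-chromosome results in a list and bind them in one step. The sketch below shows the pattern with a hypothetical stand-in function, fake_scan(), that returns a small made-up data frame so that the example runs without any input file; in practice the stand-in would be replaced by the conversion and scan steps above.

```r
# Hypothetical stand-in for a per-chromosome scan: returns a small data frame
# with the same first columns as the scan results shown above (values are made up)
fake_scan <- function(chr, n_snp = 3) {
  data.frame(CHR = chr,
             POSITION = seq_len(n_snp) * 1000,
             freq_A = seq(0.2, 0.8, length.out = n_snp))
}

# Build one data frame per chromosome, then bind them all at once
res_list <- lapply(1:29, fake_scan)
wg_res <- do.call(rbind, res_list)

print(dim(wg_res))
print(table(wg_res$CHR)[1:3])
```

Checking the number of rows per chromosome after concatenation is a cheap way to confirm that no chromosome was skipped.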
As an illustration, results of a similar genome scan (Gautier and Naves 2011) are provided as example data sets. The following R code computes the iHS for the CGU population:
data(wgscan.cgu)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
ihs.cgu<-ihh2ihs(wgscan.cgu)
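The resulting object is a list with two elements:
1. a data frame containing, for each SNP, the iHS and the corresponding \(p_\mathrm{iHS}\). For instance, the first six rows of this data frame are displayed below using the following R command: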
head(ihs.cgu$iHS)
> CHR POSITION iHS -log10(p-value)
> F0100190 1 113642 -0.5582992 0.2390952
> F0100220 1 244699 0.2723337 0.1049282
> F0100250 1 369419 0.4810736 0.2003396
> F0100270 1 447278 1.0618710 0.5401640
> F0100280 1 487654 0.8184060 0.3839181
> F0100290 1 524507 -0.3897024 0.1569189
2. a matrix summarizing the allele frequency bins used for the standardization. For instance, the first six rows of this matrix are displayed below using the following R command:
head(ihs.cgu$frequency.class)
> Number of SNPs mean of the log(iHHA/iHHD) ratio
> 0.05 - 0.075 1635 0.7286087
> 0.075 - 0.1 1316 0.5804760
> 0.1 - 0.125 1478 0.4710504
> 0.125 - 0.15 1593 0.3720585
> 0.15 - 0.175 1078 0.3263215
> 0.175 - 0.2 1325 0.2721166
> sd of the log(iHHA/iHHD) ratio
> 0.05 - 0.075 0.6457742
> 0.075 - 0.1 0.5556798
> 0.1 - 0.125 0.5079392
> 0.125 - 0.15 0.4708235
> 0.15 - 0.175 0.4524270
> 0.175 - 0.2 0.4533404
The ihsplot() function draws a Manhattan plot of the whole-genome scan results as stored in the list object produced by the ihh2ihs() function. Various options are available to modify the graphical display (see ?ihsplot).
ihsplot(ihs.cgu,plot.pval=TRUE,ylim.scan=2,main="iHS (CGU cattle breed)")
Graphical output for the ihsplot() function
For a given SNP \(s\), let \[\mathrm{LRiES}(s)^{\text{Tang}}=\log\left(\frac{\mathrm{iES}_\text{pop1}(s)^{\text{Tang}}}{\mathrm{iES}_\text{pop2}(s)^{\text{Tang}}}\right)\] represent the log-ratio of the \(\mathrm{iES}_\text{pop1}(s)^{\text{Tang}}\) and \(\mathrm{iES}_\text{pop2}(s)^{\text{Tang}}\) computed in the pop1 and pop2 populations.
The Rsb for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}(s)^{\text{Tang}}\) (Tang et al. 2007):
\begin{equation} \mathrm{rSB}(s)=\frac{\mathrm{LRiES}(s)^{\text{Tang}} - \text{med}_{\mathrm{LRiES}^{\text{Tang}}}}{\sigma_{\mathrm{LRiES}^{\text{Tang}}}} \end{equation} where \(\text{med}_{\mathrm{LRiES}^{\text{Tang}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Tang}}}\) represent the median and standard deviation of the \(\mathrm{LRiES}(s)^{\text{Tang}}\) computed over all the analyzed SNPs. Note that the median is used instead of the mean because it is less sensitive to extreme data points (Tang et al. 2007). More importantly, the information about the ancestral and derived status of alleles at the focal SNP is not needed.
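The standardization above amounts to three base R operations. The sketch below applies them to simulated iES values for two populations; the simulated data and the object names are assumptions made for illustration, and the code is not the package's implementation.

```r
set.seed(42)
n <- 1000
ies_pop1 <- rlnorm(n, meanlog = 12, sdlog = 0.5)   # simulated iES in pop1
ies_pop2 <- rlnorm(n, meanlog = 12, sdlog = 0.5)   # simulated iES in pop2

# Log-ratio of the two iES, then median-centred and scaled by the standard deviation
lries <- log(ies_pop1 / ies_pop2)
rsb <- (lries - median(lries)) / sd(lries)

print(summary(rsb))
```

Swapping the two populations only changes the sign of the log-ratio, so the choice of pop1 and pop2 determines which direction positive scores point to.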
As for the iHS, Rsb is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{rSB}\): \begin{equation} p_\mathrm{rSB}=-\log_{10}\left(1-2|\Phi\left(\mathrm{rSB}\right)-0.5|\right) \end{equation} where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical distribution is a fair approximation of the corresponding neutral distribution), \(p_\mathrm{rSB}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection. Alternatively, a one-sided \(p^{\prime}_\mathrm{rSB}\) might also be computed (Gautier and Naves 2011): \begin{equation} p^{\prime}_\mathrm{rSB}=-\log_{10}\left(|\Phi\left(\mathrm{rSB}\right)|\right) \end{equation} \(p^{\prime}_\mathrm{rSB}\) might then be interpreted as a one-sided P-value (on a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population pop2 (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relative to the pop1 reference population.
The ies2rsb() function computes Rsb using two data frames containing the iES statistics for each of the two populations considered, in the same format as the one obtained after running the scan_hh() function. For instance, to perform a genome scan, one might first run scan_hh() for each population in turn on haplotype data from each chromosome and concatenate the resulting data frames. In the following example, we assume that the haplotype files are named hap_chr_i.pop1 and hap_chr_i.pop2, where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin, and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results from all SNPs in the appropriate format to compute Rsb with the ies2rsb() function:
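The difference between the two-sided and one-sided transformations can be checked numerically with pnorm(). In the sketch below the helper names are hypothetical; the first two scores are copied from the example output shown further down and the third is made up.

```r
# Two-sided and one-sided transformations as written above (hypothetical helper names)
p_two_sided <- function(z) -log10(1 - 2 * abs(pnorm(z) - 0.5))
p_one_sided <- function(z) -log10(pnorm(z))

# First two values come from the example output shown further down; the third is made up
rsb <- c(-1.8191608, -0.3398574, 1.5)
print(round(p_two_sided(rsb), 4))
print(round(p_one_sided(rsb), 4))
```

For negative scores the one-sided value exceeds the two-sided one by exactly \(\log_{10}(2)\), whereas large positive scores give a one-sided value close to 0, so only sites with higher haplotype homozygosity in pop2 stand out.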
for(i in 1:29){
  #scan of chromosome i in population 1
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
  #scan of chromosome i in population 2
  hap_file=paste("hap_chr_",i,".pop2",sep="")
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.rsb<-ies2rsb(wg.res.pop1,wg.res.pop2)
As an illustration, one may consider results from a similar genome scan (Gautier and Naves 2011) provided as example data sets and compute, for each SNP, the Rsb between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.rsb<-ies2rsb(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object is a data frame containing the Rsb of each SNP (and the corresponding P-values, assuming Rsb values are normally distributed under the neutral hypothesis). Note that either bilateral (default) or unilateral tests might be performed (method argument). The first six rows of the data frame are displayed below using the following R command:
head(cguVSeut.rsb)
> CHR POSITION Rsb (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190 1 113642 -0.3398574 0.13432529
> F0100220 1 244699 -1.0566283 0.53658299
> F0100250 1 369419 -0.1468326 0.05390941
> F0100270 1 447278 -1.8191608 1.16186336
> F0100280 1 487654 -0.2193069 0.08280392
> F0100290 1 524507 -0.7941300 0.36945032
The rsbplot() function draws a Manhattan plot of the whole-genome scan results as stored in the data frame produced by the ies2rsb() function. Various options are available to modify the graphical display (see ?rsbplot). As an example, the figure below provides the output of the rsbplot() function for the Rsb computed above across the CGU and EUT populations. The figure was drawn using the following R code:
rsbplot(cguVSeut.rsb,plot.pval=TRUE)
Graphical output for the rsbplot() function
The XP-EHH statistic (Sabeti et al. 2007) is similar to the Rsb except that it is based on the \(\mathrm{iES}^{\text{Sab}}\) (instead of the \(\mathrm{iES}^{\text{Tang}}\)) estimator of the iES. Hence, for a given SNP \(s\), let \[\mathrm{LRiES}(s)^{\text{Sab}}=\log\left(\frac{\mathrm{iES}_\text{pop1}(s)^{\text{Sab}}}{\mathrm{iES}_\text{pop2}(s)^{\text{Sab}}}\right)\] represent the log-ratio of the \(\mathrm{iES}_\text{pop1}(s)^{\text{Sab}}\) and \(\mathrm{iES}_\text{pop2}(s)^{\text{Sab}}\) computed in the pop1 and pop2 populations.
The XP-EHH for a given focal SNP is then defined as the standardized \(\mathrm{LRiES}(s)^{\text{Sab}}\) (Sabeti et al. 2007):
\begin{equation} \mathrm{xpEHH}(s)=\frac{\mathrm{LRiES}(s)^{\text{Sab}} - \text{med}_{\mathrm{LRiES}^{\text{Sab}}}}{\sigma_{\mathrm{LRiES}^{\text{Sab}}}} \end{equation} where \(\text{med}_{\mathrm{LRiES}^{\text{Sab}}}\) and \(\sigma_{\mathrm{LRiES}^{\text{Sab}}}\) represent the median and standard deviation of the \(\mathrm{LRiES}(s)^{\text{Sab}}\) computed over all the analyzed SNPs. More importantly, the information about the ancestral and derived status of alleles at the focal SNP is not needed.
As for the iHS and Rsb, XP-EHH is constructed to have an approximately standard Gaussian distribution and may further be transformed into \(p_\mathrm{xpEHH}\): \begin{equation} p_\mathrm{xpEHH}=-\log_{10}\left(1-2|\Phi\left(\mathrm{xpEHH}\right)-0.5|\right) \end{equation} where \(\Phi\left(x\right)\) represents the Gaussian cumulative distribution function. Assuming most of the genotyped SNPs behave neutrally (i.e., the genome-wide empirical distribution is a fair approximation of the corresponding neutral distribution), \(p_\mathrm{xpEHH}\) might thus be interpreted as a two-sided P-value (on a \(-\log_{10}\) scale) associated with the neutral hypothesis of no selection. Alternatively, a one-sided \(p^{\prime}_\mathrm{xpEHH}\) might also be computed (Gautier and Naves 2011): \begin{equation} p^{\prime}_\mathrm{xpEHH}=-\log_{10}\left(|\Phi\left(\mathrm{xpEHH}\right)|\right) \end{equation} \(p^{\prime}_\mathrm{xpEHH}\) might then be interpreted as a one-sided P-value (on a \(-\log_{10}\) scale) allowing the identification of those sites displaying a significantly high extended haplotype homozygosity in population pop2 (represented in the denominator of the corresponding \(\mathrm{LRiES}\)) relative to the pop1 reference population.
The ies2xpehh() function computes XP-EHH using two data frames containing the iES statistics for each of the two populations considered, in the same format as the one obtained after running the scan_hh() function. For instance, to perform a genome scan, one might first run scan_hh() for each population in turn on haplotype data from each chromosome and concatenate the resulting data frames. In the following example, we assume that the haplotype files are named hap_chr_i.pop1 and hap_chr_i.pop2, where \(i\) is the chromosome number (going from 1 to 29), the suffixes pop1 and pop2 indicate the population of origin, and the SNP information file is named snp.info. The R code below then generates two data frames (wg.res.pop1 and wg.res.pop2) containing the results from all SNPs in the appropriate format to compute XP-EHH with the ies2xpehh() function:
for(i in 1:29){
  #scan of chromosome i in population 1
  hap_file=paste("hap_chr_",i,".pop1",sep="")
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop1<-res}else{wg.res.pop1<-rbind(wg.res.pop1,res)}
  #scan of chromosome i in population 2
  hap_file=paste("hap_chr_",i,".pop2",sep="")
  data<-data2haplohh(hap_file=hap_file,map_file="snp.info",chr.name=i)
  res<-scan_hh(data)
  if(i==1){wg.res.pop2<-res}else{wg.res.pop2<-rbind(wg.res.pop2,res)}
}
wg.xpehh<-ies2xpehh(wg.res.pop1,wg.res.pop2)
As an illustration, one may consider results from a similar genome scan (Gautier and Naves 2011) provided as example data sets and compute, for each SNP, the XP-EHH between the CGU and EUT populations as follows:
data(wgscan.cgu) ; data(wgscan.eut)
## results from a genome scan (44,057 SNPs) see ?wgscan.eut and ?wgscan.cgu for details
cguVSeut.xpehh<-ies2xpehh(wgscan.cgu,wgscan.eut,"CGU","EUT")
The resulting object is a data frame containing the XP-EHH of each SNP (and the corresponding P-values, assuming XP-EHH values are normally distributed under the neutral hypothesis). Note that either bilateral (default) or unilateral tests might be performed (method argument). The first six rows of this data frame are displayed below using the following R command:
head(cguVSeut.xpehh)
> CHR POSITION XPEHH (CGU vs. EUT) -log10(p-value) [bilateral]
> F0100190 1 113642 -0.5555841 0.2377002
> F0100220 1 244699 -0.7516166 0.3445910
> F0100250 1 369419 -0.8885736 0.4268588
> F0100270 1 447278 -0.3470522 0.1375394
> F0100280 1 487654 -0.9182772 0.4455426
> F0100290 1 524507 -0.7521031 0.3448721
The xpehhplot() function draws a Manhattan plot of the whole-genome scan results as stored in the data frame produced by the ies2xpehh() function. Various options are available to modify the graphical display (see ?xpehhplot). As an example, the figure below provides the output of the xpehhplot() function for the XP-EHH computed above across the CGU and EUT populations. The figure was drawn using the following R code:
xpehhplot(cguVSeut.xpehh,plot.pval=TRUE)
Graphical output for the xpehhplot() function
A plot of the XP-EHH against the Rsb estimates across the CGU and EUT populations is represented in the figure below. This figure was generated using the following R code:
plot(cguVSeut.rsb[,3],cguVSeut.xpehh[,3],xlab="Rsb",ylab="XPEHH",pch=16,
cex=0.5,cex.lab=0.75,cex.axis=0.75)
abline(a=0,b=1,lty=2)
Comparison of the XPEHH and Rsb estimates across the CGU and EUT populations
The distribplot() function makes it easy to visualize the distribution of the standardized scores (either iHS, Rsb or XP-EHH) and to compare it to the standard Gaussian distribution. As an example, the figure below provides the output of the distribplot() function when considering the iHS estimates obtained for the CGU population, using the following R code:
distribplot(ihs.cgu$iHS[,3],xlab="iHS")
Graphical output for the distribplot() function
The bifurcation.diagram() function draws haplotype bifurcation diagrams (Sabeti et al. 2002) that help to better understand the origin of an observed footprint of selection. Such diagrams consist in plotting the breakdown of LD at increasing distances from the core allele at the selected focal SNP. The root (focal SNP) of each diagram is the core allele and is here identified by a vertical dashed line. The diagram is bi-directional, portraying both centromere-proximal and centromere-distal LD. Moving in one direction, each marker is an opportunity for a node; the diagram either divides or not based on whether both or only one allele is present. Thus the breakdown of LD on the core haplotype background is portrayed at progressively longer distances. The thickness of the lines corresponds to the number of samples with the indicated long-distance haplotype. Several options are available to modify the aspect of the plots (see command ?bifurcation.diagram). As an illustration, the figure below shows the bifurcation diagrams for both the derived and ancestral alleles at the \(456^{\text{th}}\) SNP on BTA12 CGU haplotypes. This SNP displayed a strong signal of selection (using both iHS and Rsb statistics) and is located close (<5 kb) to a strong candidate gene involved in horn development (Gautier and Naves 2011). The figure was obtained with the following R code:
data(haplohh_cgu_bta12)
layout(matrix(1:2,2,1))
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc=456,all_foc=1,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Ancestral Allele")
bifurcation.diagram(haplohh_cgu_bta12,mrk_foc=456,all_foc=2,nmrk_l=20,nmrk_r=20,
main="Bifurcation diagram (RXFP2 SNP on BTA12): Derived Allele")
Graphical output for the bifurcation.diagram() function
Gautier M., Naves M., 2011 Footprints of selection in the ancestral admixture of a New World Creole cattle breed. Mol Ecol 20: 3128–3143.
O’Connell J., Gurdasani D., Delaneau O., Pirastu N., Ulivi S., et al., 2014 A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet 10: e1004234.
Sabeti P. C., Reich D. E., Higgins J. M., Levine H. Z. P., Richter D. J., et al., 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832–837.
Sabeti P. C., Varilly P., Fry B., Lohmueller J., Hostetter E., et al., 2007 Genome-wide detection and characterization of positive selection in human populations. Nature 449: 913–918.
Scheet P., Stephens M., 2006 A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644.
Tang K., Thornton K. R., Stoneking M., 2007 A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol 5: e171.
Voight B. F., Kudaravalli S., Wen X., Pritchard J. K., 2006 A map of recent positive selection in the human genome. PLoS Biol 4: e72.