The zalpha package contains statistics for identifying areas of the genome that have undergone a selective sweep. The idea behind these statistics is to find areas of the genome that are highly correlated, as this can be a sign that a sweep has occurred recently in the vicinity. For more information on the statistics, please see the paper by Jacobs et al. (2016) referenced below.
The data used in this vignette is a very small, simple dataset containing 20 SNPs in a sample of 10 chromosomes. A realistic dataset would be much bigger. It is highly recommended to use only SNPs with a minor allele frequency above 5%, as it is hard to detect correlations between rare alleles.
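The frequency filter recommended above can be sketched in base R. This is only an illustration on a hypothetical toy matrix (the names alleles, minor_allele_freq and filtered are not part of the zalpha package), assuming one row per SNP and one column per chromosome:

```r
# Hypothetical toy data: 4 SNPs (rows) typed on 40 chromosomes (columns).
alleles <- rbind(
  snp_a = rep(c(0, 1), times = c(20, 20)),  # minor allele frequency 0.5
  snp_b = rep(c(0, 1), times = c(39, 1)),   # minor allele frequency 0.025
  snp_c = rep(c(0, 1), times = c(30, 10)),  # minor allele frequency 0.25
  snp_d = rep(0, 40)                        # monomorphic
)

# Minor allele frequency of one SNP; works for any biallelic coding
# (0/1, A/G, ...). A monomorphic SNP is given a frequency of 0.
minor_allele_freq <- function(x) {
  counts <- table(x)
  if (length(counts) < 2) return(0)
  min(counts) / sum(counts)
}

maf <- apply(alleles, 1, minor_allele_freq)

# Keep only SNPs with a minor allele frequency above 5%.
filtered <- alleles[maf > 0.05, , drop = FALSE]
rownames(filtered)
```

In this toy example only snp_a and snp_c are kept. The positions and genetic distances would need to be subset with the same logical vector.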
The dataset “snps” is included with this package and can be loaded using the code:
library(zalpha)
data(snps)
## This is what the dataset looks like:
snps
#> positions distances chrom_1 chrom_2 chrom_3 chrom_4 chrom_5 chrom_6 chrom_7
#> 1 100 0.0001 1 0 0 1 1 1 0
#> 2 200 0.0002 1 1 1 0 0 0 1
#> 3 300 0.0005 1 1 1 1 1 1 1
#> 4 400 0.0008 0 0 1 1 1 0 1
#> 5 500 0.0010 1 1 1 1 0 1 0
#> 6 600 0.0015 1 0 0 1 1 0 1
#> 7 700 0.0017 1 0 0 1 1 1 1
#> 8 800 0.0019 0 0 0 0 1 1 1
#> 9 900 0.0023 1 1 0 0 1 1 0
#> 10 1000 0.0024 0 0 1 1 0 0 1
#> 11 1100 0.0026 1 0 1 0 0 0 0
#> 12 1200 0.0027 1 0 1 0 1 0 0
#> 13 1300 0.0032 1 1 1 1 0 0 1
#> 14 1400 0.0037 1 0 1 0 1 0 0
#> 15 1500 0.0039 0 0 1 0 1 0 0
#> 16 1600 0.0040 1 0 0 0 0 1 1
#> 17 1700 0.0041 0 0 0 1 0 1 1
#> 18 1800 0.0045 1 0 1 0 0 0 1
#> 19 1900 0.0048 1 0 0 1 1 1 0
#> 20 2000 0.0049 0 1 1 0 0 1 1
#> chrom_8 chrom_9 chrom_10
#> 1 0 0 0
#> 2 0 1 0
#> 3 1 0 1
#> 4 1 0 0
#> 5 0 1 0
#> 6 0 0 1
#> 7 1 1 0
#> 8 1 0 1
#> 9 0 1 1
#> 10 1 1 1
#> 11 0 1 0
#> 12 1 1 1
#> 13 1 1 0
#> 14 1 0 1
#> 15 1 1 0
#> 16 1 0 1
#> 17 1 0 1
#> 18 1 1 0
#> 19 0 0 1
#> 20 0 0 1
This dataset contains information about each of the SNPs. The first column gives the physical location of the SNP along the chromosome, in whatever units are useful to the user (usually bp or Kb). In this example, the positions are assumed to be in base pairs (bp).
The next column is the genetic distance of the SNP from the start of the chromosome. This could be in centimorgans (cM), linkage disequilibrium units (LDU) or any other measure of genetic distance, as long as it is additive (i.e. the distance between SNP A and SNP C is equal to the distance between SNP A and SNP B plus the distance between SNP B and SNP C). This column is only required if the user is interested in adjusting for recombination and supplies an LDprofile.
The final columns are the SNP alleles for each of the chromosomes in the population. Each SNP must be biallelic, but the two alleles can be coded with any values, for example 0s and 1s, or A/G/C/Ts.
To test for selection, the user can use the Zalpha function. This function treats each SNP in the dataset in turn as the “target locus”, calculates the \(Z_{\alpha}\) value for it, then moves on to the next SNP. It works by calculating correlations between alleles on each side of the target locus and averaging them. To do this, the function needs three inputs:
A vector of the physical locations of each of the SNPs
The window size (ws), in the same units as the physical locations. This is set to 3000 bp for this small example, but realistically a window size of around 200 Kb is appropriate. The window is centred on the target locus, and considers SNPs that are within ws/2 to the left and ws/2 to the right of the target SNP.
A matrix of the SNP alleles across each chromosome in the sample. The number of rows should be equal to the number of SNPs, and the columns are each of the chromosomes.
results<-Zalpha(snps$positions,3000,as.matrix(snps[,3:12]))
results
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha
#> [1] NA NA NA NA 0.09893585 0.10944106
#> [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168 NA NA
#> [19] NA NA
plot(results$position,results$Zalpha)
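The output shows the position of each SNP and the \(Z_{\alpha}\) value calculated for it. The NAs are due to the parameters minRandL and minRL, which default to 4 and 25 respectively. minRandL specifies the minimum number of SNPs that must be present within the window on each side of the target SNP. minRL specifies the minimum value of the product of the number of SNPs to the left and the number of SNPs to the right. In this example, the first four and the last four SNPs have fewer than four SNPs on one side, so no value is calculated for them.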
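The pattern of NAs can be reproduced with a few lines of base R. This is a sketch of the thresholds as described here, not the package's own code; the variable names are illustrative and the treatment of the window edges is an assumption (it makes no difference for this dataset, as the window covers every SNP):

```r
# SNP positions from the example dataset (bp) and the window size used above.
positions <- seq(100, 2000, by = 100)
ws <- 3000
min_rand_l <- 4   # same value as the minRandL default
min_rl <- 25      # same value as the minRL default

# Count SNPs within ws/2 on each side of every target SNP (target excluded).
# Whether the window edges are inclusive is an assumption of this sketch.
n_left <- sapply(positions, function(p) sum(positions < p & positions >= p - ws / 2))
n_right <- sapply(positions, function(p) sum(positions > p & positions <= p + ws / 2))

# A target SNP gets a value only if both thresholds are met.
has_value <- n_left >= min_rand_l & n_right >= min_rand_l &
  n_left * n_right >= min_rl

# Positions for which no value would be calculated.
positions[!has_value]
#> [1]  100  200  300  400 1700 1800 1900 2000
```

These are the same eight positions that show NA in the Zalpha output.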
If the user is only interested in the output of Zalpha for a particular region of the chromosome, the “X” parameter can be set to the lower and upper bounds of that region.
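As an illustration, the call below restricts the output to target SNPs located between 500 and 1000 bp. It assumes the zalpha package is installed and that X is given as a vector of the form c(lower, upper); the bounds are arbitrary values chosen for this example:

```r
library(zalpha)
data(snps)

# Only report results for target SNPs between 500 and 1000 bp.
# The window around each target SNP is still 3000 bp wide.
region <- Zalpha(snps$positions, 3000, as.matrix(snps[, 3:12]), X = c(500, 1000))
region
```

SNPs outside the region can still contribute to the correlations, as long as they fall within the window of a target SNP inside the region.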
Using an LD profile allows the user to adjust for variable recombination rates along the chromosome. Here is the example LD profile provided with the zalpha R package:
LDprofile
#> bin rsq sd Beta_a Beta_b
#> 1 0.0000 0.12214545 0.2275545 0.2764342 1.8116558
#> 2 0.0001 0.09335562 0.2197797 0.2708984 2.0267642
#> 3 0.0002 0.12218997 0.2370013 0.2698494 1.6785461
#> 4 0.0003 0.09042594 0.2005840 0.2639042 2.2719719
#> 5 0.0004 0.09846861 0.2415291 0.2498579 1.5550545
#> 6 0.0005 0.05234892 0.1645840 0.2734530 3.5077770
#> 7 0.0006 0.09849803 0.2171552 0.2805981 2.0719602
#> 8 0.0007 0.09234729 0.2185910 0.2729321 1.8642312
#> 9 0.0008 0.05612510 0.1463234 0.3270302 4.5627156
#> 10 0.0009 0.05799673 0.1569512 0.2875387 3.8010222
#> 11 0.0010 0.06451333 0.1803580 0.2695071 2.7318059
#> 12 0.0011 0.07202593 0.1980737 0.2575224 2.2259595
#> 13 0.0012 0.10457653 0.2452326 0.2527777 1.5974515
#> 14 0.0013 0.05750545 0.1681630 0.2854224 3.3140481
#> 15 0.0014 0.09774452 0.2388637 0.1924497 0.6386457
#> 16 0.0015 0.06229074 0.1752179 0.1791416 0.6853134
#> 17 0.0016 0.08488753 0.1917311 0.3031776 2.7157927
#> 18 0.0017 0.08160874 0.2037504 0.2639457 2.1465406
#> 19 0.0018 0.08745139 0.2037958 0.2704832 2.2651764
#> 20 0.0019 0.07123330 0.1840094 0.2810918 2.8468435
#> 21 0.0020 0.09485261 0.2191376 0.2723791 1.9259026
#> 22 0.0021 0.05804671 0.1458408 0.3207629 4.4316165
#> 23 0.0022 0.06447594 0.1502141 0.3368500 4.2842122
#> 24 0.0023 0.08147056 0.2012367 0.2690378 2.2633659
#> 25 0.0024 0.09434107 0.2250389 0.2557325 1.8280552
#> 26 0.0025 0.07046571 0.1831713 0.2857604 2.8830237
#> 27 0.0026 0.08397797 0.1926024 0.2900191 2.4698181
#> 28 0.0027 0.05834056 0.1662758 0.3033592 3.2711990
#> 29 0.0028 0.06702507 0.1662532 0.2933794 3.2574873
#> 30 0.0029 0.05820796 0.1723149 0.2682267 2.8738054
#> 31 0.0030 0.09507890 0.2085120 0.2716050 2.1487500
#> 32 0.0031 0.04551497 0.1239696 0.3296394 5.8488772
#> 33 0.0032 0.04241460 0.1352978 0.3349360 5.5737334
#> 34 0.0033 0.10255730 0.2259408 0.2718446 1.9065634
#> 35 0.0034 0.05181171 0.1523572 0.2978946 4.0344464
#> 36 0.0035 0.06537539 0.1633711 0.2937305 3.4617464
#> 37 0.0036 0.07133690 0.1728133 0.2845848 3.0456476
#> 38 0.0037 0.09034151 0.2365844 0.2438361 1.6296772
#> 39 0.0038 0.06661469 0.1718022 0.2842834 3.1988517
#> 40 0.0039 0.11322964 0.2325874 0.2031107 0.6843425
#> 41 0.0040 0.10068963 0.2067405 0.2918418 2.1743057
#> 42 0.0041 0.09924437 0.2321804 0.2601480 1.7887389
#> 43 0.0042 0.07122869 0.1813322 0.2701038 2.7201233
#> 44 0.0043 0.11335340 0.2081721 0.2869211 2.0950029
#> 45 0.0044 0.11286222 0.2357003 0.2679826 1.6981865
#> 46 0.0045 0.05429260 0.1622758 0.2940277 3.7098450
#> 47 0.0046 0.07505181 0.1992234 0.2679613 2.3091843
#> 48 0.0047 0.08873456 0.1996523 0.2754293 2.3102141
#> 49 0.0048 0.06982331 0.1902250 0.2603511 2.4806350
#> 50 0.0049 0.07837201 0.1944029 0.2745399 2.5207091
The LD (linkage disequilibrium) profile contains data about the expected correlation between SNPs given the genetic distance between them. This could be generated using a simulated chromosome where the genetic distances are known, after which the statistics can be calculated. The columns are:
bin: the lower bound of the bin. In this example, row 1 covers any pair of SNPs that are at least 0 but less than 0.0001 centimorgans apart (or whichever measure of genetic distance the user is working with).
rsq: the expected \(r^2\) value for pairs of SNPs whose genetic distance falls within the bin.
sd: the standard deviation of \(r^2\) for the bin.
Beta_a: the first shape parameter of the Beta distribution fitted to this bin. The R function fitdist (from the fitdistrplus package) can be used to estimate the Beta parameters.
Beta_b: the second shape parameter of the Beta distribution.
For example, if we assume the bins are in centimorgans, and we know two SNPs are 0.00015 cM apart, the LDprofile tells us that the expected \(r^2\) value is 0.093 with a standard deviation of 0.22, and that \(r^2\) follows the distribution Beta(0.27, 2.03).
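The lookup in this example can be sketched in base R with findInterval, which returns the index of the last bin whose lower bound does not exceed the distance. Only the first five rows of the LD profile are copied here to keep the sketch short; ld_profile and genetic_distance are illustrative names:

```r
# First five rows of the LD profile shown above.
ld_profile <- data.frame(
  bin = c(0.0000, 0.0001, 0.0002, 0.0003, 0.0004),
  rsq = c(0.12214545, 0.09335562, 0.12218997, 0.09042594, 0.09846861),
  sd = c(0.2275545, 0.2197797, 0.2370013, 0.2005840, 0.2415291),
  Beta_a = c(0.2764342, 0.2708984, 0.2698494, 0.2639042, 0.2498579),
  Beta_b = c(1.8116558, 2.0267642, 1.6785461, 2.2719719, 1.5550545)
)

# Genetic distance between the two SNPs of interest (cM).
genetic_distance <- 0.00015

# Each bin is identified by its lower bound, so pick the last bin whose
# lower bound is less than or equal to the distance.
row <- findInterval(genetic_distance, ld_profile$bin)
ld_profile[row, ]
```

This selects row 2, the bin starting at 0.0001. Distances that are computed rather than typed in may land a hair's breadth on either side of a bin boundary because of floating point rounding, so a real implementation would probably need to guard against that.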
For real world data, Jacobs et al. (2016) recommend using distances up to 2 cM assigned to 20,000 bins.
The expected \(Z_{\alpha}\) value (denoted \(Z_{\alpha}^{E[r^2]}\)) can be calculated for a chromosome given an LD profile and the genetic distances between each SNP in the chromosome. Instead of calculating the \(r^2\) values between SNPs, the function works out the genetic distance between them, finds the bin in the LD profile that the genetic distance falls into, and reads out the expected \(r^2\) value. The function then calculates \(Z_{\alpha}\) as normal.
Zalpha_expected(snps$positions, 3000, snps$distances, LDprofile$bin, LDprofile$rsq)
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $Zalpha_expected
#> [1] NA NA NA NA 0.08633502 0.08169169
#> [7] 0.07981817 0.08104324 0.08134506 0.08015626 0.07984135 0.08082988
#> [13] 0.08263229 0.08040794 0.07822704 0.08289555 NA NA
#> [19] NA NA
Once \(Z_{\alpha}^{E[r^2]}\) has been calculated, it can be combined with the \(Z_{\alpha}\) results to adjust for recombination, for example by computing \(Z_{\alpha}\) - \(Z_{\alpha}^{E[r^2]}\) or \(Z_{\alpha}\)/\(Z_{\alpha}^{E[r^2]}\).
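As a sketch of these two adjustments, the values below are copied from the Zalpha and Zalpha_expected outputs shown earlier for the twelve SNPs that have a value (positions 500 to 1600). The data frame and its column names are illustrative only:

```r
# Positions of the SNPs that have non-NA results in the outputs above.
position <- seq(500, 1600, by = 100)

# Values copied from the Zalpha output.
zalpha <- c(0.09893585, 0.10944106, 0.11602964, 0.11215782,
            0.12419556, 0.12962040, 0.12583275, 0.11980489,
            0.10802033, 0.12064462, 0.10873917, 0.10267168)

# Values copied from the Zalpha_expected output.
zalpha_expected <- c(0.08633502, 0.08169169, 0.07981817, 0.08104324,
                     0.08134506, 0.08015626, 0.07984135, 0.08082988,
                     0.08263229, 0.08040794, 0.07822704, 0.08289555)

# Two simple ways of adjusting for recombination.
adjusted <- data.frame(
  position,
  difference = zalpha - zalpha_expected,
  ratio = zalpha / zalpha_expected
)
adjusted
```

In practice the two vectors would be taken directly from the function results rather than typed in.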
Other functions that take into account variable recombination rates are Zalpha_rsq_over_expected, Zalpha_log_rsq_over_expected, Zalpha_Zscore, and Zalpha_BetaCDF.
The Zbeta function works in the same way as the Zalpha function, but evaluates correlations between pairs of SNPs that lie on opposite sides of the target locus (one to the left and one to the right), rather than between SNPs on the same side. It is useful to use the \(Z_{\beta}\) statistic in conjunction with the \(Z_{\alpha}\) statistic, as they behave differently depending on how close to fixation the sweep is. For example, while a sweep is in progress both \(Z_{\alpha}\) and \(Z_{\beta}\) would be higher than in other areas of the chromosome without a sweep present. However, when a sweep reaches near-fixation, \(Z_{\beta}\) would decrease whereas \(Z_{\alpha}\) would remain high. Combining \(Z_{\alpha}\) and \(Z_{\beta}\) into new statistics such as \(Z_{\alpha}\)/\(Z_{\beta}\) is one way of analysing this.
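The difference between the two statistics can be sketched for a single target SNP in base R. This follows the definitions in Jacobs et al. (2016) as understood here: \(Z_{\alpha}\) averages \(r^2\) over pairs within the left set and within the right set, and \(Z_{\beta}\) averages \(r^2\) over pairs with one SNP in each set. It is not the package's implementation, it ignores minRandL and minRL, and the toy data and names are hypothetical:

```r
# Hypothetical toy data: 5 SNPs (rows) on 6 chromosomes (columns).
# SNP 3 is the target; SNPs 1-2 lie to its left, SNPs 4-5 to its right.
alleles <- rbind(
  c(0, 0, 0, 1, 1, 1),
  c(0, 0, 0, 1, 1, 1),
  c(0, 1, 1, 0, 0, 1),
  c(0, 1, 0, 1, 0, 1),
  c(1, 0, 1, 0, 1, 0)
)
left <- 1:2
right <- 4:5

# Squared Pearson correlation between every pair of SNPs.
rsq <- cor(t(alleles))^2

# Mean r^2 over all unordered pairs of distinct SNPs within one set.
mean_within <- function(idx) {
  block <- rsq[idx, idx, drop = FALSE]
  mean(block[upper.tri(block)])
}

# Z_alpha: average of the within-left and within-right mean r^2.
z_alpha <- (mean_within(left) + mean_within(right)) / 2

# Z_beta: mean r^2 over pairs with one SNP on each side of the target.
z_beta <- mean(rsq[left, right])

c(z_alpha = z_alpha, z_beta = z_beta)
```

Here the SNPs on each side are perfectly correlated with each other but only weakly correlated across the target, so z_alpha is 1 while z_beta is 1/9.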
Zalpha_all is the recommended function for using this package. It will run all the statistics included in the package (\(Z_{\alpha}\) and \(Z_{\beta}\) variations), so the user does not have to run multiple functions to obtain all the statistics they want. The function only calculates the statistics for which it has been given the appropriate inputs, so it is flexible.
For example, this code will only run Zalpha, Zbeta and the two diversity statistics LR and L_plus_R, as an LDprofile was not supplied:
Zalpha_all(snps$positions,3000,as.matrix(snps[,3:12]))
#> $position
#> [1] 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#>
#> $LR
#> [1] NA NA NA NA 60 70 78 84 88 90 90 88 84 78 70 60 NA NA NA NA
#>
#> $L_plus_R
#> [1] NA NA NA NA 111 101 93 87 83 81 81 83 87 93 101 111 NA NA NA
#> [20] NA
#>
#> $Zalpha
#> [1] NA NA NA NA 0.09893585 0.10944106
#> [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168 NA NA
#> [19] NA NA
#>
#> $Zbeta
#> [1] NA NA NA NA 0.1280042 0.1298619 0.1219965
#> [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328 NA NA NA NA
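The two diversity statistics in this output can be reproduced from the SNP counts alone. The sketch below assumes, based on the values shown, that LR is the product of the number of SNPs to the left and right of the target, and that L_plus_R is the number of SNP pairs within the left set plus the number within the right set:

```r
# SNP positions from the example dataset (bp) and the window size used above.
positions <- seq(100, 2000, by = 100)
ws <- 3000

# Number of SNPs within ws/2 to the left and right of each target SNP.
n_left <- sapply(positions, function(p) sum(positions < p & positions >= p - ws / 2))
n_right <- sapply(positions, function(p) sum(positions > p & positions <= p + ws / 2))

# Candidate definitions of the two diversity statistics (an assumption).
LR <- n_left * n_right
L_plus_R <- choose(n_left, 2) + choose(n_right, 2)

# Compare with the non-NA part of the output (SNPs 5 to 16).
rbind(LR = LR[5:16], L_plus_R = L_plus_R[5:16])
```

For SNPs 5 to 16 this gives the same numbers as the $LR and $L_plus_R elements above.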
Supplying an LDprofile will result in more of the statistics being calculated.
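A call of this kind might look like the sketch below. It assumes the zalpha package is installed, that the example LDprofile can be loaded with data(LDprofile), and that Zalpha_all accepts the genetic distances and the LD profile columns positionally after the allele matrix; the argument order should be checked against the function's help page:

```r
library(zalpha)
data(snps)
data(LDprofile)

# Assumed argument order: positions, window size, allele matrix,
# genetic distances, then the LD profile columns.
all_stats <- Zalpha_all(snps$positions, 3000, as.matrix(snps[, 3:12]),
                        snps$distances,
                        LDprofile$bin, LDprofile$rsq, LDprofile$sd,
                        LDprofile$Beta_a, LDprofile$Beta_b)

# List the statistics that were calculated.
names(all_stats)
```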
There are many ways that the resulting statistics can be combined to give new insights into the data; see Jacobs et al. (2016).
To find candidate regions for selection, first calculate the statistics across the chromosome, including any combined statistics that may be of interest. It is then suggested to find the maximum value for windows of around 200 Kb for each statistic (minimum values for the diversity statistics). Any regions which are outliers compared to the rest of the chromosome could be considered candidates and can be investigated further.
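The windowing step can be sketched in base R as follows. The positions and statistic values are hypothetical, and the outlier rule (above the 95th percentile of the window maxima) is just one possible choice made for this illustration, not a recommendation from the package:

```r
# Hypothetical statistic values at SNP positions (bp) along a chromosome.
position <- c(50000, 120000, 190000, 260000, 330000,
              410000, 480000, 550000, 620000, 790000)
zalpha <- c(0.10, 0.12, 0.11, 0.13, 0.09, 0.45, 0.41, 0.12, 0.10, 0.11)

# Assign each SNP to a non-overlapping 200 Kb window, labelled by its
# start position in Kb.
window_size <- 200000
window_start_kb <- floor(position / window_size) * 200

# Maximum value of the statistic in each window
# (use min instead of max for the diversity statistics).
window_max <- tapply(zalpha, window_start_kb, max, na.rm = TRUE)

# Flag windows whose maximum is an outlier; the 95th percentile
# threshold is an arbitrary choice for this sketch.
threshold <- quantile(window_max, 0.95)
candidates <- window_max[window_max > threshold]
candidates
```

With these toy values only the window starting at 400 Kb is flagged. On real data there would be many more windows, and the threshold would be chosen to suit the analysis.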
Jacobs, G.S., T.J. Sluckin, and T. Kivisild, Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps. Genetics, 2016. 203(4): p. 1807