zalpha

The zalpha package contains statistics for identifying regions of the genome that have undergone a selective sweep. These statistics look for regions where alleles at neighbouring SNPs are highly correlated (that is, in strong linkage disequilibrium), as this can be a sign that a sweep has occurred recently in the vicinity. For more information on the statistics, see Jacobs et al. (2016), referenced below.

Data

The data used in this vignette is a very small, simple dataset containing 20 SNPs and a population of 10 chromosomes. A realistic dataset would be much bigger. It is highly recommended to use only SNPs with a minor allele frequency above 5%, as it is hard to find correlations between rare alleles.
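
The minor allele frequency filter recommended above can be applied before running any of the statistics. The following is a minimal base R sketch on hypothetical toy data (not the "snps" dataset); the helper name minor_allele_freq is made up for illustration and is not part of the zalpha package.

```r
# Hypothetical toy data: 3 SNPs (rows) by 40 chromosomes (columns), coded 0/1.
x <- rbind(
  rep(c(0, 1), each = 20),   # 20 of 40 copies: MAF 0.5
  c(1, rep(0, 39)),          # 1 of 40 copies:  MAF 0.025
  c(rep(1, 4), rep(0, 36))   # 4 of 40 copies:  MAF 0.1
)

# Minor allele frequency of one biallelic SNP: frequency of the rarer allele.
# (Illustrative helper, not a zalpha function.)
minor_allele_freq <- function(alleles) {
  counts <- table(alleles)
  if (length(counts) < 2) return(0)   # monomorphic site: no minor allele
  min(counts) / sum(counts)
}

maf <- apply(x, 1, minor_allele_freq)

# Keep only SNPs with a minor allele frequency above 5%.
keep <- maf > 0.05
x_filtered <- x[keep, , drop = FALSE]
```

Any positions and genetic distances would need to be subset with the same keep vector so that they stay aligned with the rows of the allele matrix.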

The dataset “snps” is included with this package and can be loaded using the code:

library(zalpha)
data(snps)
## This is what the dataset looks like:
snps
#>    positions distances chrom_1 chrom_2 chrom_3 chrom_4 chrom_5 chrom_6 chrom_7
#> 1        100    0.0001       1       0       0       1       1       1       0
#> 2        200    0.0002       1       1       1       0       0       0       1
#> 3        300    0.0005       1       1       1       1       1       1       1
#> 4        400    0.0008       0       0       1       1       1       0       1
#> 5        500    0.0010       1       1       1       1       0       1       0
#> 6        600    0.0015       1       0       0       1       1       0       1
#> 7        700    0.0017       1       0       0       1       1       1       1
#> 8        800    0.0019       0       0       0       0       1       1       1
#> 9        900    0.0023       1       1       0       0       1       1       0
#> 10      1000    0.0024       0       0       1       1       0       0       1
#> 11      1100    0.0026       1       0       1       0       0       0       0
#> 12      1200    0.0027       1       0       1       0       1       0       0
#> 13      1300    0.0032       1       1       1       1       0       0       1
#> 14      1400    0.0037       1       0       1       0       1       0       0
#> 15      1500    0.0039       0       0       1       0       1       0       0
#> 16      1600    0.0040       1       0       0       0       0       1       1
#> 17      1700    0.0041       0       0       0       1       0       1       1
#> 18      1800    0.0045       1       0       1       0       0       0       1
#> 19      1900    0.0048       1       0       0       1       1       1       0
#> 20      2000    0.0049       0       1       1       0       0       1       1
#>    chrom_8 chrom_9 chrom_10
#> 1        0       0        0
#> 2        0       1        0
#> 3        1       0        1
#> 4        1       0        0
#> 5        0       1        0
#> 6        0       0        1
#> 7        1       1        0
#> 8        1       0        1
#> 9        0       1        1
#> 10       1       1        1
#> 11       0       1        0
#> 12       1       1        1
#> 13       1       1        0
#> 14       1       0        1
#> 15       1       1        0
#> 16       1       0        1
#> 17       1       0        1
#> 18       1       1        0
#> 19       0       0        1
#> 20       0       0        1

This dataset contains information about each of the SNPs. The first column gives the physical location of the SNP along the chromosome, in whatever units are useful to the user (usually bp or Kb). In this example, the positions are assumed to be in base pairs (bp).

The next column is the genetic distance of the SNP from the start of the chromosome. This could be in centimorgans (cM), linkage disequilibrium units (LDU) or any other measure of genetic distance, as long as it is additive (i.e. the distance between SNP A and SNP C is equal to the distance between SNP A and SNP B plus the distance between SNP B and SNP C). This column is only required if the user is interested in adjusting for recombination and supplies an LDprofile.

The final columns are the SNP alleles for each of the chromosomes in the population. Each SNP must be biallelic, but the two alleles can be coded with any values, for example 0s and 1s, or A/G/C/Ts.
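
Because each SNP must be biallelic, it can be worth checking the allele matrix before analysis. This is a small base R sketch on a hypothetical letter-coded matrix; the variable names are illustrative only.

```r
# Hypothetical allele matrix coded with nucleotide letters
# (3 SNPs in rows, 6 chromosomes in columns).
x <- rbind(
  c("A", "G", "A", "A", "G", "G"),
  c("C", "C", "T", "C", "T", "T"),
  c("A", "G", "T", "A", "G", "A")   # three distinct alleles: not biallelic
)

# Count the distinct allele values observed at each SNP.
n_alleles <- apply(x, 1, function(a) length(unique(a)))

# Keep only the SNPs with exactly two distinct alleles.
biallelic <- n_alleles == 2
x_ok <- x[biallelic, , drop = FALSE]
```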

Zalpha

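To test for selection, the user can use the Zalpha function. This function takes each SNP in the dataset in turn as the “target locus”, calculates the \(Z_{\alpha}\) value, then moves on to the next SNP. It works by calculating correlations between pairs of SNPs on the same side of the target locus (left and right separately) and averaging them. To do this, the function needs three inputs: the physical positions of the SNPs, the window size (here 3000, in the same units as the positions), and a matrix of SNP alleles with one row per SNP and one column per chromosome: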

# Arguments: SNP positions, window size, allele matrix (SNPs in rows)
results <- Zalpha(snps$positions, 3000, as.matrix(snps[, 3:12]))
results
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha
#>  [1]         NA         NA         NA         NA 0.09893585 0.10944106
#>  [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168         NA         NA
#> [19]         NA         NA
plot(results$position,results$Zalpha)

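The output shows the position of each SNP and the \(Z_{\alpha}\) value calculated for it. The NAs are due to the parameters minRandL and minRL, which default to 4 and 25 respectively. minRandL specifies the minimum number of SNPs that must be present on each side (left and right) of the target SNP within the window. minRL is the minimum value allowed for the product of the number of SNPs to the left and the number of SNPs to the right. Here the first four and last four SNPs have fewer than four SNPs on one side, so no value is returned for them.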
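
To make the calculation concrete, here is a minimal base R sketch of \(Z_{\alpha}\) for each target SNP. It is an illustration under stated assumptions rather than the package's own implementation: it assumes \(r^2\) is the squared Pearson correlation between 0/1-coded SNPs, that the window extends half the window size on each side of the target with inclusive bounds, and that the target itself is excluded. The function name zalpha_sketch and the toy data are hypothetical.

```r
# Hypothetical toy data: 12 SNPs (rows) by 40 chromosomes (columns), coded 0/1.
set.seed(1)
pos <- seq(100, 1200, by = 100)
x <- matrix(rbinom(12 * 40, 1, 0.5), nrow = 12)

# Illustrative sketch of Z_alpha at target SNP i (not the package code).
zalpha_sketch <- function(i, pos, ws, x, minRandL = 4, minRL = 25) {
  # SNPs to the left and right of the target, within half a window each side.
  left  <- which(pos >= pos[i] - ws / 2 & pos < pos[i])
  right <- which(pos > pos[i] & pos <= pos[i] + ws / 2)
  nL <- length(left)
  nR <- length(right)

  # Too few SNPs on either side: no value is returned.
  if (nL < minRandL || nR < minRandL || nL * nR < minRL) return(NA_real_)

  # Assumed definition: r^2 is the squared correlation between SNP rows.
  rsq <- cor(t(x))^2

  # Mean r^2 over all distinct pairs of SNPs within one side.
  mean_within <- function(idx) {
    m <- rsq[idx, idx]
    mean(m[upper.tri(m)])
  }

  # Average the left-side and right-side means.
  (mean_within(left) + mean_within(right)) / 2
}

z <- sapply(seq_along(pos), function(i) zalpha_sketch(i, pos, 3000, x))
```

With 12 SNPs, only targets 5 to 8 have at least four SNPs on both sides and a product of at least 25, so the remaining entries of z are NA.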

Say the user is only interested in the output of Zalpha for a particular region of the chromosome; this is achieved by setting the “X” parameter to the lower and upper bounds of the region.

Zalpha(snps$positions,3000,as.matrix(snps[,3:12]),X=c(500,1000))
#> $position
#> [1]  500  600  700  800  900 1000
#> 
#> $Zalpha
#> [1] 0.09893585 0.10944106 0.11602964 0.11215782 0.12419556 0.12962040

LD Profile

Using an LD profile allows the user to adjust for variable recombination rates along the chromosome. Here is the example LD profile provided with the zalpha R package:

LDprofile
#>       bin        rsq        sd    Beta_a    Beta_b
#> 1  0.0000 0.12214545 0.2275545 0.2764342 1.8116558
#> 2  0.0001 0.09335562 0.2197797 0.2708984 2.0267642
#> 3  0.0002 0.12218997 0.2370013 0.2698494 1.6785461
#> 4  0.0003 0.09042594 0.2005840 0.2639042 2.2719719
#> 5  0.0004 0.09846861 0.2415291 0.2498579 1.5550545
#> 6  0.0005 0.05234892 0.1645840 0.2734530 3.5077770
#> 7  0.0006 0.09849803 0.2171552 0.2805981 2.0719602
#> 8  0.0007 0.09234729 0.2185910 0.2729321 1.8642312
#> 9  0.0008 0.05612510 0.1463234 0.3270302 4.5627156
#> 10 0.0009 0.05799673 0.1569512 0.2875387 3.8010222
#> 11 0.0010 0.06451333 0.1803580 0.2695071 2.7318059
#> 12 0.0011 0.07202593 0.1980737 0.2575224 2.2259595
#> 13 0.0012 0.10457653 0.2452326 0.2527777 1.5974515
#> 14 0.0013 0.05750545 0.1681630 0.2854224 3.3140481
#> 15 0.0014 0.09774452 0.2388637 0.1924497 0.6386457
#> 16 0.0015 0.06229074 0.1752179 0.1791416 0.6853134
#> 17 0.0016 0.08488753 0.1917311 0.3031776 2.7157927
#> 18 0.0017 0.08160874 0.2037504 0.2639457 2.1465406
#> 19 0.0018 0.08745139 0.2037958 0.2704832 2.2651764
#> 20 0.0019 0.07123330 0.1840094 0.2810918 2.8468435
#> 21 0.0020 0.09485261 0.2191376 0.2723791 1.9259026
#> 22 0.0021 0.05804671 0.1458408 0.3207629 4.4316165
#> 23 0.0022 0.06447594 0.1502141 0.3368500 4.2842122
#> 24 0.0023 0.08147056 0.2012367 0.2690378 2.2633659
#> 25 0.0024 0.09434107 0.2250389 0.2557325 1.8280552
#> 26 0.0025 0.07046571 0.1831713 0.2857604 2.8830237
#> 27 0.0026 0.08397797 0.1926024 0.2900191 2.4698181
#> 28 0.0027 0.05834056 0.1662758 0.3033592 3.2711990
#> 29 0.0028 0.06702507 0.1662532 0.2933794 3.2574873
#> 30 0.0029 0.05820796 0.1723149 0.2682267 2.8738054
#> 31 0.0030 0.09507890 0.2085120 0.2716050 2.1487500
#> 32 0.0031 0.04551497 0.1239696 0.3296394 5.8488772
#> 33 0.0032 0.04241460 0.1352978 0.3349360 5.5737334
#> 34 0.0033 0.10255730 0.2259408 0.2718446 1.9065634
#> 35 0.0034 0.05181171 0.1523572 0.2978946 4.0344464
#> 36 0.0035 0.06537539 0.1633711 0.2937305 3.4617464
#> 37 0.0036 0.07133690 0.1728133 0.2845848 3.0456476
#> 38 0.0037 0.09034151 0.2365844 0.2438361 1.6296772
#> 39 0.0038 0.06661469 0.1718022 0.2842834 3.1988517
#> 40 0.0039 0.11322964 0.2325874 0.2031107 0.6843425
#> 41 0.0040 0.10068963 0.2067405 0.2918418 2.1743057
#> 42 0.0041 0.09924437 0.2321804 0.2601480 1.7887389
#> 43 0.0042 0.07122869 0.1813322 0.2701038 2.7201233
#> 44 0.0043 0.11335340 0.2081721 0.2869211 2.0950029
#> 45 0.0044 0.11286222 0.2357003 0.2679826 1.6981865
#> 46 0.0045 0.05429260 0.1622758 0.2940277 3.7098450
#> 47 0.0046 0.07505181 0.1992234 0.2679613 2.3091843
#> 48 0.0047 0.08873456 0.1996523 0.2754293 2.3102141
#> 49 0.0048 0.06982331 0.1902250 0.2603511 2.4806350
#> 50 0.0049 0.07837201 0.1944029 0.2745399 2.5207091

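The LD (linkage disequilibrium) profile contains data about the expected correlation between SNPs given the genetic distance between them. It could be generated from a simulated chromosome where the genetic distances are known, from which the statistics can then be calculated. The columns are: bin, the lower bound of each genetic distance bin (here the bins are 0.0001 wide); rsq, the expected \(r^2\) value for pairs of SNPs whose genetic distance falls in the bin; sd, the standard deviation of \(r^2\) in the bin; and Beta_a and Beta_b, the two shape parameters of the Beta distribution fitted to the \(r^2\) values in the bin.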

For example, if we assume the bins are in centimorgans and we know two SNPs are 0.00015 cM apart, the LDprofile tells us that we expect the \(r^2\) value to be 0.093, with a standard deviation of 0.22, and that it follows the distribution Beta(0.27, 2.03).
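
The lookup in this example can be sketched in base R. The snippet copies the first four rows of the LD profile shown above into a small data frame (named profile here purely for illustration) and uses findInterval to locate the bin whose lower bound sits at or below the distance.

```r
# First four rows of the example LD profile, copied from the table above.
profile <- data.frame(
  bin    = c(0.0000, 0.0001, 0.0002, 0.0003),
  rsq    = c(0.12214545, 0.09335562, 0.12218997, 0.09042594),
  sd     = c(0.2275545, 0.2197797, 0.2370013, 0.2005840),
  Beta_a = c(0.2764342, 0.2708984, 0.2698494, 0.2639042),
  Beta_b = c(1.8116558, 2.0267642, 1.6785461, 2.2719719)
)

# Genetic distance between the two SNPs (assumed to be in cM).
d <- 0.00015

# findInterval returns the index of the last bin lower bound <= d.
row <- findInterval(d, profile$bin)
expected <- profile[row, ]
expected
```

Distances that fall exactly on a bin boundary can be affected by floating point rounding, so a real implementation may need to round distances before binning.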

For real world data, Jacobs et al. (2016) recommend using distances up to 2 cM assigned to 20,000 bins.
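
Distances up to 2 cM split into 20,000 bins correspond to a bin width of 0.0001 cM. The following base R sketch shows one way the bin, rsq and sd columns of a profile could be assembled from data with known genetic distances. It uses randomly generated (hypothetical) data, assumes \(r^2\) is the squared Pearson correlation, and omits the Beta distribution fit; it is not the procedure used to create the example LDprofile.

```r
# Hypothetical simulated data: 60 SNPs by 50 chromosomes, with known
# genetic positions (assumed to be in cM).
set.seed(42)
n_snps <- 60
n_chrom <- 50
dist <- sort(runif(n_snps, 0, 0.005))
x <- matrix(rbinom(n_snps * n_chrom, 1, 0.5), nrow = n_snps)

# 2 cM divided into 20,000 bins gives a bin width of 0.0001 cM.
bin_size <- 2 / 20000

# Pairwise r^2 and pairwise genetic distances between SNPs.
rsq <- cor(t(x))^2
d <- abs(outer(dist, dist, "-"))

# Use each pair of SNPs once (upper triangle only).
pairs <- upper.tri(rsq)
bin_index <- floor(d[pairs] / bin_size)

# Mean and standard deviation of r^2 within each distance bin.
rsq_mean <- tapply(rsq[pairs], bin_index, mean)
rsq_sd <- tapply(rsq[pairs], bin_index, sd)

profile <- data.frame(
  bin = as.numeric(names(rsq_mean)) * bin_size,   # lower bound of each bin
  rsq = as.numeric(rsq_mean),
  sd  = as.numeric(rsq_sd)
)
head(profile)
```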

Zalpha_expected

The expected \(Z_{\alpha}\) value (denoted \(Z_{\alpha}^{E[r^2]}\)) can be calculated for a chromosome given an LD profile and the genetic distances between each SNP in the chromosome. Instead of calculating the \(r^2\) values between SNPs, the function works out the genetic distance between them, finds the bin in the LD profile that the genetic distance falls into, and reads out the expected \(r^2\) value. The function then calculates \(Z_{\alpha}\) as normal.

Zalpha_expected(snps$positions, 3000, snps$distances, LDprofile$bin, LDprofile$rsq)
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $Zalpha_expected
#>  [1]         NA         NA         NA         NA 0.08633502 0.08169169
#>  [7] 0.07981817 0.08104324 0.08134506 0.08015626 0.07984135 0.08082988
#> [13] 0.08263229 0.08040794 0.07822704 0.08289555         NA         NA
#> [19]         NA         NA

Once \(Z_{\alpha}^{E[r^2]}\) has been calculated, it can be combined with the \(Z_{\alpha}\) results to adjust for recombination, for example by computing \(Z_{\alpha}\) - \(Z_{\alpha}^{E[r^2]}\) or \(Z_{\alpha}\)/\(Z_{\alpha}^{E[r^2]}\).
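
As a small worked example of these two adjustments, the sketch below copies the twelve non-NA values (positions 500 to 1600) from the Zalpha and Zalpha_expected outputs shown earlier and combines them. The variable names are illustrative.

```r
# Non-NA Zalpha values for positions 500 to 1600, copied from the output above.
zalpha <- c(0.09893585, 0.10944106, 0.11602964, 0.11215782,
            0.12419556, 0.12962040, 0.12583275, 0.11980489,
            0.10802033, 0.12064462, 0.10873917, 0.10267168)

# Matching Zalpha_expected values, copied from the output above.
zalpha_expected <- c(0.08633502, 0.08169169, 0.07981817, 0.08104324,
                     0.08134506, 0.08015626, 0.07984135, 0.08082988,
                     0.08263229, 0.08040794, 0.07822704, 0.08289555)

# Two simple ways of adjusting for recombination.
difference <- zalpha - zalpha_expected
ratio <- zalpha / zalpha_expected
```

With the package loaded, the same vectors could be taken directly from the lists returned by Zalpha and Zalpha_expected instead of being typed in.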

Other functions that take into account variable recombination rates are Zalpha_rsq_over_expected, Zalpha_log_rsq_over_expected, Zalpha_Zscore, and Zalpha_BetaCDF.

Zbeta

The Zbeta function works in the same way as the Zalpha function, but evaluates correlations between pairs of SNPs that lie on opposite sides of the target locus (one to the left, one to the right), rather than between pairs within each side. It is useful to use the \(Z_{\beta}\) statistic in conjunction with the \(Z_{\alpha}\) statistic, as they behave differently depending on how close to fixation the sweep is. For example, while a sweep is in progress both \(Z_{\alpha}\) and \(Z_{\beta}\) would be higher than in other areas of the chromosome without a sweep present. However, when a sweep reaches near-fixation, \(Z_{\beta}\) would decrease whereas \(Z_{\alpha}\) would remain high. Combining \(Z_{\alpha}\) and \(Z_{\beta}\) into new statistics such as \(Z_{\alpha}\)/\(Z_{\beta}\) is one way of analysing this.
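
The difference from \(Z_{\alpha}\) can be seen in a minimal base R sketch of \(Z_{\beta}\), which averages \(r^2\) over pairs made of one SNP from the left and one from the right. As before, this is an illustration rather than the package code: it assumes \(r^2\) is the squared Pearson correlation between 0/1-coded SNPs and a window of half the window size on each side with inclusive bounds. The function name zbeta_sketch and the toy data are hypothetical.

```r
# Hypothetical toy data: 12 SNPs (rows) by 40 chromosomes (columns), coded 0/1.
set.seed(2)
pos <- seq(100, 1200, by = 100)
x <- matrix(rbinom(12 * 40, 1, 0.5), nrow = 12)

# Illustrative sketch of Z_beta at target SNP i (not the package code).
zbeta_sketch <- function(i, pos, ws, x, minRandL = 4, minRL = 25) {
  left  <- which(pos >= pos[i] - ws / 2 & pos < pos[i])
  right <- which(pos > pos[i] & pos <= pos[i] + ws / 2)
  nL <- length(left)
  nR <- length(right)
  if (nL < minRandL || nR < minRandL || nL * nR < minRL) return(NA_real_)

  # Assumed definition: r^2 is the squared correlation between SNP rows.
  rsq <- cor(t(x))^2

  # Mean r^2 over all pairs with one SNP on the left and one on the right.
  mean(rsq[left, right])
}

zb <- sapply(seq_along(pos), function(i) zbeta_sketch(i, pos, 3000, x))
```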

Zalpha_all

Zalpha_all is the recommended function for using this package. It will run all the statistics included in the package (\(Z_{\alpha}\) and \(Z_{\beta}\) variations), so the user does not have to run multiple functions to obtain all the statistics they want. The function will only calculate the statistics for which it has been given the appropriate inputs, so it is flexible.

For example, this code will only run Zalpha, Zbeta and the two diversity statistics LR and L_plus_R, as an LDprofile was not supplied:

Zalpha_all(snps$positions,3000,as.matrix(snps[,3:12]))
#> $position
#>  [1]  100  200  300  400  500  600  700  800  900 1000 1100 1200 1300 1400 1500
#> [16] 1600 1700 1800 1900 2000
#> 
#> $LR
#>  [1] NA NA NA NA 60 70 78 84 88 90 90 88 84 78 70 60 NA NA NA NA
#> 
#> $L_plus_R
#>  [1]  NA  NA  NA  NA 111 101  93  87  83  81  81  83  87  93 101 111  NA  NA  NA
#> [20]  NA
#> 
#> $Zalpha
#>  [1]         NA         NA         NA         NA 0.09893585 0.10944106
#>  [7] 0.11602964 0.11215782 0.12419556 0.12962040 0.12583275 0.11980489
#> [13] 0.10802033 0.12064462 0.10873917 0.10267168         NA         NA
#> [19]         NA         NA
#> 
#> $Zbeta
#>  [1]        NA        NA        NA        NA 0.1280042 0.1298619 0.1219965
#>  [8] 0.1071535 0.1124896 0.1121871 0.1033178 0.1185118 0.1212802 0.1281512
#> [15] 0.1275420 0.1442328        NA        NA        NA        NA
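
The LR and L_plus_R values in this output are consistent with LR being the product of the number of SNPs to the left and right of the target, and L_plus_R being the number of SNP pairs within the left set plus the number within the right set. The base R sketch below reproduces them from the positions alone. It rests on assumptions inferred from the output rather than taken from the package source: the target SNP is excluded, and the window covers half the window size on each side with inclusive bounds.

```r
# Positions of the 20 SNPs in the example dataset and the settings used above.
pos <- seq(100, 2000, by = 100)
ws <- 3000
minRandL <- 4
minRL <- 25

# Number of SNPs to the left and right of each target within the window.
nL <- sapply(pos, function(p) sum(pos >= p - ws / 2 & pos < p))
nR <- sapply(pos, function(p) sum(pos > p & pos <= p + ws / 2))

# Targets with enough SNPs on both sides.
ok <- nL >= minRandL & nR >= minRandL & nL * nR >= minRL

# LR: product of the set sizes. L_plus_R: pairs within L plus pairs within R.
LR <- ifelse(ok, nL * nR, NA)
L_plus_R <- ifelse(ok, choose(nL, 2) + choose(nR, 2), NA)
```

For example, the SNP at position 500 has 4 SNPs to its left and 15 to its right, giving LR = 60 and L_plus_R = 6 + 105 = 111.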

Supplying an LDprofile will result in more of the statistics being calculated.

There are many ways that the resulting statistics can be combined to give new insights into the data; see Jacobs et al. (2016).

Identifying regions under selection

To find candidate regions for selection, first calculate the statistics across the chromosome, including any combined statistics that may be of interest. Then find the maximum value of each statistic in windows of around 200 Kb (the minimum value for the diversity statistics). Regions that are outliers compared to the rest of the chromosome can be considered candidates and investigated further.
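
The windowing step can be sketched in base R. The example reuses the non-NA \(Z_{\alpha}\) values from the output above and, only because the toy dataset spans 2 Kb, uses a hypothetical window of 500 bp rather than 200 Kb. The variable names are illustrative.

```r
# Positions and Zalpha values for the non-NA SNPs, copied from the output above.
pos <- seq(500, 1600, by = 100)
zalpha <- c(0.09893585, 0.10944106, 0.11602964, 0.11215782,
            0.12419556, 0.12962040, 0.12583275, 0.11980489,
            0.10802033, 0.12064462, 0.10873917, 0.10267168)

# Toy window size; around 200 Kb would be used for real data.
window_size <- 500

# Assign each SNP to a non-overlapping window by its position.
window_id <- floor(pos / window_size)

# Maximum Zalpha per window (use min instead for the diversity statistics).
window_max <- tapply(zalpha, window_id, max)

summary_table <- data.frame(
  window_start = as.numeric(names(window_max)) * window_size,
  max_zalpha   = as.numeric(window_max)
)
summary_table
```

Windows whose maximum stands out from the rest of the chromosome would then be the candidates to investigate further.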

References

Jacobs, G.S., T.J. Sluckin, and T. Kivisild, Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps. Genetics, 2016. 203(4): p. 1807