inbreedR

The idea behind inbreedR is to provide a consistent framework for the analysis of inbreeding and heterozygosity-fitness correlations (HFCs) based on genetic markers. This is the initial version of the package, and comments and suggestions are welcome: write to martin.adam.stoffel[at]gmail.com.

Installation

inbreedR is available on CRAN. Here is the code to download and install the current stable release.

install.packages("inbreedR")

The latest version can be downloaded from GitHub with the following code:

install.packages("devtools")
devtools::install_github("mastoffel/inbreedR", build_vignettes = TRUE)

The package provides documentation for every function. To get an overview, just look at inbreedR’s help file.

library("inbreedR")
?inbreedR

The inbreedR functions used in this vignette are check_data, convert_raw, g2_microsats, g2_snps, HHC, sMLH, r2_hf, r2_Wf and resample_g2.

Example datasets

In the following sections, the functionality of inbreedR is illustrated using genetic and phenotypic data from an inbred captive population of oldfield mice (Peromyscus polionotus) (J. I. Hoffman et al. 2014). These mice were paired to produce offspring with a range of inbreeding coefficients (0-0.453) over six generations of laboratory breeding and the resulting pedigree was recorded, from which individual f values were calculated. Example files are provided containing the genotypes of 36 P. polionotus individuals at 12 microsatellites and 13,198 SNPs respectively. Data on body mass at weaning, a fitness proxy, are also available for the same individuals.

library(inbreedR)
data("mouse_msats") # microsatellite data 
data("mouse_snps")  # snp data
data("bodyweight")  # fitness data

Data format and checking

The working format of inbreedR is an individual * loci matrix or data frame in which rows represent individuals and each column represents a locus. If an individual is heterozygous at a given locus, it is coded as 1, whereas a homozygote is coded as 0, and missing data are coded as NA. The mouse_snps dataset accompanying the package is already formatted in the right way.

data("mouse_snps")
mouse_snps[1:10, 1:10]
#>    SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10
#> 11    0   NA   NA    0    0    0   NA    0    1     1
#> 22    0    0   NA    0    0    0    0    1    0     0
#> 32    0   NA   NA    0    0    0   NA   NA    1     0
#> 33    0    0    0    0    0    0    0    1    0     0
#> 34    1   NA   NA    1    0    0    0    0    0     0
#> 35    0   NA   NA    0    0    0   NA    0    1     0
#> 36    1    0   NA    0    0    0   NA    0    1     0
#> 1     1   NA   NA    0    0    0   NA    0    0     0
#> 2     0    0   NA    0    0    0   NA   NA    0     0
#> 3     0    0   NA    0    0    0   NA   NA    0     0

You can check whether your data are in the right format with the check_data function, which returns TRUE if they are and stops with an error message otherwise. See ?check_data for the exact checks this function performs.

check_data(mouse_snps, num_ind = 36, num_loci = 13198)
#> [1] TRUE
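
For a sense of what such a format check involves, here is a minimal base-R sketch. It is not the check_data implementation: the helper name looks_like_working_format and the toy matrix are made up for illustration, and ?check_data remains the reference for what the package actually tests.

```r
# Hypothetical helper (not part of inbreedR): TRUE if x is an
# individual-by-locus table holding only 0 (homozygous),
# 1 (heterozygous) or NA (missing), with the expected dimensions.
looks_like_working_format <- function(x, num_ind = NULL, num_loci = NULL) {
  m <- as.matrix(x)
  ok_values <- is.numeric(m) && all(m %in% c(0, 1, NA))
  ok_rows <- is.null(num_ind) || nrow(m) == num_ind
  ok_cols <- is.null(num_loci) || ncol(m) == num_loci
  ok_values && ok_rows && ok_cols
}

# Toy data: 2 individuals typed at 3 loci
toy <- matrix(c(0, 1, NA,
                1, 1, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("ind1", "ind2"), c("loc1", "loc2", "loc3")))

looks_like_working_format(toy, num_ind = 2, num_loci = 3)

# A genotype coded as 2 violates the format
bad <- toy
bad[1, 1] <- 2
looks_like_working_format(bad)
```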

Conversion from a more common format

convert_raw is a function to convert a more common format, where each locus is represented by two columns (alleles), into the inbreedR working format. Microsatellite data is often formatted like mouse_msats, which is the second dataset accompanying the package.

data("mouse_msats")
mouse_msats[1:8, 1:8]
#>    Pml01.1 Pml01.2 Po3-68.1 Po3-68.2 Plgt58.1 Plgt58.2 Plgt62.1 Plgt62.2
#> 1       32      32       52       38       30       30       30       20
#> 2       14      14       20       20       36       24       30       30
#> 5       24      14       42       42       36       32       30       30
#> 6       14      14       40       20       32       32       38       30
#> 7       34      20       50       48       32       20       28       28
#> 8       14      14       42       38       32       10       38       28
#> 9       24      24       60       20       32       30       38       28
#> 10      32      20       46       38       30       30       30       20

To convert it into the inbreedR working format, just use the convert_raw function.

mouse_microsats <- convert_raw(mouse_msats) 
mouse_microsats[1:8, 1:8]
#>    V1 V2 V3 V4 V5 V6 V7 V8
#> 1   0  1  0  1  1  1  1  0
#> 2   0  0  1  0  1  0  1  1
#> 5   1  0  1  0  0  0  0  0
#> 6   0  1  0  1  1  0  1  1
#> 7   1  1  1  0  0  1  1  1
#> 8   0  1  1  1  1  0  1  1
#> 9   0  1  1  1  1  1  1  0
#> 10  1  1  0  1  0  1  1  0

The same procedure works when the two adjacent columns hold letters (e.g. the bases ‘A’ and ‘T’) instead of microsatellite allele lengths.
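
The logic of the conversion can be sketched in a few lines of base R. This is an illustration only, not the convert_raw source: the helper alleles_to_het is hypothetical, and it assumes that columns 1-2 belong to the first locus, columns 3-4 to the second, and so on, with missing alleles coded as NA. The first toy table copies three individuals and two loci from the mouse_msats printout above.

```r
# Hypothetical helper (not part of inbreedR): turn a table with two allele
# columns per locus into a 0/1/NA heterozygosity matrix.
alleles_to_het <- function(raw) {
  raw <- as.data.frame(raw, stringsAsFactors = FALSE)
  stopifnot(ncol(raw) %% 2 == 0)
  n_loci <- ncol(raw) / 2
  het <- lapply(seq_len(n_loci), function(i) {
    a1 <- as.character(raw[[2 * i - 1]])
    a2 <- as.character(raw[[2 * i]])
    as.integer(a1 != a2)  # 1 = heterozygous, 0 = homozygous, NA = missing
  })
  matrix(unlist(het), nrow = nrow(raw),
         dimnames = list(rownames(raw), paste0("V", seq_len(n_loci))))
}

# Individuals 1, 2 and 5 at the first two loci of mouse_msats
toy_msats <- data.frame(
  "Pml01.1"  = c(32, 14, 24), "Pml01.2"  = c(32, 14, 14),
  "Po3-68.1" = c(52, 20, 42), "Po3-68.2" = c(38, 20, 42),
  row.names = c("1", "2", "5"), check.names = FALSE
)
alleles_to_het(toy_msats)

# Letter-coded alleles are handled the same way
toy_snps <- data.frame(snp1.a = c("A", "T", NA), snp1.b = c("T", "T", "A"))
alleles_to_het(toy_snps)
```

For the microsatellite toy table, the result agrees with columns V1 and V2 of the convert_raw output shown above for the same individuals.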

A short theory of heterozygosity-fitness correlations (HFC)

Most HFC studies solely report the correlation between heterozygosity (h) and fitness (W). However, according to HFC theory, this correlation results from the simultaneous effects of inbreeding level (f) on fitness (\(r(W,f)\)) and heterozygosity (\(r(h,f)\)) (Slate et al. 2004; Szulkin, Bierne, and David 2010):

\[ r(W,h) = r(h,f)r(W,f) \] (Equation 1)

Thus, empirical data on heterozygosity and fitness, together with the above equation, allow estimation of the impact of inbreeding on fitness. In the absence of a pedigree, f cannot be directly estimated. Instead, one can use the extent to which heterozygosity is correlated across loci, termed identity disequilibrium (ID), as a proxy. A measure of ID that can be related to HFC theory is the two-locus heterozygosity disequilibrium, \(g_2\) (David et al. 2007), which quantifies the extent to which heterozygosities are correlated across pairs of loci. Based on \(g_2\) as an estimate of ID, it is then possible to calculate \(\hat{r}^2(h, f)\) as follows (Szulkin, Bierne, and David 2010):

\[\hat{r}^2(h, f) = \frac{\hat{g}_{2}}{\hat{\sigma}^2(h)}\] (Equation 2)

Finally, the expected determination coefficient between a fitness trait and inbreeding level can be derived by rearranging Equation 1 (Szulkin, Bierne, and David 2010):

\[\hat{r}^2(W, f) = \frac{\hat{r}^2(W, h)}{\hat{r}^2(h, f)}\] (Equation 3)
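
To make the rearrangement explicit, here is a short derivation sketch that uses only Equations 1 and 2. Squaring Equation 1 and dividing by \(r^2(h,f)\) gives Equation 3, and substituting Equation 2 expresses the result in terms of quantities that can be estimated from marker and fitness data:

\[ r^2(W,h) = r^2(h,f)\,r^2(W,f) \;\Longrightarrow\; r^2(W,f) = \frac{r^2(W,h)}{r^2(h,f)} \]

\[ \hat{r}^2(W,f) = \frac{\hat{r}^2(W,h)\,\hat{\sigma}^2(h)}{\hat{g}_2} \]

The division assumes \(r^2(h,f) > 0\), which in practice means a positive \(\hat{g}_2\).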

Software is already available for calculating \(g_2\) from microsatellite datasets (David et al. 2007). However, for larger datasets such as SNPs, the original formula is not computationally practical, as it requires a double summation over all pairs of loci. For example, with 15,000 loci, the double summation takes on the order of \(0.2 \times 10^9\) computation steps. For this reason, inbreedR implements a computationally more feasible formula based on the assumption that missing values do not vary much between pairs of loci (J. I. Hoffman et al. 2014). The \(g_2\) parameter in turn forms the foundation of the above framework for analysing HFCs, which Szulkin, Bierne, and David (2010) recommend computing routinely in future HFC studies.
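
The order of magnitude quoted above can be checked with simple arithmetic. The snippet assumes, for illustration, that a naive double loop spends one step per pair of loci; the real cost per pair is larger because each pair also involves a pass over individuals.

```r
n_loci <- 15000

# Pairs visited by a naive double loop over loci (ordered pairs)
ordered_pairs <- n_loci^2

# Distinct unordered pairs of different loci
unordered_pairs <- choose(n_loci, 2)

c(ordered = ordered_pairs, unordered = unordered_pairs)
```

Both counts are on the order of 10^8, in line with the figure given in the text.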

Identity disequilibrium

The package provides two functions to calculate \(g_2\), a proxy for identity disequilibrium: g2_microsats for small datasets (e.g. microsatellites) and g2_snps for large datasets (e.g. SNPs).

Have a look at the help files with ?g2_microsats and ?g2_snps for more information on the formulae.

For both microsatellites and SNPs, inbreedR calculates confidence intervals by bootstrapping over individuals. It also permutes the genetic data to generate a P-value for the null hypothesis of no variance in inbreeding in the sample (i.e. \(g_2\) = 0).

g2_mouse_microsats <- g2_microsats(mouse_microsats, nperm = 100, nboot = 100, CI = 0.95)
g2_mouse_snps <- g2_snps(mouse_snps, nperm = 100, nboot = 10, 
                         CI = 0.95, parallel = FALSE, ncores = NULL)

To display a summary of the results just print the output of an inbreedR function.

g2_mouse_microsats
#> 
#> 
#> Calculation of identity disequilibrium with g2 for microsatellite data
#> ----------------------------------------------------------------------
#> 
#> Data: 36 observations at 12 markers
#> Function call = g2_microsats(genotypes = mouse_microsats, nperm = 100, nboot = 100,     CI = 0.95)
#> 
#> g2 = 0.02179805, se = 0.02063147
#> 
#> confidence interval 
#>         2.5%        97.5% 
#> -0.005335286  0.074037872 
#> 
#> p (g2 > 0) = 0.1 (based on 100 permutations)
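
The two resampling schemes can be illustrated without the package. The sketch below is not the inbreedR code and does not compute \(g_2\): as a simple stand-in statistic it uses the variance in the proportion of heterozygous loci per individual, on simulated toy genotypes. The way the permutation is done (shuffling each locus independently across individuals) and the inclusion of the observed value in the permutation distribution are assumptions made for this illustration.

```r
set.seed(1)

# Toy 0/1 genotypes: 30 individuals x 20 loci. Individuals differ in their
# probability of being heterozygous, mimicking variance in inbreeding.
n_ind <- 30
n_loci <- 20
p_het <- runif(n_ind, 0.2, 0.8)
geno <- matrix(rbinom(n_ind * n_loci, 1, rep(p_het, times = n_loci)),
               nrow = n_ind)

# Stand-in statistic (NOT g2): variance of individual heterozygosity
stat <- function(g) var(rowMeans(g, na.rm = TRUE))
observed <- stat(geno)

# Bootstrap over individuals: resample rows with replacement
nboot <- 200
boot <- replicate(nboot,
                  stat(geno[sample(n_ind, replace = TRUE), , drop = FALSE]))
ci <- quantile(boot, c(0.025, 0.975))

# Permutation: shuffle each locus independently across individuals, which
# keeps per-locus heterozygosity but removes correlation among loci
nperm <- 200
perm <- replicate(nperm, stat(apply(geno, 2, sample)))
p_val <- mean(c(observed, perm) >= observed)

observed
ci
p_val
```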

Calling plot on the output shows the distribution of the bootstrap results together with the confidence interval.

par(mfrow=c(1,2))
plot(g2_mouse_microsats, main = "Microsatellites",
     col = "cornflowerblue", cex.axis=0.85)
plot(g2_mouse_snps, main = "SNPs",
     col = "darkgoldenrod1", cex.axis=0.85)

Distribution of g2 from bootstrapping with confidence interval

Another approach for estimating ID is to divide the marker panel into two random subsets, compute the correlation in heterozygosity between the two, and repeat this hundreds or thousands of times in order to obtain a distribution of heterozygosity-heterozygosity correlation coefficients (HHCs) (Balloux, Amos, and Coulson 2004). This approach is intuitive but can be criticised on the grounds that samples within the HHC distribution are non-independent. Moreover, \(g_2\) is preferable because it directly relates to HFC theory (Equation 2). The HHC function in inbreedR calculates HHCs together with confidence intervals; the niter argument specifies how often the dataset is randomly split into two halves. The results can be printed as text or plotted as histograms with CIs.

HHC_mouse_microsats <- HHC(mouse_microsats , niter = 1000)
HHC_mouse_snps <- HHC(mouse_snps, niter = 100)
HHC_mouse_microsats
#> 
#> 
#> heterozygosity-heterozygosity correlations
#> ------------------------------------------
#> 
#> Data: 36 observations at 12 markers
#> Function call = HHC(genotypes = mouse_microsats, niter = 1000)
#> 
#> HHC Mean : 0.194
#> HHC SD: 0.121
#> HHC CI: [-0.032, 0.443]

par(mfrow=c(1,2))
plot(HHC_mouse_microsats, main = "Microsatellites",
     col = "cornflowerblue", cex.axis=0.85)
plot(HHC_mouse_snps, main = "SNPs",
     col = "darkgoldenrod1", cex.axis=0.85)

Distribution of heterozygosity-heterozygosity correlations
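
The split-and-correlate procedure described above can be sketched in base R as follows. This is an illustration on simulated genotypes, not the HHC source: the helper one_hhc is hypothetical, and it uses the plain proportion of heterozygous loci as the heterozygosity measure, which may differ from the measure used inside the package.

```r
set.seed(42)

# Toy 0/1 genotypes: 30 individuals x 20 loci with variable inbreeding
n_ind <- 30
n_loci <- 20
p_het <- runif(n_ind, 0.2, 0.8)
geno <- matrix(rbinom(n_ind * n_loci, 1, rep(p_het, times = n_loci)),
               nrow = n_ind)

# Hypothetical helper: one heterozygosity-heterozygosity correlation
one_hhc <- function(g) {
  loci <- sample(ncol(g))                 # random order of loci
  half <- seq_len(floor(ncol(g) / 2))     # positions of the first half
  h1 <- rowMeans(g[, loci[half], drop = FALSE], na.rm = TRUE)
  h2 <- rowMeans(g[, loci[-half], drop = FALSE], na.rm = TRUE)
  cor(h1, h2, use = "complete.obs")
}

# Repeat the random split many times and summarise the distribution
hhc <- replicate(1000, one_hhc(geno))
c(mean = mean(hhc), sd = sd(hhc))
quantile(hhc, c(0.025, 0.975))
```

Because every split reuses the same individuals and loci, the 1000 coefficients are not independent draws, which is the criticism mentioned above.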

HFC parameters

Assuming that HFCs are due to inbreeding depression, it is possible to calculate both the expected squared correlation between heterozygosity and inbreeding level (\(\hat{r}^2(h, f)\)) and the expected squared correlation between a fitness trait and inbreeding (\(\hat{r}^2(W, f)\)), as described in Equations 2 and 3. These are implemented in inbreedR in the functions r2_hf and r2_Wf. As with the glm function, the distribution of the fitness trait can be specified with the family argument, as shown below:

# r^2 between inbreeding and heterozygosity
hf <- r2_hf(genotypes = mouse_microsats, type = "msats")
# r^2 between inbreeding and fitness
Wf <- r2_Wf(genotypes = mouse_microsats, trait = bodyweight, 
            family = gaussian, type = "msats")

In addition, bootstrapping over individuals can be used to estimate confidence intervals around these estimates. The computation can be parallelized by specifying parallel = TRUE.

# r^2 between inbreeding and heterozygosity with bootstrapping
hf <- r2_hf(genotypes = mouse_microsats, nboot = 100, type = "msats", parallel = FALSE)

Note: To plot the histogram with confidence interval for r2_hf, pass plottype = "histogram" to the plot function, because the output of this function has a second plot type, described later in this vignette.

plot(hf, plottype = "histogram")

Workflow for estimating the impact of inbreeding on fitness using HFC

Szulkin, Bierne, and David (2010) in their online Appendix 1 provide a worked example of how to estimate the impact of inbreeding on fitness within an HFC framework. Below, we show how the required calculations can be implemented in inbreedR, and compare the results based on microsatellite and SNP data derived from a single inbred population of oldfield mice. We start with the estimation of identity disequilibrium (\(\hat{g}_2\)) and the calculation of the variance of standardized multilocus heterozygosity (\(\hat{\sigma}^2(h)\)), followed by the regression slope of fitness on heterozygosity (\(\hat{\beta}_{Wh}\)) and the three correlations from Equation 1. Example code for the microsatellite dataset is shown below and the results for both microsatellites and SNPs are given in Table 1.

# g2
g2 <- g2_microsats(mouse_microsats)
# calculate sMLH
het <- sMLH(mouse_microsats)
# variance in sMLH
het_var <- var(het)
# Linear model of fitness trait on heterozygosity
mod <- lm(bodyweight ~ het)
# regression slope
beta <- coef(mod)[2]
# r2 between fitness and heterozygosity
Wh <- cor(bodyweight,predict(mod))^2
# r2 between inbreeding and heterozygosity
hf <- r2_hf(genotypes = mouse_microsats, type = "msats")
# r2 between inbreeding and fitness
Wf <- r2_Wf(genotypes = mouse_microsats, trait = bodyweight, 
            family = gaussian, type = "msats")

Table 1: Descriptors of HFCs

| Markers | \(\hat{g}_2\) | \(\hat{\sigma}^2(h)\) | \(\hat{\beta}_{Wh}\) | \(\hat{r}^2_{Wh}\) | \(\hat{r}^2_{hf}\) | \(\hat{r}^2_{Wf}\) |
|---|---|---|---|---|---|---|
| microsats | 0.022 | 0.078 | 1.601 | 0.121 | 0.28 | 0.433 |
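
As a quick consistency check, the microsatellite values in Table 1 can be plugged into Equations 2 and 3 by hand. The inputs below are the rounded table entries, so the results should agree with the table only up to rounding; the variable names are chosen for this illustration.

```r
# Values copied from Table 1 (microsatellites, rounded)
g2      <- 0.022   # identity disequilibrium
het_var <- 0.078   # variance in sMLH
r2_Wh   <- 0.121   # r^2 between fitness and heterozygosity

r2_hf_check <- g2 / het_var          # Equation 2
r2_Wf_check <- r2_Wh / r2_hf_check   # Equation 3

round(c(r2_hf = r2_hf_check, r2_Wf = r2_Wf_check), 2)
```

The hand calculation gives about 0.28 and 0.43, matching the last two columns of the table to two decimals.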

Sensitivity to the number of markers

Subsampling analysis (Miller et al. 2013; Stoffel et al. 2015) can provide insights into the power provided by a given marker panel. The resample_g2 function within inbreedR randomly selects marker subsets of specified sizes, from which trends in the average estimated \(g_2\), as well as in its variance, with marker number can be derived. Note that the variance will decrease as the proportion of sampled loci increases, since larger subsets drawn from the same panel share more loci. The function takes the subset sizes (subsets) and the number of times each subset size is resampled (nboot), and calculates the respective \(g_2\) values together with their mean and standard deviation. The results can be visualised as a series of boxplots with plot.

resamp_g2_mouse_microsats <- resample_g2(mouse_microsats, subsets = c(2,4,6,8,10,12), 
                                     nboot = 100, type = "msats")
resamp_g2_mouse_snps <- resample_g2(mouse_snps, subsets = c(100, 500, 1000, 2000, 5000, 
                                    13000), nboot = 10, type = "snps")
par(mfrow = c(1, 2))
plot(resamp_g2_mouse_microsats, main = "Microsatellites", col = "cornflowerblue", cex.axis=0.85)
plot(resamp_g2_mouse_snps, main = "SNPs", col = "darkgoldenrod1", cex.axis=0.85)

g2 for different subsets of markers
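
The subsampling idea can be sketched in base R on simulated genotypes. This is not the resample_g2 implementation and does not compute \(g_2\): the stand-in statistic is the variance in individual heterozygosity, and drawing loci without replacement is an assumption made for this illustration.

```r
set.seed(7)

# Toy 0/1 genotypes: 30 individuals x 12 loci with variable inbreeding
n_ind <- 30
n_loci <- 12
p_het <- runif(n_ind, 0.2, 0.8)
geno <- matrix(rbinom(n_ind * n_loci, 1, rep(p_het, times = n_loci)),
               nrow = n_ind)

# Stand-in statistic (NOT g2): variance of individual heterozygosity
stat <- function(g) var(rowMeans(g, na.rm = TRUE))

subsets <- c(2, 4, 6, 8, 10, 12)
nboot <- 100

# For each subset size, draw loci without replacement nboot times
res <- sapply(subsets, function(k) {
  vals <- replicate(nboot, stat(geno[, sample(n_loci, k), drop = FALSE]))
  c(mean = mean(vals), sd = sd(vals))
})
colnames(res) <- subsets
round(res, 4)
```

When the subset size equals the full panel, every draw contains the same loci, so the spread across draws is essentially zero. This is the extreme case of the shrinking variance noted above.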

Finally, sensitivity of \(\hat{r}^2(h, f)\) towards the number of markers used can be explored in a similar way using the r2_hf function, which again allows specification of the marker subsets and how often each subset should be sampled.

r2_hf_mouse_microsats <- r2_hf(mouse_microsats, subsets = c(2,4,6,8,10,12), 
                           nboot = 100, type = "msats")
r2_hf_mouse_snps <- r2_hf(mouse_snps, subsets = c(100, 500, 1000, 2000, 5000, 13000), 
                          nboot = 10, type = "snps")

Expected r2 between inbreeding level (f) and heterozygosity

Extracting raw data from inbreedR objects

You may wish to extract and plot the data yourself. Most function outputs are lists of class inbreed. The Value section of each function’s documentation (?fun) describes the elements you can extract. Alternatively, use str() to look at the object’s structure. Index the function output with [["name"]] or $name, as in the following example:

Running the function.

g2_seals <- g2_microsats(mouse_microsats, nperm = 100, 
                         nboot = 100, CI = 0.95)

Looking at the structure.

str(g2_seals)
#> List of 9
#>  $ call     : language g2_microsats(genotypes = mouse_microsats, nperm = 100, nboot = 100,      CI = 0.95)
#>  $ g2       : num 0.0218
#>  $ p_val    : num 0.02
#>  $ g2_permut: num [1:100] 0.0218 -0.00514 0.01906 -0.01579 -0.01767 ...
#>  $ g2_boot  : num [1:100] 0.0218 0.00255 0.01156 0.01815 0.0272 ...
#>  $ CI_boot  : Named num [1:2] -0.00711 0.06944
#>   ..- attr(*, "names")= chr [1:2] "2.5%" "97.5%"
#>  $ g2_se    : num 0.0208
#>  $ nobs     : int 36
#>  $ nloc     : int 12
#>  - attr(*, "class")= chr "inbreed"

Now extract whatever you want from the object, such as the \(g_2\) bootstrap results.

g2_bootstrap_results <- g2_seals$g2_boot
str(g2_bootstrap_results)
#>  num [1:100] 0.0218 0.00255 0.01156 0.01815 0.0272 ...
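
Once extracted, the bootstrap values are an ordinary numeric vector and can be summarised or plotted with base R. The snippet below uses a simulated stand-in vector so that it runs without inbreedR; with the package loaded, g2_seals$g2_boot would take its place. Computing the interval with quantile is an assumption for this illustration, although the element names "2.5%" and "97.5%" in the str output above follow the same naming.

```r
set.seed(3)

# Stand-in for g2_seals$g2_boot (simulated, for illustration only)
g2_boot <- rnorm(100, mean = 0.02, sd = 0.02)

# Percentile interval from the bootstrap distribution
ci <- quantile(g2_boot, probs = c(0.025, 0.975))
ci

# Custom histogram with the interval limits marked
hist(g2_boot, breaks = 20, main = "Bootstrapped g2", xlab = "g2")
abline(v = ci, lty = 2)
```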

Literature

Balloux, F, W Amos, and T Coulson. 2004. “Does Heterozygosity Estimate Inbreeding in Real Populations?” Molecular Ecology 13 (10). Wiley Online Library: 3021–31.

Coltman, David W, Jill G Pilkington, Judith A Smith, and Josephine M Pemberton. 1999. “Parasite-Mediated Selection Against Inbred Soay Sheep in a Free-Living, Island Population.” Evolution 53 (4): 1259–67.

David, Patrice, Benoît Pujol, Frédérique Viard, Vincent Castella, and Jérôme Goudet. 2007. “Reliable selfing rate estimates from imperfect population genetic data.” Molecular Ecology 16 (12): 2474.

Hoffman, J I, F Simpson, P David, J M Rijks, T Kuiken, M A S Thorne, R C Lacy, and K K Dasmahapatra. 2014. “High-throughput sequencing reveals inbreeding depression in a natural population.” Proceedings of the National Academy of Sciences 111 (10): 3775–80.

Miller, J M, R M Malenfant, P David, C S Davis, J Poissant, J T Hogg, M Festa-Bianchet, and D W Coltman. 2013. “Estimating genome-wide heterozygosity: effects of demographic history and marker type.” Heredity 112 (3): 240–47.

Slate, J, P David, K G Dodds, B A Veenvliet, B C Glass, T E Broad, and J C McEwan. 2004. “Understanding the relationship between the inbreeding coefficient and multilocus heterozygosity: theoretical expectations and empirical data.” Heredity 93 (3): 255.

Stoffel, Martin A, Barbara A Caspers, Jaume Forcada, Athina Giannakara, Markus Baier, Luke Eberhart-Phillips, Caroline Müller, and Joseph I Hoffman. 2015. “Chemical Fingerprints Encode Mother–offspring Similarity, Colony Membership, Relatedness, and Genetic Quality in Fur Seals.” Proceedings of the National Academy of Sciences. National Acad Sciences, 201506076.

Szulkin, Marta, Nicolas Bierne, and Patrice David. 2010. “Heterozygosity-Fitness Correlations: A Time for Reappraisal.” Evolution 64 (5). Wiley Online Library: 1202–17.