Line cross analysis (LCA), or partitioning the contribution of composite genetic effects (CGEs) to the mean phenotype of cohorts, is widely used to investigate the genetic architecture of traits. This approach uses two parental strains that have diverged in a phenotype of interest. These parents are crossed, producing an F1, and subsequent crosses (e.g. F2, backcross, reciprocals) are made to generate groups that have different combinations of parental genes. We refer to each of these groups as a cohort. Using a weighted least squares regression with weights inversely proportional to the variance of the cohort means, the degree to which a phenotype is determined by different CGEs (e.g. additive, dominance, and epistatic gene action) may be estimated [1, 2]. Traditionally, LCA has been accomplished by a process referred to as the joint-scaling test, which is essentially weighted least squares regression with forward variable selection. However, this approach has a number of documented problems [3]. A full information-theoretic (I-T) approach to model selection and parameter estimation alleviates difficulties associated with previous approaches and provides additional understanding that is not possible under older approaches such as the joint-scaling test [4]. SAGA provides a full I-T approach to LCA that leverages the finite sample size corrected version of the Akaike information criterion (AICc) to explore all possible models and make unbiased and, when appropriate, model averaged estimates of the contribution of CGEs to cohort means. SAGA includes four functions and seven empirical datasets.
Functions: AnalyzeCrossesMM, DisplayCmatrix, VisModelSpace, EvaluateModel
Data:
We use the function glm from the stats package, which ships with base R, to perform weighted least squares regression [7]. glm returns the parameter and standard error estimates conditional on the model as well as the AIC value for the model. We convert AIC to AICc using equation 1, where n is the number of cohorts and K is the number of parameters being estimated.
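The fitting step can be sketched in a few lines of base R. This is a minimal illustration rather than SAGA's internal code: it uses five mixed-sex cohorts whose means and standard errors come from Table 2 and whose Aa coefficients come from Table 1, it assumes the usual form of the small-sample correction for equation 1, and it assumes that K counts the regression coefficients plus the residual variance. The helper name aicc is hypothetical.

```r
# Five mixed-sex cohorts (IDs 3, 6, 9, 12, 15): means and SEs from Table 2,
# Aa coefficients from Table 1.
cohorts <- data.frame(
  id   = c(3, 6, 9, 12, 15),
  mean = c(33.62500, 42.50000, 44.80000, 37.25000, 23.85714),
  se   = c(2.31407, 4.31774, 6.93830, 8.22977, 4.52205),
  Aa   = c(1, -1, 0, 0, 0)
)

# Weighted least squares: weights are the inverse variances of the cohort means.
# The intercept plays the role of the mean (M) term.
fit <- glm(mean ~ Aa, data = cohorts, weights = 1 / se^2)

# Small-sample correction (assumed standard form of equation 1).
aicc <- function(aic, n, K) aic + (2 * K * (K + 1)) / (n - K - 1)

n <- nrow(cohorts)
K <- length(coef(fit)) + 1  # assumption: coefficients plus residual variance
fit_aicc <- aicc(AIC(fit), n, K)

coef(summary(fit))  # estimates and standard errors conditional on this model
fit_aicc
```

With so few cohorts the correction term is large, which is the situation AICc is designed for.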
We then calculate AICc differences (delta AICc) using equation 2.
Here AICc min is the minimum AICc score calculated across all possible models and AICci is the AICc calculated for a specific model. Delta AICc is used to generate Akaike weights (wi) using equation 3. The denominator in this equation is the summation of the numerator across all possible models being evaluated (R).
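As a rough sketch of how delta AICc and Akaike weights relate, the following base R function computes both from a vector of AICc scores. The function name akaike_weights and the example scores are hypothetical, and the weight formula is the standard one from Burnham and Anderson [4], which equation 3 is assumed to follow.

```r
# Convert a vector of AICc scores (one per model) into Akaike weights.
akaike_weights <- function(aicc) {
  delta   <- aicc - min(aicc)   # AICc difference from the best model
  rel_lik <- exp(-0.5 * delta)  # relative likelihood of each model
  rel_lik / sum(rel_lik)        # normalise across all R models
}

# Hypothetical AICc scores for four candidate models.
aicc_scores <- c(100.0, 101.2, 104.5, 110.3)
w <- akaike_weights(aicc_scores)
round(w, 3)
```

The weights sum to one, and the ratio of two weights depends only on the difference between the two AICc scores.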
Under the default settings, if wi of the best model is 0.9 or greater then SAGA will perform parameter estimation under a single model. If no model reaches this threshold then we construct a 95% confidence set of models that contains the minimum number of models whose wi sum to 0.95. To calculate model averaged parameter estimates and unconditional standard errors we recalculate wi for each model performing the summation in the denominator of equation 3 across all models in the confidence set. The model weighted parameter estimates are then calculated using equation 4 where wi is the recalculated model weight and omega hat i is the parameter estimate from the model; the product of these values is summed across all models R in the confidence set.
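The construction of the confidence set can be illustrated as follows. This is a sketch under the assumption that models are added in order of decreasing weight until the cumulative weight reaches the requested level; the function name confidence_set and the example weights are hypothetical and are not part of SAGA.

```r
# Smallest set of models whose Akaike weights sum to at least `level`,
# with the weights recalculated so that they sum to one within the set.
confidence_set <- function(w, level = 0.95) {
  ord    <- order(w, decreasing = TRUE)
  n_keep <- which(cumsum(w[ord]) >= level)[1]
  keep   <- ord[seq_len(n_keep)]
  list(models = keep, weights = w[keep] / sum(w[keep]))
}

# Hypothetical Akaike weights for five models.
w  <- c(0.50, 0.30, 0.12, 0.05, 0.03)
cs <- confidence_set(w)
cs$models   # indices of the models retained
cs$weights  # recalculated weights
```

Here the first four models are retained because their weights sum to 0.97, and the fifth is dropped.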
Standard error estimates that are unconditional on any one model are calculated using equation 5.
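A minimal sketch of the model averaging step is given below. It assumes the estimator forms given by Burnham and Anderson [4] for equations 4 and 5, and it assumes for simplicity that the parameter appears in every model of the confidence set. The function name model_average and the example values are hypothetical.

```r
# Model averaged estimate and unconditional standard error for one parameter.
# w:   recalculated Akaike weights of the models in the confidence set
# est: estimate of the parameter under each model
# se:  conditional standard error of the parameter under each model
model_average <- function(w, est, se) {
  avg    <- sum(w * est)
  unc_se <- sum(w * sqrt(se^2 + (est - avg)^2))
  c(estimate = avg, se = unc_se)
}

# Hypothetical values for three models.
w   <- c(0.6, 0.3, 0.1)
est <- c(2.0, 2.5, 3.0)
se  <- c(0.4, 0.5, 0.6)
res <- model_average(w, est, se)
res
```

The unconditional standard error is never smaller than the weighted mean of the conditional standard errors, because it also accounts for disagreement among models.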
Finally, variable importance vi is calculated by summing wi of all models R in which a CGE occurs (Eq. 6).
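For reference, the quantities described in this section take the following standard forms in Burnham and Anderson [4]. This is a reconstruction under the assumption that equations 1 to 6 follow those forms. The symbols g_i (model i), the indicator I_j(g_i) (equal to 1 when CGE j is included in model g_i and 0 otherwise), and the subscript j on v are notation introduced here for clarity.

```latex
\begin{align}
\mathrm{AIC}_c &= \mathrm{AIC} + \frac{2K(K+1)}{n-K-1} \tag{1}\\
\Delta_i &= \mathrm{AIC}_{c,i} - \mathrm{AIC}_{c,\min} \tag{2}\\
w_i &= \frac{\exp(-\Delta_i/2)}{\sum_{r=1}^{R}\exp(-\Delta_r/2)} \tag{3}\\
\hat{\bar{\omega}} &= \sum_{i=1}^{R} w_i\,\hat{\omega}_i \tag{4}\\
\widehat{\mathrm{SE}}\bigl(\hat{\bar{\omega}}\bigr) &= \sum_{i=1}^{R} w_i
  \sqrt{\widehat{\mathrm{var}}\bigl(\hat{\omega}_i \mid g_i\bigr)
  + \bigl(\hat{\omega}_i - \hat{\bar{\omega}}\bigr)^2} \tag{5}\\
v_j &= \sum_{i=1}^{R} w_i\, I_j(g_i) \tag{6}
\end{align}
```

In equations 4 and 5 the sums run over the models in the confidence set and use the recalculated weights.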
A stable, tested version of SAGA is available from the CRAN repository, or the most recent version may be installed from GitHub using the devtools package:
Installing from CRAN
install.packages("SAGA")
Installing from GitHub
library(devtools)
install_github("coleoguy/SAGA", build_vignettes = TRUE)
C-matrix of composite genetic effects
The first step in the analysis of line cross data is the choice of a C-matrix that describes the expected contribution of different types of gene action to cohort phenotypes. By default SAGA will use a C-matrix that is scaled to the midparent mean (equivalent to Finf) and includes 23 potential CGEs. For each CGE we have calculated coefficients for 23 potential crosses, each of which is divided into male, female, and mixed sex cohorts. This C-matrix therefore has 69 rows, and the row numbers are used to identify the cohorts being used in an experiment. The function DisplayCmatrix is available so that we can determine which IDs should be used to identify the cohorts included in an analysis.
# print the C-matrix to the terminal
DisplayCmatrix(table = "MP")
Table 1. The first 15 rows of the C-matrix supplied with SAGA.
Row | X.sire.x.dam. | ID | M | Aa | Ad | Xa | Xd | Ya | Ca | Ma | Md | AaAa | AaAd | AdAd | XaAa | XaAd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | P1:daughters | 1 | 1 | 1 | 0.0 | 1.00 | 0.00 | 0.0 | 1 | 1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
2 | P1:sons | 2 | 1 | 1 | 0.0 | 1.00 | 0.00 | 1.0 | 1 | 1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
3 | P1:mixed | 3 | 1 | 1 | 0.0 | 1.00 | 0.00 | 0.5 | 1 | 1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
4 | P2:daughters | 4 | 1 | -1 | 0.0 | -1.00 | 0.00 | 0.0 | -1 | -1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
5 | P2:sons | 5 | 1 | -1 | 0.0 | -1.00 | 0.00 | -1.0 | -1 | -1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
6 | P2:mixed | 6 | 1 | -1 | 0.0 | -1.00 | 0.00 | -0.5 | -1 | -1 | 0 | 1 | 0 | 0.00 | 1 | 0.000 |
7 | F1 (P2xP1):daughters | 7 | 1 | 0 | 1.0 | 0.00 | 1.00 | 0.0 | 1 | 1 | 0 | 0 | 0 | 1.00 | 0 | 0.000 |
8 | F1 (P2xP1):sons | 8 | 1 | 0 | 1.0 | 1.00 | 0.00 | -1.0 | 1 | 1 | 0 | 0 | 0 | 1.00 | 0 | 1.000 |
9 | F1 (P2xP1):mixed | 9 | 1 | 0 | 1.0 | 0.50 | 0.50 | -0.5 | 1 | 1 | 0 | 0 | 0 | 1.00 | 0 | 0.500 |
10 | rF1 (P1xP2):daughters | 10 | 1 | 0 | 1.0 | 0.00 | 1.00 | 0.0 | -1 | -1 | 0 | 0 | 0 | 1.00 | 0 | 0.000 |
11 | rF1 (P1xP2):sons | 11 | 1 | 0 | 1.0 | -1.00 | 0.00 | 1.0 | -1 | -1 | 0 | 0 | 0 | 1.00 | 0 | -1.000 |
12 | rF1 (P1xP2):mixed | 12 | 1 | 0 | 1.0 | -0.50 | 0.50 | 0.5 | -1 | -1 | 0 | 0 | 0 | 1.00 | 0 | -0.500 |
13 | F2a (F1xF1):daughters | 13 | 1 | 0 | 0.5 | 0.50 | 0.50 | 0.0 | 1 | 0 | 1 | 0 | 0 | 0.25 | 0 | 0.250 |
14 | F2a (F1xF1):sons | 14 | 1 | 0 | 0.5 | 0.00 | 0.00 | -1.0 | 1 | 0 | 1 | 0 | 0 | 0.25 | 0 | 0.000 |
15 | F2a (F1xF1):mixed | 15 | 1 | 0 | 0.5 | 0.25 | 0.25 | -0.5 | 1 | 0 | 1 | 0 | 0 | 0.25 | 0 | 0.125 |
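To make the role of the C-matrix concrete, the sketch below copies the M, Aa, and Ad coefficients of four mixed-sex cohorts from Table 1 and multiplies them by a set of effect sizes to obtain expected cohort means. The effect sizes are hypothetical values chosen only for illustration, the row labels are abbreviated, and the object names cmat and effects are not part of SAGA.

```r
# Coefficients copied from Table 1 (rows 3, 6, 9 and 15; columns M, Aa, Ad).
cmat <- rbind(
  "P1:mixed"  = c(M = 1, Aa =  1, Ad = 0.0),
  "P2:mixed"  = c(M = 1, Aa = -1, Ad = 0.0),
  "F1:mixed"  = c(M = 1, Aa =  0, Ad = 1.0),
  "F2a:mixed" = c(M = 1, Aa =  0, Ad = 0.5)
)

# Hypothetical effect sizes, for illustration only.
effects <- c(M = 10, Aa = 2, Ad = 1)

# Each expected cohort mean is the coefficient-weighted sum of the effects.
expected <- drop(cmat %*% effects)
expected
```

Under these values the two parents differ by twice the additive effect, and the F2 sits halfway between the midparent mean and the F1 because it carries half of the F1's dominance coefficient.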
Input Data Format
Data that will be analyzed with SAGA should be in a dataframe with three columns:
id of the cohort which should match the appropriate row of the C-matrix above
mean phenotype measure of the cohort
standard error of the cohort's mean phenotype.
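A data frame in this layout might be assembled as in the sketch below, which reuses the first four rows of Table 2. The column names (id, mean, se) and the object name my_crosses are illustrative assumptions; what matters according to the description above is the order and content of the three columns.

```r
# Three columns in order: C-matrix row ID, cohort mean, SE of the cohort mean.
# Values are the first four rows of Table 2.
my_crosses <- data.frame(
  id   = c(3, 6, 9, 12),
  mean = c(33.62500, 42.50000, 44.80000, 37.25000),
  se   = c(2.31407, 4.31774, 6.93830, 8.22977),
  row.names = c("P1", "P2", "F1", "rF1")
)

# If only raw measurements are available, the SE of a cohort mean can be
# computed from them; these replicate values are hypothetical.
raw     <- c(30, 35, 33, 36)
raw_se  <- sd(raw) / sqrt(length(raw))

str(my_crosses)
```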
SAGA comes with several empirical datasets already appropriately formatted. Here we will load data on the number of offspring produced by crosses involving Tribolium castaneum from Tanzania and India [5].
data(per.inf, package="SAGA")
Table 2. per.inf data illustrating the format required for analysis with SAGA.
Cohort | Cohort ID | Mean | SE |
---|---|---|---|
P1 | 3 | 33.62500 | 2.31407 |
P2 | 6 | 42.50000 | 4.31774 |
F1 | 9 | 44.80000 | 6.93830 |
rF1 | 12 | 37.25000 | 8.22977 |
F2a | 15 | 23.85714 | 4.52205 |
F2b | 21 | 25.85714 | 2.88203 |
rF2a | 18 | 33.25000 | 5.46008 |
rF2b | 24 | 24.12500 | 3.28110 |
BC1a | 27 | 35.12500 | 6.18303 |
BC1b | 30 | 54.66667 | 6.55574 |
rBC1a | 33 | 43.50000 | 7.23303 |
rBC1b | 36 | 43.12500 | 5.65824 |
BC2a | 45 | 19.20000 | 4.16413 |
BC2b | 48 | 13.00000 | 2.94958 |
rBC2a | 39 | 47.66667 | 11.06245 |
rBC2b | 42 | 47.66667 | 11.20020 |
Analyze Models
Once the data are prepared as above, we can analyze them with the function AnalyzeCrossesMM. This will return a list of the class "genarch". The list has four elements:
models: a list containing the weighted least squares solution for all models tested.
estimates: a data frame containing Model Weighted Average for each parameter and its unconditional standard error.
daicc: a vector of the delta AICc scores for all models tested.
varimp: a data frame containing the vi scores for composite effects
As SAGA analyzes the data it prints to the terminal the composite effects being tested and its progress through the models, and by default it plots the primary results of the analysis. In this case none of the models tested has a wi greater than 95%, so the plot shows the model averaged parameter estimates from equation 4, with the unconditional standard errors calculated in equation 5 indicated by whiskers on each bar. The colors of the bars reflect the vi calculated in equation 6:
# we will need the plotrix package for plotting
library(plotrix)
results <- SAGA::AnalyzeCrossesMM(per.inf, graph=T, cex.names=.8)
## The composite genetic effects that will be tested are:
## Aa, Ad, Ca, Ma, Md, AaAa, AaAd, AdAd, CaAa, CaAd
##
## Generating Models..........
## 500
## 1000
## AICc weights were used to select the minimum number of models whose weights sum
## to greater than 95% this model set includes 219 model(s)
Now we can load a different dataset to demonstrate what happens when there is less model selection uncertainty. This dataset is from a study of sperm receptacle length measured in crosses between disjunct populations of Drosophila mojavensis [6].
# Sperm receptacle length in Drosophila mojavensis
library(plotrix)
data(SR, package="SAGA")
# Because we are using cohorts where we know the distribution of sexes we set even.sex=T.
results2 <- SAGA::AnalyzeCrossesMM(SR, even.sex=T, graph=T)
## The following composite effects cannot be estimated with the line
## means available because they estimate identical quantities to
## lower order effects:
## Xd, XdAa, XdAd, CaXd
##
## The composite genetic effects that will be tested are:
## Aa, Ad, Xa, Ca, Ma, Md, AaAa, AaAd, AdAd, XaAa, XaAd, CaAa, CaAd, CaXa
##
## Since there are 12910 possible models this may take a bit:
## Generating Models........
## 5000
## 10000
## 2565 models were removed due to high covariances
## or linear relationships between predictor variables.
## The remaining 10345 models have been evaluated.
##
##
## AICc weights were used to select the minimum number of models whose weights sum
## to greater than 95% this model set includes 1 model(s)
In this case a single model has a wi of greater than 97%, and as figure 2 illustrates SAGA has returned estimates based on this single model.
If we want to plot something differently than the default for SAGA we can access the results of the analysis stored in the second element of the genarch object.
# here we extract the 4 largest composite effects found in the first analysis
estimates <- as.numeric(results[[2]][1, c(3, 7, 8, 9)])
names(estimates) <- colnames(results[[2]])[c(3, 7, 8, 9)]
barplot(estimates, main = "Estimate for composite effects",
names.arg = names(estimates))
We can also explore the relative fit of models to our data using the function VisModelSpace. This function plots a box for each model tested and colors it based on its wi. To illustrate the differences in model space we can plot the results of the two analyses stored in results and results2. First let's look at the Tribolium analysis, which indicated a nontrivial level of model selection uncertainty. The results from this analysis are stored in results.
VisModelSpace(results, cex.u=1.6)
This plot shows us that there are a number of models of varying complexity that have very similar Akaike weights, and this dataset highlights why our understanding of the genetic architecture should not be based on any single model.
Next let's create the same plot, but this time for the Drosophila dataset, which indicated very little model selection uncertainty. The results from our analysis of this dataset are stored in results2.
VisModelSpace(results2, cex.u=.4)
SAGA also provides the ability to investigate the results of individual models. For instance, in the case of the first dataset we could use the daicc scores stored in the third element of the genarch object to find the best two models and then plot these using the function EvaluateModel. To illustrate this, let's find the best two models from the Tribolium dataset and plot them side by side to see how they differ.
# first let's find the best two models (smallest delta AICc)
good.models <- order(results[[3]])[1:2]
EvaluateModel(results, good.models[1], cex.names=.7, cex.main=.7)
EvaluateModel(results, good.models[2], cex.names=.7, cex.main=.7)
Figure 6. Estimates conditional on individual models.
Here we can see that the top two models both include 3 composite genetic effects, and in both cases the strongest effect is assigned to autosomal additive by autosomal dominance epistasis. We can also see that the first model includes autosomal dominance by dominance epistasis while in the second this is replaced by simple autosomal dominance.
In reporting the results of line cross analysis experiments we recommend reporting estimates and standard errors from model averaged results unless a single model has greater than 95% wi. It is also important to report vi scores since these give an indication of our certainty that a particular composite genetic effect is important in the genetic architecture of the trait in question.
[1] Mather, K., and J. L. Jinks, 1982 Biometrical genetics: The study of continuous variation. Chapman and Hall, London.
[2] Lynch, M., and B. Walsh, 1998 Genetics and analysis of quantitative traits. Sinauer Associates, Inc., Sunderland, Massachusetts.
[3] Whittingham, M. J., P. A. Stephens, R. B. Bradbury and R. P. Freckleton, 2006 Why do we still use stepwise modelling in ecology and behaviour? J Anim Ecol 75: 1182-1189.
[4] Burnham, K. P., and D. R. Anderson, 2002 Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York.
[5] Demuth, J. P., 2004 Evolution of hybrid incompatibility in the beetle Tribolium castaneum, pp. 152 in Biology. Indiana University, Bloomington.
[6] Miller, G. T., W. T. Starmer and S. Pitnick, 2003 Quantitative genetic analysis of among-population variation in sperm and female sperm-storage organ length in Drosophila mojavensis. Genet Res 81: 213-220.
[7] R Development Core Team, 2013 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.