Table of Contents

1. Introduction

2. Mathematical Approach

3. Installation

4. Fitting LCA models

5. Analyzing Results

6. Citations


1. Introduction

Line cross analysis (LCA), or partitioning the contribution of composite genetic effects (CGEs) to the mean phenotype of cohorts, is widely used to investigate the genetic architecture of traits. This approach uses two parental strains that have diverged in a phenotype of interest. These parents are crossed, producing an F1, and subsequent crosses (e.g. F2, backcross, reciprocals) are made to generate groups that have different combinations of parental genes. We refer to each of these groups as a cohort. Using a weighted least squares regression with weights inversely proportional to the variance of the cohort means, the degree to which a phenotype is determined by different CGEs (e.g. additive, dominance, and epistatic gene action) may be estimated [1, 2]. Traditionally, LCA has been accomplished by a process referred to as the joint-scaling test, which is essentially weighted least squares regression with forward variable selection. However, this approach has a number of documented problems [3]. A full information-theoretic (I-T) approach to model selection and parameter estimation alleviates difficulties associated with previous approaches and provides additional understanding that is not possible under older approaches such as the joint-scaling test [4]. SAGA provides a full I-T approach to LCA that leverages the finite sample size corrected version of the Akaike information criterion (AICc) to explore all possible models and make unbiased and, when appropriate, model averaged estimates of the contribution of CGEs to cohort means. SAGA includes four functions and seven empirical datasets.

Functions: DisplayCmatrix, AnalyzeCrossesMM, VisModelSpace, EvaluateModel

Data:


2. Mathematical Approach

We use the function glm from the base R distribution to perform weighted least squares regression [7]. glm returns the parameter and standard error estimates conditional on the model, as well as the AIC value for the model. We convert AIC to AICc using equation 1, where n is the number of cohorts and K is the number of parameters being estimated.

Equation 1

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2K(K+1)}{n-K-1}$$
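As a rough illustration of this step, the sketch below fits a single candidate model by weighted least squares with glm and then applies the small-sample correction. It uses the five mixed-sex cohorts whose C-matrix rows appear in Table 1 (IDs 3, 6, 9, 12 and 15) together with their means and standard errors from Table 2. The data frame d, the helper aicc(), and the choice to take K from the degrees of freedom that R reports for the fit are assumptions made for illustration; SAGA may count parameters differently internally.

```r
# Mixed-sex cohorts P1, P2, F1, rF1, F2a: Aa coefficients from Table 1,
# means and standard errors from Table 2.
d <- data.frame(
  Aa   = c(1, -1, 0, 0, 0),
  Mean = c(33.62500, 42.50000, 44.80000, 37.25000, 23.85714),
  SE   = c(2.31407, 4.31774, 6.93830, 8.22977, 4.52205)
)

# Small-sample correction: AICc = AIC + 2K(K + 1) / (n - K - 1)
aicc <- function(aic, K, n) aic + (2 * K * (K + 1)) / (n - K - 1)

# Weighted least squares: weights are the inverse variances of the cohort means.
fit <- glm(Mean ~ Aa, data = d, weights = 1 / SE^2)

n <- nrow(d)                    # number of cohorts
K <- attr(logLik(fit), "df")    # assumed count: mean, Aa, residual variance
fit_aicc <- aicc(AIC(fit), K, n)

summary(fit)$coefficients       # estimates and standard errors conditional on this model
fit_aicc
```

With so few cohorts the correction term is large, which is the situation AICc is designed for: it penalizes additional parameters much more heavily than AIC when n is close to K.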

We then calculate AICc differences (delta AICc) using equation 2.

Equation 2

$$\Delta\mathrm{AICc}_i = \mathrm{AICc}_i - \mathrm{AICc}_{\min}$$

Here AICc min is the minimum AICc score calculated across all possible models and AICci is the AICc calculated for a specific model. Delta AICc is used in generating Akaike weights (wi) using equation 3. The denominator in this equation is the summation of the numerator across all possible models being evaluated (R).

Equation 3

$$w_i = \frac{\exp(-\Delta\mathrm{AICc}_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta\mathrm{AICc}_r / 2)}$$
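The two steps above (AICc differences, then Akaike weights) can be sketched in a few lines of base R. The function name akaike_weights and the AICc scores are hypothetical and serve only to show the arithmetic.

```r
# Akaike weights from a vector of AICc scores (equations 2 and 3).
akaike_weights <- function(aicc) {
  delta <- aicc - min(aicc)     # delta AICc for each model
  rel_lik <- exp(-delta / 2)    # numerator of equation 3
  rel_lik / sum(rel_lik)        # divide by the sum over all models
}

# Hypothetical AICc scores for three candidate models
aicc_scores <- c(m1 = 100, m2 = 102, m3 = 110)
w <- akaike_weights(aicc_scores)
round(w, 3)
```

Because only differences in AICc enter the calculation, adding a constant to every score leaves the weights unchanged.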

Under the default settings, if wi of the best model is 0.9 or greater then SAGA will perform parameter estimation under a single model. If no model reaches this threshold then we construct a 95% confidence set of models that contains the minimum number of models whose wi sum to 0.95. To calculate model averaged parameter estimates and unconditional standard errors we recalculate wi for each model performing the summation in the denominator of equation 3 across all models in the confidence set. The model weighted parameter estimates are then calculated using equation 4 where wi is the recalculated model weight and omega hat i is the parameter estimate from the model; the product of these values is summed across all models R in the confidence set.

Equation 4

$$\hat{\bar{\omega}} = \sum_{i=1}^{R} w_i \hat{\omega}_i$$

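Standard error estimates that are unconditional on any one model are calculated using equation 5, where SE(omega hat i) is the standard error of the estimate conditional on model i and the model averaged estimate is taken from equation 4.

Equation 5

$$\widehat{\mathrm{SE}}(\hat{\bar{\omega}}) = \sum_{i=1}^{R} w_i \sqrt{\widehat{\mathrm{SE}}(\hat{\omega}_i)^2 + (\hat{\omega}_i - \hat{\bar{\omega}})^2}$$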
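To make the confidence set and model averaging steps concrete, here is a minimal sketch in base R. The function model_average and all input values are hypothetical, and the unconditional standard error uses the estimator of Burnham and Anderson [4], which we assume is the form intended here. For simplicity every model in the example contains the parameter being averaged.

```r
# Build the confidence set, renormalize the weights, and average one parameter.
# w: Akaike weights, est: conditional estimates, se: conditional standard errors.
model_average <- function(w, est, se, level = 0.95) {
  ord <- order(w, decreasing = TRUE)
  n_keep <- which(cumsum(w[ord]) >= level)[1]   # fewest models reaching the level
  keep <- ord[seq_len(n_keep)]
  w_set <- w[keep] / sum(w[keep])               # weights recalculated within the set
  avg <- sum(w_set * est[keep])                 # model averaged estimate
  use <- sum(w_set * sqrt(se[keep]^2 + (est[keep] - avg)^2))  # unconditional SE
  list(models = keep, weights = w_set, estimate = avg, se = use)
}

# Hypothetical weights, estimates, and standard errors for four models
res <- model_average(w   = c(0.6, 0.3, 0.08, 0.02),
                     est = c(2, 3, 2.5, 1),
                     se  = rep(0.5, 4))
res$models     # the three best models are needed to reach 0.95
res$estimate
res$se
```

The unconditional standard error is never smaller than the weighted mean of the conditional standard errors, because it also carries the spread of the estimates among models.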

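Finally, variable importance vi is calculated for each CGE by summing the wi of all models in which that CGE occurs (Eq. 6). In equation 6, i indexes the CGE, r indexes models, and R_i is the subset of the R models that include CGE i.

Equation 6

$$v_i = \sum_{r \in R_i} w_r$$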
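Variable importance reduces to a weighted column sum over a table recording which CGE occurs in which model. The sketch below uses hypothetical weights and a hypothetical inclusion matrix for three models and three CGEs; whether the sum should run over all models or only the confidence set is left to the reader, since the text above does not say.

```r
# Hypothetical Akaike weights for three models
w <- c(0.5, 0.3, 0.2)

# Rows are models, columns are CGEs; TRUE means the CGE occurs in the model.
inclusion <- matrix(
  c(TRUE,  FALSE, TRUE,
    TRUE,  TRUE,  FALSE,
    FALSE, TRUE,  TRUE),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("m1", "m2", "m3"), c("Aa", "Ad", "AaAd"))
)

# Each row is multiplied by its model weight, then columns are summed.
vi <- colSums(inclusion * w)
vi
```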


3. Installation

A stable, tested version of SAGA is available from the CRAN repository, or the most recent version may be installed from GitHub using the devtools package:

Installing from CRAN

install.packages("SAGA")

Installing from GitHub

library(devtools)
install_github("coleoguy/SAGA", build_vignettes = TRUE)

4. Fitting LCA models

C-matrix of composite genetic effects
The first step in the analysis of line cross data is the choice of a C-matrix that describes the expected contribution of different types of gene action to cohort phenotypes. By default SAGA will use a C-matrix that is scaled to the midparent mean (equivalent to Finf) and includes 23 potential CGEs. For each CGE we have calculated coefficients for 23 potential crosses, each of which is divided into male, female, or mixed-sex cohorts. This C-matrix therefore has 69 rows (23 crosses with 3 sex classes each), and the row numbers are used to identify the cohorts being used in an experiment. The function DisplayCmatrix is available so that we can determine which IDs should be used to identify the cohorts included in an analysis.

# print the C-matrix to the terminal
DisplayCmatrix(table = "MP")

Table 1. The first 15 rows of the C-matrix supplied with SAGA.

   X.sire.x.dam.         ID M Aa  Ad    Xa   Xd   Ya Ca Ma Md AaAa AaAd AdAd XaAa   XaAd
 1 P1:daughters           1 1  1 0.0  1.00 0.00  0.0  1  1  0    1    0 0.00    1  0.000
 2 P1:sons                2 1  1 0.0  1.00 0.00  1.0  1  1  0    1    0 0.00    1  0.000
 3 P1:mixed               3 1  1 0.0  1.00 0.00  0.5  1  1  0    1    0 0.00    1  0.000
 4 P2:daughters           4 1 -1 0.0 -1.00 0.00  0.0 -1 -1  0    1    0 0.00    1  0.000
 5 P2:sons                5 1 -1 0.0 -1.00 0.00 -1.0 -1 -1  0    1    0 0.00    1  0.000
 6 P2:mixed               6 1 -1 0.0 -1.00 0.00 -0.5 -1 -1  0    1    0 0.00    1  0.000
 7 F1 (P2xP1):daughters   7 1  0 1.0  0.00 1.00  0.0  1  1  0    0    0 1.00    0  0.000
 8 F1 (P2xP1):sons        8 1  0 1.0  1.00 0.00 -1.0  1  1  0    0    0 1.00    0  1.000
 9 F1 (P2xP1):mixed       9 1  0 1.0  0.50 0.50 -0.5  1  1  0    0    0 1.00    0  0.500
10 rF1 (P1xP2):daughters 10 1  0 1.0  0.00 1.00  0.0 -1 -1  0    0    0 1.00    0  0.000
11 rF1 (P1xP2):sons      11 1  0 1.0 -1.00 0.00  1.0 -1 -1  0    0    0 1.00    0 -1.000
12 rF1 (P1xP2):mixed     12 1  0 1.0 -0.50 0.50  0.5 -1 -1  0    0    0 1.00    0 -0.500
13 F2a (F1xF1):daughters 13 1  0 0.5  0.50 0.50  0.0  1  0  1    0    0 0.25    0  0.250
14 F2a (F1xF1):sons      14 1  0 0.5  0.00 0.00 -1.0  1  0  1    0    0 0.25    0  0.000
15 F2a (F1xF1):mixed     15 1  0 0.5  0.25 0.25 -0.5  1  0  1    0    0 0.25    0  0.125
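The rows shown in Table 1 follow a regular pattern: each cross occupies three consecutive rows, ordered daughters, sons, mixed. The helper below, cohort_id, is a hypothetical convenience function (not part of SAGA) that reproduces the IDs for the five crosses visible in Table 1 only. For any other cohort, the output of DisplayCmatrix remains the reference.

```r
crosses <- c("P1", "P2", "F1", "rF1", "F2a")   # order of the crosses in Table 1
sexes   <- c("daughters", "sons", "mixed")      # order of rows within each cross

# Row number (cohort ID) for a cross and sex class, for the rows of Table 1 only.
cohort_id <- function(cross, sex) {
  k <- match(cross, crosses)
  s <- match(sex, sexes)
  if (is.na(k) || is.na(s)) stop("cross or sex not covered by this sketch")
  3L * (k - 1L) + s
}

cohort_id("F1", "mixed")   # the mixed-sex F1 cohort
```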

Input Data Format

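Data that will be analyzed with SAGA should be in a data frame with three columns: the cohort ID (the row number of the cohort in the C-matrix), the cohort mean, and the standard error of the cohort mean (see Table 2).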

SAGA comes with several empirical datasets already appropriately formatted. Here we will load data on the number of offspring produced by crosses involving Tribolium castaneum from Tanzania and India [5].

data(per.inf, package="SAGA")

Table 2. per.inf data illustrating the format required for analysis with SAGA.

      Cohort ID     Mean       SE
P1            3 33.62500  2.31407
P2            6 42.50000  4.31774
F1            9 44.80000  6.93830
rF1          12 37.25000  8.22977
F2a          15 23.85714  4.52205
F2b          21 25.85714  2.88203
rF2a         18 33.25000  5.46008
rF2b         24 24.12500  3.28110
BC1a         27 35.12500  6.18303
BC1b         30 54.66667  6.55574
rBC1a        33 43.50000  7.23303
rBC1b        36 43.12500  5.65824
BC2a         45 19.20000  4.16413
BC2b         48 13.00000  2.94958
rBC2a        39 47.66667 11.06245
rBC2b        42 47.66667 11.20020
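For our own data, a data frame in this layout can be assembled by hand. The sketch below copies the first three rows of Table 2; the object name my.data and the column names ID, Mean and SE are assumptions, so it is worth comparing the result against str(per.inf) before running an analysis.

```r
# Three cohorts copied from Table 2: cohort ID, cohort mean, standard error.
my.data <- data.frame(
  ID   = c(3, 6, 9),
  Mean = c(33.62500, 42.50000, 44.80000),
  SE   = c(2.31407, 4.31774, 6.93830),
  row.names = c("P1", "P2", "F1")
)

# Basic sanity checks: three columns, and strictly positive standard errors
# (they are inverted to form the regression weights).
stopifnot(ncol(my.data) == 3, all(my.data$SE > 0))
str(my.data)
```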

Analyze Models

Once data is prepared as above we can analyze it with the function AnalyzeCrossesMM. This will return a list of the class “genarch”. The list has four elements:

As SAGA analyzes the data, it prints the composite effects being tested and its progress in analyzing models to the terminal and, by default, plots the primary results of the analysis. In this case none of the models tested has a wi greater than 95%, so the plot shows the model averaged parameter estimates from equation 4, with the unconditional standard errors calculated in equation 5 indicated by whiskers on each bar. The colors of the bars reflect the vi calculated in equation 6.

# we will need the plotrix package for plotting
library(plotrix)
results <- SAGA::AnalyzeCrossesMM(per.inf, graph=TRUE, cex.names=0.8)
## The composite genetic effects that will be tested are: 
##  Aa, Ad, Ca, Ma, Md, AaAa, AaAd, AdAd, CaAa, CaAd 
## 
## Generating Models..........
##  500
##  1000
## AICc weights were used to select the minimum number of models whose weights sum 
## to greater than 95% this model set includes 219 model(s)

__Figure 1.__ Model averaged estimate of genetic architecture.

Now we can load a different dataset to demonstrate what happens when there is less model selection uncertainty. This dataset is from a study of sperm receptacle length measured in crosses between disjunct populations of Drosophila mojavensis [6].

# Sperm receptacle length in Drosophila mojavensis
library(plotrix)
data(SR, package="SAGA")
# Because we are using cohorts where we know the distribution of sexes we set even.sex=TRUE.
results2 <- SAGA::AnalyzeCrossesMM(SR, even.sex=TRUE, graph=TRUE)
## The following composite effects cannot be estimated with the line 
## means available because they estimate identical quantities to 
## lower order effects: 
## Xd, XdAa, XdAd, CaXd
## 
## The composite genetic effects that will be tested are: 
##  Aa, Ad, Xa, Ca, Ma, Md, AaAa, AaAd, AdAd, XaAa, XaAd, CaAa, CaAd, CaXa 
## 
## Since there are 12910 possible models this may take a bit:
## Generating Models........
##  5000
##  10000
## 2565 models were removed due to high covariances 
## or linear relationships between predictor variables.  
## The remaining 10345 models have been evaluated.
## 
## 
## AICc weights were used to select the minimum number of models whose weights sum 
## to greater than 95% this model set includes 1 model(s)

__Figure 3.__ Conditional estimate of genetic architecture.

In this case a single model has a wi of greater than 97%, and as Figure 3 illustrates, SAGA has returned estimates based on this single model.


5. Analyzing Results

If we want to plot something differently than the default for SAGA we can access the results of the analysis stored in the second element of the genarch object.

    # here we extract the 4 largest composite effects found in the first analysis
    estimates <- as.numeric(results[[2]][1, c(3, 7, 8, 9)])
    names(estimates) <- colnames(results[[2]])[c(3, 7, 8, 9)]
    barplot(estimates, main = "Estimate for composite effects",
            names.arg = names(estimates))

__Figure 2.__ Subset of model averaged estimate of genetic architecture.

We can also explore the relative fit of models to our data using the function VisModelSpace. This function plots a box for each model tested and colors it based on its wi. To illustrate the differences in model space we can plot the results of the two analyses stored in results and results2. First let's look at the Tribolium analysis, which indicated a nontrivial level of model selection uncertainty. The results from this analysis are stored in results.

VisModelSpace(results, cex.u=1.6)

__Figure 4.__ Distribution of Akaike weights across model space for Tribolium dataset.

This plot shows us that there are a number of models of varying complexity that have very similar Akaike weights. This dataset highlights why our understanding of the genetic architecture should not be based on any single model.

Next let's create the same plot, but this time for the Drosophila dataset, which indicated very little model selection uncertainty. The results from our analysis of this dataset are stored in results2.

VisModelSpace(results2, cex.u=.4)

__Figure 5.__ Distribution of Akaike weights across model space for Drosophila dataset.

SAGA also provides the ability to investigate the results of individual models. For instance, in the case of the first dataset we could use the daicc scores stored in the third element of the genarch object to find the best two models and then plot these using the function EvaluateModel. To illustrate this, let's find the best two models from the Tribolium dataset and plot them side by side to see how they differ.

# first let's find the best two models (lowest delta AICc)
good.models <- order(results[[3]])[1:2]
EvaluateModel(results, good.models[1], cex.names=.7, cex.main=.7)
EvaluateModel(results, good.models[2], cex.names=.7, cex.main=.7)


__Figure 6.__ Estimates conditional on individual models.

Here we can see that the top two models both include 3 composite genetic effects, and in both cases the strongest effect is assigned to autosomal additive by autosomal dominance epistasis. We can also see that the first model includes autosomal dominance by dominance epistasis while in the second this is replaced by simple autosomal dominance.

In reporting the results of line cross analysis experiments we recommend reporting estimates and standard errors from model averaged results unless a single model has greater than 95% wi. It is also important to report vi scores since these give an indication of our certainty that a particular composite genetic effect is important in the genetic architecture of the trait in question.


6. Citations

[1] Mather, K., and J. L. Jinks, 1982 Biometrical genetics: The study of continuous variation. Chapman and Hall, London.

[2] Lynch, M., and B. Walsh, 1998 Genetics and analysis of quantitative traits. Sinauer Associates, Inc., Sunderland, Massachusetts.

[3] Whittingham, M. J., P. A. Stephens, R. B. Bradbury and R. P. Freckleton, 2006 Why do we still use stepwise modelling in ecology and behaviour? J Anim Ecol 75: 1182-1189.

[4] Burnham, K. P., and D. R. Anderson, 2002 Model selection and multimodel inference: a practical information-theoretic approach. Springer, New York.

[5] Demuth, J. P., 2004 Evolution of hybrid incompatibility in the beetle Tribolium castaneum, pp. 152 in Biology. Indiana University, Bloomington.

[6] Miller, G. T., W. T. Starmer and S. Pitnick, 2003 Quantitative genetic analysis of among-population variation in sperm and female sperm-storage organ length in Drosophila mojavensis. Genetical Research 81: 213-220.

[7] R Development Core Team, 2013 R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.