This is a simple introductory vignette to the MAPITR package. In this introductory example, we will use a simulated dataset to run MAPITR and explore some basic output. The dataset is provided with the package.

Running MAPITR

For this introductory vignette, we will be using MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, and MAPITR_SimData_PCs. These are simulated datasets created for the purposes of this vignette. To load MAPITR and the datasets, run the following:

library("MAPITR")

data(MAPITR_SimData_Phenotype, MAPITR_SimData_Genotypes, MAPITR_SimData_Pathways, MAPITR_SimData_PCs)

First, we'll take a look at the formats of our simulated input data:

MAPITR_SimData_Genotypes[1:10,1:10]
##    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1   1  2  0  1  1  0  1  0  0   2
## 2   2  0  1  2  1  2  1  1  1   2
## 3   0  0  1  2  1  1  1  1  1   0
## 4   1  2  1  0  1  2  2  0  2   1
## 5   1  0  1  2  0  0  2  2  1   1
## 6   2  1  1  1  2  0  0  0  2   1
## 7   0  1  1  1  1  1  0  1  0   0
## 8   1  1  1  1  1  2  0  0  2   1
## 9   0  2  1  0  2  2  2  1  2   0
## 10  2  2  0  2  1  0  0  1  1   0

Here you should see the first ten rows and columns of our genotype matrix. The genotype matrix is n x p, so the rows are our n individuals and the columns are our p SNPs. SNP genotypes, the entries of each column, are assumed to be 0, 1, or 2 – so note, diploid information is assumed to have been collapsed into a single value for each SNP. For example, the typical diploid setup of 0/0, 0/1, and 1/1, which represents the three possible combinations a given pair of reference (0) and alternative (1) alleles can have across both chromosomal copies in an individual, is now instead represented as 0 (0/0), 1 (0/1), or 2 (1/1). One way to derive a genotype matrix with this type of encoding is to use PLINK's --recodeAD output option and remove all the *_HET columns afterwards.

Also note that the genotype matrix cannot have any missing genotypes (ie 0% rate of genotype missingness). This is necessary due to how the genetic relatedness matrices (GRMs), required components for the MAPITR variance component model, are constructed. For each entry in the genotype matrix, the only acceptable values are 0, 1, and 2.

head(MAPITR_SimData_Phenotype)
##           V1
## 1 -0.4518125
## 2 -0.4628268
## 3  0.4655829
## 4 -0.3667877
## 5  0.5072770
## 6 -0.0278133

Here you should see the first ten values of our input phenotype. MAPITR only runs on one phenotype at a time currently, so the main function only requires a single vector of phenotypic values as input.

head(MAPITR_SimData_Pathways)
##         V1
## 1 Pathway1
## 2 Pathway2
## 3 Pathway3
## 4 Pathway4
## 5 Pathway5
##                                                                                                                                                                                                                                                                                                                                                                                                          V2
## 1          663,259,220,149,567,514,624,661,14,142,476,394,711,68,565,138,326,689,665,490,674,360,536,515,521,147,172,459,465,240,435,317,703,643,316,503,359,492,534,177,38,717,7,146,28,496,430,370,89,135,96,200,518,441,57,383,500,315,630,171,302,523,437,418,384,25,82,284,247,444,213,362,429,653,740,527,745,389,342,162,81,153,163,563,325,634,616,633,116,434,337,594,405,367,622,548,228,51,78,36
## 2            649,232,196,550,455,727,457,290,365,184,176,555,454,420,59,377,556,245,543,298,266,629,233,283,437,367,237,351,328,182,456,49,742,171,424,558,494,165,427,686,728,253,662,641,174,10,381,199,294,459,387,687,32,297,678,30,55,27,142,39,87,694,160,252,270,552,128,11,206,732,91,545,413,617,132,219,306,511,661,522,688,636,280,4,58,105,26,483,704,682,595,708,500,46,498,744,519,47,325,339
## 3 311,750,370,280,240,551,146,107,425,582,391,643,507,134,660,349,683,305,430,296,639,414,510,463,110,396,18,353,154,288,451,676,651,401,302,725,516,552,276,377,571,71,468,424,435,259,84,498,350,329,187,91,323,713,557,696,330,320,304,174,429,658,131,732,648,337,408,100,244,400,104,432,623,630,485,169,540,80,471,23,520,261,334,410,197,192,355,398,217,325,533,178,326,542,549,181,439,625,592,493
## 4           17,242,560,104,96,95,406,191,266,411,230,320,311,545,591,589,386,408,227,45,616,462,414,596,196,538,226,623,82,15,66,442,500,232,173,435,564,144,693,667,608,29,563,11,28,100,61,669,463,369,221,516,267,629,401,354,138,180,670,576,33,514,429,622,148,651,277,107,535,729,181,436,733,128,119,198,206,460,372,709,485,335,428,171,495,203,554,459,42,642,351,660,605,127,14,53,416,137,41,510
## 5             41,553,729,696,322,38,458,684,210,450,411,457,94,185,297,494,206,104,577,539,251,274,141,8,481,124,743,734,40,45,631,705,595,397,306,440,342,525,739,575,167,360,611,348,189,50,282,357,530,133,582,464,142,722,615,151,218,288,376,154,620,727,418,541,569,265,139,284,109,533,459,605,88,136,312,359,334,82,340,486,316,66,9,391,742,404,690,91,71,456,623,536,442,273,15,329,663,373,603,5

Here you should see a matrix with two columns. The first column should list the names of each pathway being analyzed. The second column should be a list of comma-separated column indices, representing the set of SNPs belonging to each associated pathway. These indices will be used by MAPITR to extract the proper SNPs per pathway from the input genotype matrix to conduct the marginal epistasis analysis. Note that the values for this second column are not the SNP IDs (eg rsID#s) or column headers (if your genotype matrix has them), they are the specific column locations in the input genotype matrix for each SNP.

Now to run MAPITR, we use the following code:

MAPITR_Output <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, cores=1)

(note that cores=1 is included only for CRAN specifications and does not need to be included by default)

And to see the output MAPITR() provides, we can run the following:

head(MAPITR_Output$Results)
##      Pathways    pValues        Est        PVE
## [1,]        1 0.00511314 1.09663702 1.32243728
## [2,]        2 0.31118083 0.40971253 0.45740085
## [3,]        3 0.33131510 0.36199437 0.43933281
## [4,]        4 0.96134690 0.01706105 0.02017861
## [5,]        5 0.94181499 0.02552678 0.03176702

The output shown from MAPITR_Output$Results should be a matrix with four columns. The first column should be the list of pathway names being analyzed (same order as input file), the second column should be the MAPITR p-values for each pathway, the third column should be the estimate of the epistatic GRM variance component (G in the model) for each pathway, and the fourth column should be the proportion of variance explained (PVE) for the phenotype by each pathway. Note that PVE estimates here are for pathways that generally contain overlapping variants – this impacts our interpretation of PVE. When single variant models are used in complex trait analysis, which is often the case, PVE can theoretically sum to 1 across all variants. A priori it is unclear what PVE should theoretically sum to across multiple pathways due to this common scenario where variants are shared across pathway definitions.

In this example, there should be one pathway that has a p-value below the Bonferonni-corrected threshold of .01 (.05 / 5, the number of pathways tested):

MAPITR_Output$Results[MAPITR_Output$Results[,2] < .01,]
##   Pathways    pValues        Est        PVE 
## 1.00000000 0.00511314 1.09663702 1.32243728

Also note this is a simulated dataset made to run quickly for this vignette – you will very likely need many more individuals and SNPs to identify an epistatic signal, and PVE should never be > 1 in real data.

Lastly, we will go over how one can introduce a covariate to this test (on both sides of the model) if desired. However, we recommend that most covariates should be directly regressed out from the phenotype prior to MAPITR analysis. To include a covariate as part of MAPITR though, it is incorporated as part of the orthogonality projection matrix, H, used in equation 3 of the MAPITR manuscript. This projection matrix is applied to both sides of the model, meaning it is applied to both the phenotype and the genotypes, and rotates both spaces to be orthogonal to the components of H. By default a column of ones is included in this matrix to act as a y-intercept term. However, any additional covariates can be appended to this matrix. We currently choose to include top principal components as a part of this matrix.

To include the top ten principal components of our simulated dataset into the MAPITR analysis, we can run the following:

MAPITR_SimData_PCs[1:5,1:10]
##             PC1        PC2        PC3        PC4        PC5        PC6
## [1,]  0.1127320 -0.4486358  0.6656510 -3.0188042 1.54672672 -0.9981512
## [2,]  1.9343189 -0.2892368 -1.3136277  0.4662708 0.04351503 -0.1170024
## [3,]  1.9908207 -4.1118289  0.5600524 -0.4923170 1.26330957  0.4864907
## [4,] -0.1753511 -0.8886352 -1.2132968  1.0306922 0.26797451 -2.0537700
## [5,] -1.4277754 -1.2106821  0.5611386 -0.9433904 1.01580351 -3.1516246
##             PC7        PC8        PC9      PC10
## [1,]  0.2063473 -0.8389474  1.3046610 1.8209930
## [2,] -0.2457400  0.5114055  2.1700065 0.9436350
## [3,]  2.0332192  0.9221459  1.7075565 0.4946276
## [4,]  1.0631471  0.9807124 -0.6004786 0.1891073
## [5,]  0.7364327 -0.3032539 -1.3603440 2.8166732
MAPITR_Output2 <- MAPITR(MAPITR_SimData_Genotypes, MAPITR_SimData_Phenotype, MAPITR_SimData_Pathways, Covariates=MAPITR_SimData_PCs, cores=1)
MAPITR_Output2$Results[MAPITR_Output2$Results[,2] < .01,]
##    Pathways     pValues         Est         PVE 
## 1.000000000 0.002208826 1.212190616 1.479454779

(note that cores=1 is included only for CRAN specifications and does not need to be included by default)

And after doing so, we find that including top principal components does further help refine our epistatic signal in this simulated dataset.

For further details on the MAPITR model, please see the associated manuscript.