This package provides functions to analyse de novo genetic variants using the statistical framework described in Samocha et al (2014) Nature Genetics 10.1038/ng.3050.
This vignette demonstrates the usage of the denovolyzeR
package to recapitulate the analyses dectribed in this paper.
# Install the package if you haven't already.
# OPTION 1 - install the latest release from CRAN:
# install.packages("denovolyzeR")
# OPTION 2 - install the latest development version from GitHub. Either download and install, or use devtools:
# if(!"devtools" %in% installed.packages()){
# install.packages("devtools")
# }
# devtools::install_github("jamesware/denovolyzeR")
We start with a table of de novo variants. An example dataset is provided:
library(denovolyzeR)
# have a look at the example data:
dim(autismDeNovos)
## [1] 1040 2
head(autismDeNovos)
## gene class
## 1 BCORL1 mis
## 2 SPANXD mis
## 3 GLRA2 mis
## 4 RPS6KA3 non
## 5 TSR2 mis
## 6 GNL3L syn
First we want to know whether there are more de novos than expected, using the denovolyzeByClass()
function. These variants were obtained by sequencing 1,078 cases, so we use nsamples=1078
.
denovolyzeByClass(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078)
## class observed expected enrichment pValue
## 1 syn 254 302.3 0.840 0.998000
## 2 mis 655 679.0 0.965 0.826000
## 3 lof 131 94.3 1.390 0.000199
## 4 prot 786 773.2 1.020 0.328000
## 5 all 1040 1075.5 0.967 0.864000
The total number of de novos is almost exactly as our model predicts. However, we see a statistically significant excess of LOF variants in this population.
Next, we look to see if the total number of genes that contain more than one de novo is greater than expected, using the denovolyzeMultiHits()
function.
denovolyzeMultiHits(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078)
## obs expMean expMax pValue nVars
## syn 3 3.4 8 0.67 254
## mis 31 20.2 31 0.01 655
## lof 5 1.0 4 0.00 131
## prot 43 27.8 41 0.00 786
## all 66 47.8 64 0.00 1040
obs = the number of genes in our dataset with >1 de novo variant
expMean = the expected number of genes containing >1 de novo: an average obtained by permutation
expMax = the maximum number of genes containing >1 de novo in nperms
permutations (default nperms=100
)
pValue = an empirical p value
nVars = the total number of de novo variants in each class
Note that the number of observed genes with >1 protein-altering variant does not equal the number of genes with >1 lof + number of genes with >1 missense, as genes containing 1 lof + 1 missense will only be counted as “multihits” in the combined analysis.
Here it looks like there may be an excess of genes with >1 lof variant, >1 missense, and >1 protein-altering variant. We will want to increase the number if permutations here to get a handle on our level of significance.
denovolyzeMultiHits(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078,
nperms=1000)
## obs expMean expMax pValue nVars
## syn 3 3.3 10 0.646 254
## mis 31 20.1 34 0.009 655
## lof 5 1.1 5 0.004 131
## prot 43 28.0 43 0.001 786
## all 66 47.6 70 0.004 1040
There is another important option here. The expected number of genes containing >1 hit is obtained by permutation: Given n de novo variants, how many genes contain >1 de novo? There are two options for selecting n: by default it is derived from your data: e.g. in the example above autismDeNovos contains 131 lof variants, so this is the number used in the permutation. This is controlled by the default parameter nVars="actual"
sum(autismDeNovos$class %in% c("frameshift","non","splice"))
## [1] 131
This is a conservative approach, addressing the question: “given the number of variants in our dataset, do we see more genes with >1 variant than expected?”
An alternative approach simply asks whether there are more genes with >1 variant than our de novo model predicts. This is accessed by setting nVars="expected"
.
denovolyzeMultiHits(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078,
nperms=1000,
nVars="expected")
## obs expMean expMax pValue nVars
## syn 3 4.6 12 0.844 302.27646
## mis 31 21.4 35 0.021 678.98816
## lof 5 0.6 4 0.000 94.26139
## prot 43 27.2 47 0.001 773.24955
## all 66 50.6 69 0.010 1075.52601
We see 6 genes containing >1 de novo lof variant. This is more than expected, but are any of these genes individually significant? We can denovolyzeByGene()
to find out.
By default this function compares the number of LOF variants against expectation for each gene, and then the total number of protein-altering variants (LOF + missense). It can also be configured to return other classes if relevant.
head(
denovolyzeByGene(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078)
)
## lof.observed lof.expected lof.pValue prot.observed prot.expected
## DYRK1A 3 0 2.69e-08 3 0.1
## SCN2A 3 0 1.83e-06 5 0.1
## CHD8 3 0 7.19e-07 4 0.2
## RFX8 0 0 1.00e+00 2 0.0
## SUV420H1 1 0 6.37e-03 3 0.1
## POGZ 2 0 1.23e-04 2 0.1
## prot.pValue
## DYRK1A 2.77e-05
## SCN2A 3.70e-07
## CHD8 2.44e-05
## RFX8 2.34e-05
## SUV420H1 3.17e-05
## POGZ 5.07e-03
Several genes meet statistical significance after correcting for multiple testing. Default options apply two tests across 0 genes, so a Bonferroni corrected p-value threshold at \(\alpha\) = 0.05 would be .
The analyses presented so far have been exome-wide. It may be appropriate to restrict analyses to a geneset of interest - for example, it may be relevant to examine the burden of de novo variation in a pathway of interest, or initial variant detection may have been restricted to a set of candidate genes (rather than whole exome sequencing). All of the above funtions can be targeted to a subset of genes using the includeGenes
argument.
The package includes as an example a list of 837 genes that interact with the fragile X mental retardation protein (FMRP). Is this geneset enriched for de novos, and recurrent de novos, in our autism trios?
nrow(fmrpGenes); head(fmrpGenes)
## [1] 837
## ensgID enstID hgncID hgncSymbol geneName
## 1 ENSG00000142599 ENST00000337907 9965 RERE RERE
## 2 ENSG00000149527 ENST00000449969 29037 PLCH2 PLCH2
## 3 ENSG00000078369 ENST00000378609 4396 GNB1 GNB1
## 4 ENSG00000157933 ENST00000378536 10896 SKI SKI
## 5 ENSG00000171735 ENST00000303635 18806 CAMTA1 CAMTA1
## 6 ENSG00000188157 ENST00000379370 329 AGRN AGRN
denovolyzeByClass(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078,
includeGenes=fmrpGenes$geneName)
## class observed expected enrichment pValue
## 1 syn 28 33.6 0.835 8.53e-01
## 2 mis 83 74.4 1.110 1.74e-01
## 3 lof 32 9.1 3.500 3.18e-09
## 4 prot 115 83.6 1.380 6.47e-04
## 5 all 143 117.1 1.220 1.13e-02
denovolyzeMultiHits(genes=autismDeNovos$gene,
classes=autismDeNovos$class,
nsamples=1078,
nperms=1000,
includeGenes=fmrpGenes$geneName)
## Warning in denovolyzeMultiHits(genes = autismDeNovos$gene, classes =
## autismDeNovos$class, : De novo list includes 897 genes not specified for
## inclusion. These will not be analysed.
## obs expMean expMax pValue nVars
## syn 0 0.6 4 1.000 28
## mis 9 5.1 12 0.062 83
## lof 3 1.0 5 0.061 32
## prot 15 9.5 19 0.022 115
## all 20 14.2 24 0.039 143
viewProbabilityTable
provides access to the underlying de novo probability tables used to calculate expected de novo burdens throughout this package:
head(
viewProbabilityTable()
)
## hgncID hgncSymbol enstID ensgID geneName syn
## 1 5 A1BG ENST00000263100 ENSG00000121410 A1BG 8.997970e-06
## 2 7 A2M ENST00000318602 ENSG00000175899 A2M 1.543159e-05
## 3 16 SERPINA3 ENST00000467132 ENSG00000196136 SERPINA3 5.694582e-06
## 4 17 AADAC ENST00000232892 ENSG00000114771 AADAC 4.252483e-06
## 5 18 AAMP ENST00000248450 ENSG00000127837 AAMP 6.496774e-06
## 6 19 AANAT ENST00000250615 ENSG00000129673 AANAT 3.530488e-06
## mis non splice frameshift lof
## 1 1.738961e-05 5.763794e-07 2.639868e-07 6.532817e-07 1.493648e-06
## 2 3.545894e-05 1.960148e-06 1.477263e-06 4.616751e-07 3.899086e-06
## 3 1.176919e-05 4.433874e-07 1.387157e-07 5.276555e-07 1.109759e-06
## 4 1.018458e-05 5.312742e-07 1.748982e-07 1.051152e-06 1.757324e-06
## 5 1.313861e-05 5.042914e-07 4.247556e-07 3.344019e-06 4.273066e-06
## 6 7.729807e-06 1.707018e-07 9.988864e-08 4.132204e-07 6.838108e-07
## prot all
## 1 1.888326e-05 2.788123e-05
## 2 3.935803e-05 5.478962e-05
## 3 1.287895e-05 1.857353e-05
## 4 1.194191e-05 1.619439e-05
## 5 1.741167e-05 2.390845e-05
## 6 8.413618e-06 1.194411e-05
head(
viewProbabilityTable(format="long")
)
## hgncID hgncSymbol enstID ensgID geneName class
## 1 658 ARF5 ENST00000000233 ENSG00000004059 ARF5 syn
## 2 6752 M6PR ENST00000000412 ENSG00000003056 M6PR syn
## 3 3720 FKBP4 ENST00000001008 ENSG00000004478 FKBP4 syn
## 4 20581 CYP26B1 ENST00000001146 ENSG00000003137 CYP26B1 syn
## 5 28816 NDUFAF7 ENST00000002125 ENSG00000003509 NDUFAF7 syn
## 6 4008 FUCA2 ENST00000002165 ENSG00000001036 FUCA2 syn
## value
## 1 2.704295e-06
## 2 2.877349e-06
## 3 5.608786e-06
## 4 1.013847e-05
## 5 5.079037e-06
## 6 5.445799e-06
Most of the core functionality of this package is contained in the denovolyze
function. The denovolyzeByClass
and denovolyzeByGene
functions used in this vignette are convenience functions that set defaults appropriate to the most common usages of this function. Full details of additional options and default behaviours are available using ?denovolyze
.