We present a package for estimating cis-eQTL effect sizes using a new model, ACME, which respects the biological understanding of cis-eQTL action. The model combines an additive effect of allele count with multiplicative random noise (hence “ACME”: Additive-Contribution, Multiplicative-Error) and is defined as
\[y_i = \log(\beta_0 + \beta_1 s_i) + Z_i^T \gamma + \epsilon_i\]
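where \(y_i\) is the log-transformed expression of the gene in sample \(i\), \(s_i\) is the allele count of the SNP, \(Z_i\) is the vector of covariates, \(\epsilon_i\) is the random noise, and \(\beta_0\), \(\beta_1\), and \(\gamma\) are the parameters to be estimated.
We estimate the model using a fast iterative algorithm.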
The algorithm estimates the model in an equivalent form that is nonlinear only with respect to \(\eta = \beta_1 / \beta_0\):
\[y_i = \log(1 + s_i \eta) + \log(\beta_0) + Z_i^T \gamma + \epsilon_i\]
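The step between the two forms of the model can be spelled out as follows, assuming \(\beta_0 > 0\) so that the logarithm can be split:
\[\log(\beta_0 + \beta_1 s_i) = \log\left(\beta_0 \left(1 + \frac{\beta_1}{\beta_0} s_i\right)\right) = \log(\beta_0) + \log(1 + s_i \eta)\]
For any fixed value of \(\eta\), the remaining model is an ordinary linear regression of \(y_i - \log(1 + s_i \eta)\) on an intercept and the covariates:
\[y_i - \log(1 + s_i \eta) = \log(\beta_0) + Z_i^T \gamma + \epsilon_i\]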
ACMEeqtl can be installed with the following command.
install.packages('ACMEeqtl')
The ACMEeqtl package provides functions for analyzing a single gene-SNP pair as well as for fast parallel testing of all local gene-SNP pairs.
library(ACMEeqtl)
library(pander)  # pander() is used below to print the result tables
First we generate sample gene expression, SNP allele counts, and a set of covariates.
# Model parameters
beta0 = 10000
beta1 = 50000
# Data dimensions
nsample = 1000
ncvrt = 19
# Data generation
# Zero-average covariates
cvrt = matrix(rnorm(nsample * ncvrt), nsample, ncvrt)
cvrt = t(t(cvrt) - colMeans(cvrt))
# Generate SNPs
s = rbinom(n = nsample, size = 2, prob = 0.2)
# Generate log-normalized expression
y = log(beta0 + beta1 * s) +
    cvrt %*% rnorm(ncvrt) +
    rnorm(nsample)
We provide two equivalent functions for model estimation.
effectSizeEstimationR: fully coded in R.
effectSizeEstimationC: faster version with the core coded in C.
z1 = effectSizeEstimationR(s, y, cvrt)
## Warning in beta_cur * x: Recycling array of length 1 in array-vector arithmetic is deprecated.
## Use c() or as.vector() instead.
z2 = effectSizeEstimationC(s, y, cvrt)
pander(rbind(z1,z2))
| | beta0 | beta1 | nits | SSE | SST | F | eta | SE_eta |
|---|---|---|---|---|---|---|---|---|
| z1 | 9196 | 54704 | 6 | 1049 | 1925 | 818 | 5.95 | 0.483 |
| z2 | 9196 | 54704 | 6 | 1049 | 1925 | 818 | 5.95 | 0.483 |
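Variables z1 and z2 show that the estimation was done in 6 iterations, with estimated parameters \(\hat\beta_0 = 9196\), \(\hat\beta_1 = 54704\), and \(\hat\eta = \hat\beta_1 / \hat\beta_0 = 5.95\). The values used to generate the data were \(\beta_0 = 10000\), \(\beta_1 = 50000\), and hence \(\eta = 5\).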
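Because the model is linear once \(\eta\) is fixed, the estimates can be cross-checked with a few lines of base R. The following is a minimal sketch under stated assumptions, not the algorithm implemented in effectSizeEstimationR or effectSizeEstimationC: it profiles the residual sum of squares over \(\eta\) with `optimize`. The helper `profile_sse`, the seed, and the search interval are illustrative choices that do not come from the package.

```r
# Sketch only: profile least-squares fit of the ACME model in base R.
set.seed(1)
nsample = 1000
ncvrt = 19
beta0 = 10000
beta1 = 50000   # true eta = beta1 / beta0 = 5

# Simulate data the same way as in the example above
cvrt = matrix(rnorm(nsample * ncvrt), nsample, ncvrt)
cvrt = t(t(cvrt) - colMeans(cvrt))
s = rbinom(n = nsample, size = 2, prob = 0.2)
y = log(beta0 + beta1 * s) + drop(cvrt %*% rnorm(ncvrt)) + rnorm(nsample)

# Linear part of the model: intercept (log(beta0)) and covariates
X = cbind(1, cvrt)

# Hypothetical helper: for a fixed eta, fit the linear part by least
# squares and return the residual sum of squares
profile_sse = function(eta) {
    fit = lm.fit(X, y - log(1 + eta * s))
    sum(fit$residuals^2)
}

# eta > -0.5 keeps 1 + eta * s positive for s in {0, 1, 2};
# the upper bound of 50 is an arbitrary assumption
opt = optimize(profile_sse, interval = c(-0.49, 50))
eta_hat = opt$minimum

# Recover beta0 and beta1 from the linear fit at the selected eta
fit = lm.fit(X, y - log(1 + eta_hat * s))
beta0_hat = unname(exp(fit$coefficients[1]))
beta1_hat = eta_hat * beta0_hat

print(c(beta0 = beta0_hat, beta1 = beta1_hat, eta = eta_hat, SSE = opt$objective))
```

The printed values should be close to the parameters used to simulate the data. They will not match the table above exactly, since the random draws differ.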
To test all local gene-SNP pairs, we first generate an artificial eQTL dataset in filematrix format (see the filematrix package).
tempdirectory = tempdir()
# tempdirectory = "~/Desktop/package_tests"
z = create_artificial_data(
    nsample = 100,
    ngene = 500,
    nsnp = 5000,
    ncvrt = 1,
    minMAF = 0.2,
    saveDir = tempdirectory,
    returnData = FALSE,
    savefmat = TRUE,
    savetxt = FALSE,
    verbose = FALSE)
In this example, we use 2 CPU cores (threads) for testing of all gene-SNP pairs within 100,000 bp.
multithreadACME(
    genefm = "gene",
    snpsfm = "snps",
    glocfm = "gene_loc",
    slocfm = "snps_loc",
    cvrtfm = "cvrt",
    acmefm = "ACME",
    cisdist = 100e+03,
    threads = 2,
    workdir = paste0(tempdirectory, "/filematrices"),
    verbose = FALSE)
The filematrix ACME now holds the estimates for all local gene-SNP pairs.
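# Open the result filematrix; each column holds one gene-SNP pair
fm = fm.open(paste0(tempdirectory, "/filematrices/ACME"))
# Read the first 10 pairs and keep the row (variable) names
TenResults = fm[, 1:10]
rownames(TenResults) = rownames(fm)
close(fm)
# Transpose so that each pair is printed as one table row
pander(t(TenResults))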
geneid | snp_id | beta0 | beta1 | nits | SSE | SST | F | eta | SE |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 98.4 | -36.1 | 7 | 102 | 116 | 13.7 | -0.367 | 0.0507 |
1 | 2 | 83.8 | -11.2 | 7 | 115 | 116 | 1.04 | -0.133 | 0.112 |
2 | 10 | 142 | 9.41 | 4 | 126 | 127 | 0.12 | 0.0662 | 0.2 |
2 | 11 | 101 | 66.3 | 5 | 117 | 127 | 8.24 | 0.656 | 0.322 |
2 | 12 | 144 | 7.02 | 6 | 127 | 127 | 0.0711 | 0.0488 | 0.19 |
2 | 13 | 160 | -17.6 | 5 | 126 | 127 | 0.346 | -0.111 | 0.171 |
3 | 20 | 141 | -25.1 | 5 | 99.2 | 102 | 2.52 | -0.178 | 0.0897 |
3 | 21 | 100 | 31.5 | 5 | 99.4 | 102 | 2.33 | 0.315 | 0.244 |
3 | 22 | 113 | 4.64 | 5 | 102 | 102 | 0.089 | 0.0412 | 0.144 |
3 | 23 | 137 | -33.5 | 6 | 97.3 | 102 | 4.46 | -0.245 | 0.085 |
Now we can estimate multi-SNP ACME models for each gene:
multisnpACME(
    genefm = 'gene',
    snpsfm = 'snps',
    glocfm = 'gene_loc',
    slocfm = 'snps_loc',
    cvrtfm = 'cvrt',
    acmefm = 'ACME',
    workdir = paste0(tempdirectory, "/filematrices"),
    genecap = Inf,
    verbose = FALSE)
The filematrix ACME_multiSNP now holds the estimates for all multi-SNP models.
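# Open the multi-SNP result filematrix; each column holds one SNP of a model
fm = fm.open(paste0(tempdirectory, "/filematrices/ACME_multiSNP"))
# Read the first 10 entries and keep the row (variable) names
TenResults = fm[, 1:10]
rownames(TenResults) = rownames(fm)
close(fm)
# Transpose so that each entry is printed as one table row
pander(t(TenResults))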
geneid | snp_id | beta0 | betas | forward_adjR2 |
---|---|---|---|---|
1 | 1 | 95.2 | -34.9 | 0.114 |
2 | 11 | 98.7 | 62.7 | 0.0688 |
3 | 23 | 129 | -28.4 | 0.0341 |
3 | 21 | 129 | 36.4 | 0.057 |
3 | 20 | 129 | -21 | 0.0716 |
4 | 30 | 100 | 30.7 | 0.0324 |
4 | 33 | 100 | -24.1 | 0.0616 |
5 | 41 | 123 | 41.7 | 0.0379 |
5 | 40 | 123 | -18.5 | 0.0407 |
6 | 51 | 104 | -26.6 | 0.0492 |
Note that each multi-SNP model will contain at least one SNP, even if that initial SNP was not significant under the single-SNP models. This initial SNP will be the one with the highest adjusted-R\(^2\) value among the single-SNP models. However, after the initial SNP, further SNPs are added only if the combined model’s adjusted-R\(^2\) is greater than that from the previous combined model.
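The selection rule described above can be illustrated with a small base-R sketch. It is an illustration under stated assumptions, not the code used by multisnpACME: ordinary linear regression (`lm`) stands in for the ACME fit, the data are simulated, and picking the candidate with the highest adjusted-R\(^2\) at each step is an assumed ordering. The helper `adj_r2` is hypothetical.

```r
# Sketch only: forward selection by adjusted R^2, with lm() standing in
# for the ACME fit (an assumption made for illustration).
set.seed(2)
n = 200
nsnp = 6
snps = matrix(rbinom(n * nsnp, size = 2, prob = 0.3), n, nsnp)
colnames(snps) = paste0("snp", seq_len(nsnp))
# Simulated expression: only snp2 and snp5 have an effect
y = 1.0 * snps[, 2] - 0.7 * snps[, 5] + rnorm(n)

# Hypothetical helper: adjusted R^2 of the model with the given SNP columns
adj_r2 = function(cols) {
    summary(lm(y ~ snps[, cols, drop = FALSE]))$adj.r.squared
}

# Step 1: the initial SNP is always included, even if it is not
# significant; it is the one with the highest single-SNP adjusted R^2
single = sapply(seq_len(nsnp), adj_r2)
selected = which.max(single)
best = max(single)

# Step 2: add further SNPs only while the adjusted R^2 increases
repeat {
    candidates = setdiff(seq_len(nsnp), selected)
    if (length(candidates) == 0) break
    scores = sapply(candidates, function(j) adj_r2(c(selected, j)))
    if (max(scores) <= best) break
    selected = c(selected, candidates[which.max(scores)])
    best = max(scores)
}

print(colnames(snps)[selected])
print(best)
```

In this sketch `best` plays the role of the forward_adjR2 column: it can only increase as SNPs are added to a model.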