The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

A set-based association test in snpsettest

Jaehyun Joo

09 September, 2023

For set-based association tests, the snpsettest package employed the statistical model described in VEGAS (versatile gene-based association study) [1], which takes as input variant-level p values and reference linkage disequilibrium (LD) data. Briefly, the test statistics is defined as the sum of squared variant-level Z-statistics. Letting a set of \(Z\) scores of individual SNPs \(z_i\) for \(i \in 1:p\) within a set \(s\), the test statistic \(Q_s\) is

\[Q_s = \sum_{i=1}^p z_i^2\]

Here, \(Z = \{z_1,...,z_p\}'\) is a vector of multivariate normal distribution with a mean vector \(\mu\) and a covariance matrix \(\Sigma\) in which \(\Sigma\) represents LD among SNPs. To test a set-level association, we need to evaluate the distribution of \(Q_s\). VEGAS uses Monte Carlo simulations to approximate the distribution of \(Q_s\) (directly simulate \(Z\) from multivariate normal distribution), and thus, compute a set-level p value. However, its use is hampered in practice when set-based p values are very small because the number of simulations required to obtain such p values is be very large. The snpsettest package utilizes a different approach to evaluate the distribution of \(Q_s\) more efficiently.

Let \(Y = \Sigma^{-\frac12}Z\) (instead of \(\Sigma^{-\frac12}\), we could use any decomposition that satisfies \(\Sigma = AA'\) with a \(p \times p\) non-singular matrix \(A\) such that \(Y = A^{-1}Z\)). Then,

\[ \begin{gathered} E(Y) = \Sigma^{-\frac12} \mu \\ Var(Y) = \Sigma^{-\frac12}\Sigma\Sigma^{-\frac12} = I_p \\ Y \sim N(\Sigma^{-\frac12} \mu,~I_p) \end{gathered} \]

Now, we posit \(U = \Sigma^{-\frac12}(Z - \mu)\) so that

\[U \sim N(\mathbf{0}, I_p),~~U = Y - \Sigma^{-\frac12}\mu\]

and express the test statistic \(Q_s\) as a quadratic form:

\[ \begin{aligned} Q_s &= \sum_{i=1}^p z_i^2 = Z'I_pZ = Y'\Sigma^{\frac12}I_p\Sigma^{\frac12}Y \\ &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu) \end{aligned} \]

With the spectral theorem, \(\Sigma\) can be decomposed as follow:

\[ \begin{gathered} \Sigma = P\Lambda P' \\ \Lambda = \mathbf{diag}(\lambda_1,...,\lambda_p),~~P'P = PP' = I_p \end{gathered} \]

where \(P\) is an orthogonal matrix. If we set \(X = P'U\), \(X\) is a vector of independent standard normal variable \(X \sim N(\mathbf{0}, I_p)\) since

\[E(X) = P'E(U) = \mathbf{0},~~Var(X) = P'Var(U)P = P'I_pP = I_p\]

\[ \begin{aligned} Q_s &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu) \\ &= (U + \Sigma^{-\frac12}\mu)'P\Lambda P'(U + \Sigma^{-\frac12}\mu) \\ &= (X + P'\Sigma^{-\frac12}\mu)'\Lambda (X + P'\Sigma^{-\frac12}\mu) \end{aligned} \]

Under the null hypothesis, \(\mu\) is assumed to be \(\mathbf{0}\). Hence,

\[Q_s = X'\Lambda X = \sum_{i=1}^p \lambda_i x_i^2\]

where \(X = \{x_1,...,x_p\}'\). Thus, the null distribution of \(Q_s\) is a linear combination of independent chi-square variables \(x_i^2 \sim \chi_{(1)}^2\) (i.e., central quadratic form in independent normal variables). For computing a probability with a scalar \(q\),

\[Pr(Q_s > q)\]

several methods have been proposed, such as numerical inversion of the characteristic function [2]. The snpsettest package uses the algorithm of Davies [3] or saddlepoint approximation [4] to obtain set-based p values.

References

  1. Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am J Hum Genet. 2010 Jul 9;87(1):139–45.

  2. Duchesne P, De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.

  3. Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of Chi-square Random Variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.

  4. Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika. 1999;86(4):929–35.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.