The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
twoPhaseGAS provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1).
The package offers:
DataGeneration_TPD().twoPhaseDesign() / twoPhase().twoPhaseSPML(), which implements a semiparametric maximum
likelihood (SPML) estimator using the EM algorithm.library(twoPhaseGAS)
data.table::setDTthreads(1)
set.seed(42)
data <- DataGeneration_TPD(
Beta0 = 2, # intercept
Beta1 = 0.5, # genetic effect (G on Y)
Disp = 1, # error variance for default gaussian family
N = 3000, # phase 1 cohort size
LD.r = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
P_g = 0.20, # MAF of G
P_z = 0.30 # MAF of Z
)
#> Beta0=2 Beta1=0.5
head(data)
#> wait_it Y G1 Z G0 S
#> 1 1 1.314338 0 1 0 1
#> 2 1 1.207286 0 2 0 1
#> 3 1 1.592996 0 0 0 1
#> 4 1 1.351329 1 1 1 1
#> 5 1 3.115760 0 0 0 3
#> 6 1 1.120543 0 0 1 1The simulated data frame contains:
| Column | Description |
|---|---|
Y |
Continuous outcome |
G1 |
Causal sequence variant (absent from phase 1 GWAS) |
Z |
GWAS SNP (available for everyone) |
S |
Stratum variable derived from residuals |
Here we use simple random sampling; twoPhaseDesign() can
be used for optimised designs.
set.seed(1)
n2 <- 500 # phase 2 size
R <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L # 1 = selected for phase 2
data[R == 0, c("G1")] <- NA # Make all phase 1 data complement G1 values missing
cat("Phase 1 complement:", sum(R == 0), "individuals\n")
#> Phase 1 complement: 2500 individuals
cat("Phase 2: ", sum(R == 1), "individuals\n")
#> Phase 2: 500 individualsWhen the missing-by-design covariate (miscov) is
absent from formula,
twoPhaseSPML() fits the null model and returns score
statistics.
res_Ho <- twoPhaseSPML(
formula = Y ~ Z,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ho)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate
#> (Intercept) 2.016
#> Z 0.290
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.074
#> Log-likelihood: -7382 (EM iterations: 45 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Null model - Score statistics]
#> Observed score statistic (Sobs): 26.06
#> Expected score statistic (Sexp): 25.51
#> Degrees of freedom: 1
#> p-value (chi-squared): 3.313e-07The print method (modelled on glm)
displays:
Including miscov in formula switches to the
alternative model. The EM algorithm now estimates the effect of
G1 jointly with the regression parameters, and Wald
statistics are reported.
res_Ha <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data
)
print(res_Ha)
#>
#> Call:
#> twoPhaseSPML(formula = Y ~ Z + G1, miscov = ~G1, auxvar = ~Z, family = gaussian)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 2.00924 0.02548 78.846 < 2e-16 ***
#> Z -0.09125 0.06188 -1.474 0.14
#> G1 0.61235 0.08688 7.048 1.81e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Family: gaussian Link: identity
#> Dispersion: 1.021
#> Log-likelihood: -7367 (EM iterations: 53 )
#>
#> Hypothesis test for missing-by-design covariate(s): G1
#> [Alternative model - Wald statistics]
#> Observed Wald statistic (Wobs): 49.68
#> Expected Wald statistic (Wexp): 38.87
#> Degrees of freedom: 1
#> p-value (chi-squared): 1.811e-12The returned object is a list of class
"twoPhaseSPML".
# Regression coefficients
res_Ha$theta
#> (Intercept) Z G1
#> 2.00923599 -0.09124575 0.61234907
# Log-likelihood
res_Ha$ll
#> [1] -7367.262
# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)
#> G1 Z q
#> 1 0 0 0.477127116
#> 2 1 0 0.004206217
#> 3 0 1 0.151060698
#> 4 1 1 0.269605968
#> 5 0 2 0.012126314
#> 6 1 2 0.049821574
# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, " df =", res_Ha$df, "\n")
#> Wobs = 49.67874 df = 1
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
#> p-value = 1.810981e-12start.values for warm startsWhen analysing many variants, passing converged estimates from a
nearby model as start.values can reduce computation
time.
# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
formula = Y ~ Z + G1,
miscov = ~ G1,
auxvar = ~ Z,
data = data,
start.values = list(betas = res_Ho$theta,
q = res_Ho$qGZ)
)sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS 15.7.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] twoPhaseGAS_1.2.5
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.12 enrichwith_0.4.0 bigmemory.sri_0.1.3
#> [4] kofnGA_1.3 rstudioapi_0.13 knitr_1.37
#> [7] magrittr_2.0.3 MASS_7.3-54 lattice_0.20-45
#> [10] R6_2.6.1 rlang_1.1.7 fastmap_1.1.0
#> [13] bigmemory_4.5.36 stringr_1.4.0 tools_4.1.2
#> [16] parallel_4.1.2 grid_4.1.2 dfoptim_2023.1.0
#> [19] data.table_1.18.0 xfun_0.34 cli_3.6.5
#> [22] jquerylib_0.1.4 htmltools_0.5.3 yaml_2.2.1
#> [25] digest_0.6.37 Matrix_1.6-5 nloptr_1.2.2.3
#> [28] sass_0.4.2 evaluate_1.0.5 rmarkdown_2.17
#> [31] stringi_1.7.5 compiler_4.1.2 bslib_0.3.1
#> [34] jsonlite_2.0.0These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.