Introduction to twoPhaseGAS

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Osvaldo Espin-Garcia, Apostolos Dimitromanolakis, Shelley Bull

2026-05-19

Overview

twoPhaseGAS provides tools for the design and analysis of two-phase genetic association studies (GAS) in which a subset of the full cohort is selected for targeted sequencing (phase 2), while the remaining individuals contribute only GWAS-level data (phase 1).

The package offers:

Data simulation via DataGeneration_TPD().
Phase 2 design optimisation via twoPhaseDesign() / twoPhase().
Joint analysis of phase 1 and phase 2 data via twoPhaseSPML(), which implements a semiparametric maximum likelihood (SPML) estimator using the EM algorithm.

1. Simulating two-phase data

library(twoPhaseGAS)

data.table::setDTthreads(1)

set.seed(42)
data <- DataGeneration_TPD(
  Beta0  = 2,    # intercept
  Beta1  = 0.5,  # genetic effect (G on Y)
  Disp = 1,      # error variance for default gaussian family
  N      = 3000, # phase 1 cohort size
  LD.r   = 0.75, # LD (r) between causal SNP G and GWAS SNP Z
  P_g    = 0.20, # MAF of G
  P_z    = 0.30  # MAF of Z
)
#> Beta0=2 Beta1=0.5
head(data)
#>   wait_it        Y G1 Z G0 S
#> 1       1 1.314338  0 1  0 1
#> 2       1 1.207286  0 2  0 1
#> 3       1 1.592996  0 0  0 1
#> 4       1 1.351329  1 1  1 1
#> 5       1 3.115760  0 0  0 3
#> 6       1 1.120543  0 0  1 1

The simulated data frame contains:

Column	Description
`Y`	Continuous outcome
`G1`	Causal sequence variant (absent from phase 1 GWAS)
`Z`	GWAS SNP (available for everyone)
`S`	Stratum variable derived from residuals

2. Selecting a phase 2 subsample

Here we use simple random sampling; twoPhaseDesign() can be used for optimised designs.

set.seed(1)
n2   <- 500                              # phase 2 size
R    <- rep(0L, nrow(data))
R[sample(nrow(data), n2)] <- 1L         # 1 = selected for phase 2

data[R == 0, c("G1")]  <- NA # Make all phase 1 data complement G1 values  missing

cat("Phase 1 complement:", sum(R == 0), "individuals\n")
#> Phase 1 complement: 2500 individuals
cat("Phase 2:           ", sum(R == 1), "individuals\n")
#> Phase 2:            500 individuals

3. Analysis under H₀ (score test)

When the missing-by-design covariate (miscov) is absent from formula, twoPhaseSPML() fits the null model and returns score statistics.

res_Ho <- twoPhaseSPML(
  formula = Y ~ Z,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ho)
#> 
#> Call:
#> twoPhaseSPML(formula = Y ~ Z, miscov = ~G1, auxvar = ~Z, family = gaussian)
#> 
#> Coefficients:
#>             Estimate
#> (Intercept)    2.016
#> Z              0.290
#> 
#> Family: gaussian  Link: identity 
#> Dispersion: 1.074 
#> Log-likelihood: -7382  (EM iterations: 45 )
#> 
#> Hypothesis test for missing-by-design covariate(s): G1 
#>   [Null model - Score statistics]
#>   Observed score statistic (Sobs): 26.06 
#>   Expected score statistic (Sexp): 25.51 
#>   Degrees of freedom: 1 
#>   p-value (chi-squared): 3.313e-07

The print method (modelled on glm) displays:

the fitted call,
estimated regression coefficients,
family and link,
log-likelihood and number of EM iterations, and
the observed and expected score statistics with a chi-squared p-value.

4. Analysis under Hₐ (Wald test)

Including miscov in formula switches to the alternative model. The EM algorithm now estimates the effect of G1 jointly with the regression parameters, and Wald statistics are reported.

res_Ha <- twoPhaseSPML(
  formula = Y ~ Z + G1,
  miscov  = ~ G1,
  auxvar  = ~ Z,
  data   = data
)

print(res_Ha)
#> 
#> Call:
#> twoPhaseSPML(formula = Y ~ Z + G1, miscov = ~G1, auxvar = ~Z, family = gaussian)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  2.00924    0.02548  78.846  < 2e-16 ***
#> Z           -0.09125    0.06188  -1.474     0.14    
#> G1           0.61235    0.08688   7.048 1.81e-12 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Family: gaussian  Link: identity 
#> Dispersion: 1.021 
#> Log-likelihood: -7367  (EM iterations: 53 )
#> 
#> Hypothesis test for missing-by-design covariate(s): G1 
#>   [Alternative model - Wald statistics]
#>   Observed Wald statistic (Wobs): 49.68 
#>   Expected Wald statistic (Wexp): 38.87 
#>   Degrees of freedom: 1 
#>   p-value (chi-squared): 1.811e-12

5. Accessing results programmatically

The returned object is a list of class "twoPhaseSPML".

# Regression coefficients
res_Ha$theta
#> (Intercept)           Z          G1 
#>  2.00923599 -0.09124575  0.61234907

# Log-likelihood
res_Ha$ll
#> [1] -7367.262

# Estimated joint distribution of G1 and Z
head(res_Ha$qGZ)
#>   G1 Z           q
#> 1  0 0 0.477127116
#> 2  1 0 0.004206217
#> 3  0 1 0.151060698
#> 4  1 1 0.269605968
#> 5  0 2 0.012126314
#> 6  1 2 0.049821574

# Wald statistic and degrees of freedom
cat("Wobs =", res_Ha$Wobs, "  df =", res_Ha$df, "\n")
#> Wobs = 49.67874   df = 1
cat("p-value =", pchisq(res_Ha$Wobs, df = res_Ha$df, lower.tail = FALSE), "\n")
#> p-value = 1.810981e-12

6. Using `start.values` for warm starts

When analysing many variants, passing converged estimates from a nearby model as start.values can reduce computation time.

# Use null-model estimates as starting point for the alternative model
res_warm <- twoPhaseSPML(
  formula     = Y ~ Z + G1,
  miscov      = ~ G1,
  auxvar      = ~ Z,
  data       = data,
  start.values = list(betas = res_Ho$theta,
                      q     = res_Ho$qGZ)
)

Session information

sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS 15.7.5
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] twoPhaseGAS_1.2.5
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.12         enrichwith_0.4.0    bigmemory.sri_0.1.3
#>  [4] kofnGA_1.3          rstudioapi_0.13     knitr_1.37         
#>  [7] magrittr_2.0.3      MASS_7.3-54         lattice_0.20-45    
#> [10] R6_2.6.1            rlang_1.1.7         fastmap_1.1.0      
#> [13] bigmemory_4.5.36    stringr_1.4.0       tools_4.1.2        
#> [16] parallel_4.1.2      grid_4.1.2          dfoptim_2023.1.0   
#> [19] data.table_1.18.0   xfun_0.34           cli_3.6.5          
#> [22] jquerylib_0.1.4     htmltools_0.5.3     yaml_2.2.1         
#> [25] digest_0.6.37       Matrix_1.6-5        nloptr_1.2.2.3     
#> [28] sass_0.4.2          evaluate_1.0.5      rmarkdown_2.17     
#> [31] stringi_1.7.5       compiler_4.1.2      bslib_0.3.1        
#> [34] jsonlite_2.0.0

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Introduction to twoPhaseGAS

Osvaldo Espin-Garcia, Apostolos Dimitromanolakis, Shelley Bull

2026-05-19

Overview

1. Simulating two-phase data

2. Selecting a phase 2 subsample

3. Analysis under H₀ (score test)

4. Analysis under Hₐ (Wald test)

5. Accessing results programmatically

6. Using start.values for warm starts

Session information

6. Using `start.values` for warm starts