Large-Scale Linear Regression with lsReg

Overview

lsReg is designed for settings where a large number of candidate covariates must each be tested for association with an outcome, given a fixed set of baseline covariates. Rather than fitting a separate model for each candidate, lsReg pre-computes and caches expensive quantities from the base model once, then reuses them for every subsequent test. This is the pattern used in genome-wide association studies (GWAS), where thousands or millions of genetic variants are tested one at a time against a common base model.
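For a Gaussian model, the algebra behind this reuse can be illustrated without lsReg: by the Frisch–Waugh–Lovell identity, the coefficient a candidate covariate would receive in the full model equals the slope from a single residual-on-residual regression, where the base-model residuals need to be computed only once. A minimal sketch on simulated data (all names below are illustrative; nothing here uses the lsReg API):

```r
## Sketch of the precompute-once idea via the Frisch-Waugh-Lovell identity.
## Illustrative only -- this does not use lsReg.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.5)
z  <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + 0.8 * z + rnorm(n)

## Coefficient on z from the full fit.
full_coef <- coef(lm(y ~ x1 + x2 + z))["z"]

## Precompute once: residuals of y on the base covariates.
ry <- resid(lm(y ~ x1 + x2))

## Per candidate: residualize z on the base covariates, then one
## residual-on-residual slope reproduces the full-model coefficient.
rz       <- resid(lm(z ~ x1 + x2))
fwl_coef <- sum(ry * rz) / sum(rz^2)

all.equal(unname(full_coef), fwl_coef)  # TRUE
```

Only the per-candidate residualization depends on z; everything involving the base covariates alone is computed once and reused, which is what makes screening many candidates cheap.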

The workflow has two steps:

  1. Allocate — fit the base model with glm(), then call lsReg() to cache the quantities needed for repeated testing.
  2. Test — call addcovar() once per candidate covariate, retrieving the parameter estimate and test statistic from the allocated memory object.

Example data

We use a simulated dataset with 1,000 observations. The outcome ylin was generated as a linear function of x1, x2, z5, and z9. The variables z1 through z10 are candidate covariates to be screened.

library(lsReg)

datafile <- system.file("extdata", "simulated_data.rds", package = "lsReg")
dat <- readRDS(datafile)
head(dat[, c("ylin", "x1", "x2", "z1", "z2", "z5", "z9")])
#>        ylin          x1 x2 z1 z2 z5        z9
#> 1 1.2926463 -1.52199116  1  0  1  0 0.5931028
#> 2 3.9721109 -0.96937043  1  0  0  0 0.6778513
#> 3 0.5331866 -0.38494124  0  0  0  0 0.8131135
#> 4 3.3306664 -0.24568841  0  1  1  0 0.2320682
#> 5 4.1558941 -0.49640532  0  0  1  0 0.7719654
#> 6 2.5245574  0.07975954  0  0  1  0 0.2162713

Step 1: Fit the base model

The base model includes only the baseline covariates x1 and x2.

basemdl <- glm(ylin ~ x1 + x2, data = dat, family = gaussian)
summary(basemdl)
#> 
#> Call:
#> glm(formula = ylin ~ x1 + x2, family = gaussian, data = dat)
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  4.56684    0.11907  38.355  < 2e-16 ***
#> x1           1.03466    0.08637  11.979  < 2e-16 ***
#> x2           0.59230    0.17517   3.381  0.00075 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 7.616899)
#> 
#>     Null deviance: 8798.1  on 999  degrees of freedom
#> Residual deviance: 7594.0  on 997  degrees of freedom
#> AIC: 4873.2
#> 
#> Number of Fisher Scoring iterations: 2

Step 2: Allocate memory

Pass the fitted base model to lsReg(), specifying that each test will add one column (colstoadd = 1) and that the Wald test statistic should be used.

mem <- lsReg(basemdl, colstoadd = 1, testtype = "wald")

Step 3: Test each candidate covariate

Loop over z1 through z10, calling addcovar() for each. After each call the parameter estimate for the candidate covariate is stored in mem$fitdata$betab and the Wald test statistic in mem$testvalue.

zvars <- paste0("z", 1:10)

results <- data.frame(
  variable  = zvars,
  estimate  = NA_real_,
  statistic = NA_real_
)

for (i in seq_along(zvars)) {
  xr <- as.matrix(dat[, zvars[i], drop = FALSE])
  addcovar(mem, xr)
  results$estimate[i]  <- mem$fitdata$betab[1]
  results$statistic[i] <- mem$testvalue[1, 1]
}

Results

Convert each Wald statistic into a two-sided p-value using the standard normal approximation:

results$pvalue <- 2 * pnorm(-abs(results$statistic))
print(results, digits = 4)
#>    variable  estimate statistic    pvalue
#> 1        z1 -0.469022 -1.405562 0.1598543
#> 2        z2 -0.327854 -1.875926 0.0606655
#> 3        z3 -0.096312 -0.532620 0.5942964
#> 4        z4 -0.096532 -0.409402 0.6822450
#> 5        z5  0.744905  3.565794 0.0003628
#> 6        z6 -0.024623 -0.082187 0.9344980
#> 7        z7  0.682225  2.241809 0.0249737
#> 8        z8  0.001659  0.005704 0.9954488
#> 9        z9 -0.748608 -2.471511 0.0134543
#> 10      z10 -0.533338 -1.774089 0.0760484

The variables z5 and z9 have the largest test statistics and the smallest p-values, consistent with the data-generating model. The remaining variables are null; most show small statistics, although z7 crosses the nominal 0.05 threshold, as is expected by chance for some fraction of null tests.
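In a genuine screen over many candidates, the raw p-values above would normally be adjusted for multiple testing, which lsReg itself does not do. A minimal sketch using base R's p.adjust() on the results data frame built in the loop above:

```r
## Adjust the screening p-values for multiple testing (base R only).
## Assumes the `results` data frame from the loop above is in scope.
results$bonferroni <- p.adjust(results$pvalue, method = "bonferroni")
results$bh         <- p.adjust(results$pvalue, method = "BH")
print(results[, c("variable", "pvalue", "bonferroni", "bh")], digits = 4)
```

With ten tests, only z5 survives a Bonferroni correction at the 0.05 level (0.0003628 × 10 ≈ 0.0036); z9 does not (0.0135 × 10 ≈ 0.135).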

Verification against standard GLM

To confirm that lsReg produces the same estimates and test statistics as a full glm() fit, we fit the models ylin ~ x1 + x2 + z5 and ylin ~ x1 + x2 + z9 directly and compare.

glm_z5 <- glm(ylin ~ x1 + x2 + z5, data = dat, family = gaussian)
glm_z9 <- glm(ylin ~ x1 + x2 + z9, data = dat, family = gaussian)

glm_est_z5  <- coef(glm_z5)["z5"]
glm_stat_z5 <- coef(summary(glm_z5))["z5", "t value"]

glm_est_z9  <- coef(glm_z9)["z9"]
glm_stat_z9 <- coef(summary(glm_z9))["z9", "t value"]

lsreg_est_z5  <- results$estimate[results$variable == "z5"]
lsreg_stat_z5 <- results$statistic[results$variable == "z5"]

lsreg_est_z9  <- results$estimate[results$variable == "z9"]
lsreg_stat_z9 <- results$statistic[results$variable == "z9"]

comparison <- data.frame(
  variable  = c("z5", "z9"),
  glm_estimate  = c(glm_est_z5,  glm_est_z9),
  lsreg_estimate = c(lsreg_est_z5, lsreg_est_z9),
  glm_statistic  = c(glm_stat_z5,  glm_stat_z9),
  lsreg_statistic = c(lsreg_stat_z5, lsreg_stat_z9)
)
print(comparison, digits = 6)
#>    variable glm_estimate lsreg_estimate glm_statistic lsreg_statistic
#> z5       z5     0.744905       0.744905       3.56579         3.56579
#> z9       z9    -0.748608      -0.748608      -2.47151        -2.47151

The estimates and test statistics from lsReg match those from glm() to numerical precision.
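Rather than eyeballing the printed table, the agreement can also be asserted programmatically. A small sketch using base R's all.equal(), assuming the comparison data frame built above is in scope:

```r
## Assert that lsReg and glm() agree to numerical tolerance; stops with
## an error if any column pair differs. Assumes `comparison` from above.
stopifnot(
  isTRUE(all.equal(comparison$glm_estimate,  comparison$lsreg_estimate)),
  isTRUE(all.equal(comparison$glm_statistic, comparison$lsreg_statistic))
)
```

A check like this is convenient in automated tests of the package, where a silent pass or a loud failure is more useful than a printed table.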
