The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

ssp.glm.rF: Balanced Subsampling for Preserving Rare Features in Generalized Linear Models

This vignette illustrates how to use ssp.glm.rF() for generalized linear models with rare binary features. A rare feature is a binary covariate that takes the value 1 in only a small fraction of observations. Ordinary subsampling can miss many of these expressed rare-feature observations, making estimation of the corresponding coefficients unstable.

ssp.glm.rF() uses rarity-aware sampling probabilities for one-step balance-score sampling, two-step optimal subsampling, optional response balancing for binary outcomes, automatic rare-feature detection, and a combined estimator fitted on the union of the pilot and second-step subsamples.

Setup

library(subsampling)

Simulated Logistic Regression Example

We first simulate a logistic regression dataset with two rare binary features and two continuous covariates. In the formula Y ~ ., the model matrix contains an intercept column. Numeric rareFeature.index values follow the original covariate order supplied by the user; the function internally shifts the indices to account for the intercept column.

set.seed(2)
N <- 5000
Z1 <- rbinom(N, 1, 0.04)
Z2 <- rbinom(N, 1, 0.07)
X1 <- rnorm(N)
X2 <- rnorm(N)
eta <- 0.5 + 0.5 * Z1 + 0.5 * Z2 + 0.5 * X1 + 0.5 * X2
Y <- rbinom(N, 1, plogis(eta))

data <- data.frame(Y, Z1, Z2, X1, X2)
formula <- Y ~ .
rareFeature.index <- 1:2
n.plt <- 300
n.ssp <- 700

One-Step Balanced Sampling

The default criterion is "BL-Uni". It draws one Poisson subsample with probabilities proportional to the rare-feature balance score. For "BL-Uni" and "Uni", the expected sample size is n.plt + n.ssp.

For observation \(i\), the balance score is \[ b(Z_i) = \sum_{j=1}^{d_r} \frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j}, \] where \(d_r\) is the number of rare features and \(\bar{Z}_j\) is the prevalence of the \(j\)th rare feature in the full data. Observations containing expressed rare features receive larger scores and therefore larger sampling probabilities.

fit_bl <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = rareFeature.index
)

summary(fit_bl)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = rareFeature.index)
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size   1001
#>    Expected Subsample Size   1000
#>      Actual Subsample Size   1001
#>      Unique Subsample Size   1001
#>        Combined Union Size   1001
#>              Overlap Count   1001
#>      Overlap Rate in Pilot   100%
#>  Overlap Rate in Subsample   100%
#>    Expected Subsample Rate    20%
#>      Actual Subsample Rate 20.02%
#>      Unique Subsample Rate 20.02%
#> 
#> Sample Composition (Logistic Regression):
#>          Sample Size Y_count Y_rate Rare_row_rate
#>           Pilot 1001     656 65.53%        49.15%
#>       Subsample 1001     656 65.53%        49.15%
#>  Combined union 1001     656 65.53%        49.15%
#>       Full data 5000    3044 60.88%        11.32%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>       Z1             212         21.18%        212     4.24%             100%
#>       Z2             298         29.77%        372     7.44%           80.11%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   509 50.85%
#>                1_ones   474 47.35%
#>                2_ones    18   1.8%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4701     0.0951  4.9413  <0.0001
#> Z1          0.5197     0.1816  2.8620   0.0042
#> Z2          0.4508     0.1600  2.8169   0.0048
#> X1          0.3368     0.0891  3.7818   0.0002
#> X2          0.5454     0.0859  6.3484  <0.0001

The summary reports the realized sample size, response composition, rare-feature coverage, and coefficient estimates. Because this is a one-step method, the pilot, subsample, and combined estimators are the same.

Two-Step Rareness-Aware Optimal Subsampling

Two-step criteria such as "Lopt", "Aopt", "R-Lopt", and "BL-Lopt" first draw a pilot sample, fit a pilot estimator, compute second-step sampling probabilities, and then refit the model on the second-step subsample. The final combined estimator is fitted on the union of the pilot and second-step samples.

For two-step methods, balance.X.plt = TRUE draws the pilot sample using the balance score.

fit_rlopt <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

summary(fit_rlopt)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, 
#>     family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE, 
#>     rareFeature.index = c("Z1", "Z2"))
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size    300
#>    Expected Subsample Size    700
#>      Actual Subsample Size    702
#>      Unique Subsample Size    702
#>        Combined Union Size    892
#>              Overlap Count    110
#>      Overlap Rate in Pilot 36.67%
#>  Overlap Rate in Subsample 15.67%
#>    Expected Subsample Rate    14%
#>      Actual Subsample Rate 14.04%
#>      Unique Subsample Rate 14.04%
#> 
#> Sample Composition (Logistic Regression):
#>          Sample Size Y_count Y_rate Rare_row_rate
#>           Pilot  300     211 70.33%           55%
#>       Subsample  702     332 47.29%           49%
#>  Combined union  892     483 54.15%        45.74%
#>       Full data 5000    3044 60.88%        11.32%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>       Z1             150         21.37%        212     4.24%           70.75%
#>       Z2             209         29.77%        372     7.44%           56.18%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   358    51%
#>                1_ones   329 46.87%
#>                2_ones    15  2.14%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.7663     0.1986  3.8579   0.0001
#> Z1          0.5044     0.3246  1.5541   0.1202
#> Z2          0.3788     0.3488  1.0860   0.2775
#> X1          0.5885     0.1784  3.2984   0.0010
#> X2          0.8452     0.1801  4.6936  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.3045     0.1166  2.6118   0.0090
#> Z1          0.7609     0.2004  3.7969   0.0001
#> Z2          0.6093     0.1799  3.3874   0.0007
#> X1          0.5275     0.0927  5.6909  <0.0001
#> X2          0.5109     0.0974  5.2464  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.4040     0.0943  4.2854  <0.0001
#> Z1          0.6489     0.1860  3.4887   0.0005
#> Z2          0.6148     0.1625  3.7833   0.0002
#> X1          0.5406     0.0803  6.7342  <0.0001
#> X2          0.5595     0.0767  7.2953  <0.0001

Automatically Account for Rarity if rareFeature.index = NULL

If rareFeature.index = NULL, the function searches for binary columns whose prevalence is below rareThreshold.

fit_auto <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  rareFeature.index = NULL,
  rareThreshold = 0.09
)

fit_auto$rareFeature.index
#> Z1 Z2 
#>  2  3

Balancing the Outcome for Logistic Regression

For binary outcomes, balance.Y.ssp = TRUE applies a case-control style allocation for one-step "Uni" and "BL-Uni" methods. The option balance.Y.all = TRUE includes all observations with Y = 1 and subsamples from observations with Y = 0.

fit_y_balanced <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "BL-Uni",
  balance.Y.ssp = TRUE,
  rareFeature.index = c("Z1", "Z2")
)

c(
  full_Y_rate = mean(data$Y),
  subsample_Y_rate = fit_y_balanced$Y.proportion.ssp
)
#>      full_Y_rate subsample_Y_rate 
#>        0.6088000        0.5185557

Objective Weights

By default, sampled observations are fitted with inverse-probability weights. For one-step methods whose sampling probabilities do not depend on the response, objective.weight = "unweighted" can be used. Two-step methods currently use a weighted second-step objective.

Control Options

The control argument accepts alpha, b, and poi.method. The default poi.method = "exact" computes Poisson probabilities using full-data normalization. The alternative poi.method = "estimated" uses the pilot sample to estimate the normalizing quantity.

fit_estimated <- ssp.glm.rF(
  formula = formula,
  data = data,
  n.plt = n.plt,
  n.ssp = n.ssp,
  family = "quasibinomial",
  criterion = "R-Lopt",
  balance.X.plt = TRUE,
  rareFeature.index = c("Z1", "Z2"),
  control = list(poi.method = "estimated", b = 2),
  record.stage.time = TRUE
)

fit_estimated$stage.time
#>      balance_score     pilot.estimate        subsampling subsample.estimate 
#>              0.001              0.001              0.001              0.001 
#>    combining.union    result.assembly 
#>              0.001              0.000

Non-Binomial Families

The rare-feature machinery can also be used with non-binomial GLMs. Response balancing options are ignored for non-binary outcomes.

set.seed(3)
N_g <- 3000
Z <- rbinom(N_g, 1, 0.05)
X <- rnorm(N_g)
y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g)
gaussian_data <- data.frame(y, Z, X)

fit_gaussian <- ssp.glm.rF(
  y ~ .,
  data = gaussian_data,
  n.plt = 200,
  n.ssp = 500,
  family = "gaussian",
  criterion = "BL-Uni",
  rareFeature.index = "Z"
)

summary(fit_gaussian)
#> Model Summary
#> 
#> Call:
#> 
#> ssp.glm.rF(formula = y ~ ., data = gaussian_data, n.plt = 200, 
#>     n.ssp = 500, family = "gaussian", criterion = "BL-Uni", rareFeature.index = "Z")
#> 
#> Design Summary:
#>                   Variable  Value
#>          Pilot Sample Size    680
#>    Expected Subsample Size    700
#>      Actual Subsample Size    680
#>      Unique Subsample Size    680
#>        Combined Union Size    680
#>              Overlap Count    680
#>      Overlap Rate in Pilot   100%
#>  Overlap Rate in Subsample   100%
#>    Expected Subsample Rate 23.33%
#>      Actual Subsample Rate 22.67%
#>      Unique Subsample Rate 22.67%
#> 
#> Sample Composition:
#>          Sample Size Rare_row_rate
#>           Pilot  680        22.21%
#>       Subsample  680        22.21%
#>  Combined union  680        22.21%
#>       Full data 3000         5.03%
#> 
#> Subsample Rare-Feature Summary:
#>  Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#>        Z             151         22.21%        151   5.0333%             100%
#> 
#> Subsample Rare-Pattern Distribution:
#>  Rare_features_in_row Count   Rate
#>                0_ones   529 77.79%
#>                1_ones   151 22.21%
#> 
#> Pilot Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001
#> 
#> Second-Step Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001
#> 
#> Combined-Union Coefficients:
#> 
#>           Estimate Std. Error z value Pr(>|z|)
#> Intercept   0.9475     0.0450 21.0475  <0.0001
#> Z           0.6655     0.1009  6.5962  <0.0001
#> X           0.2663     0.0417  6.3806  <0.0001

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.