The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
ssp.glm.rF: Balanced
Subsampling for Preserving Rare Features in Generalized Linear
ModelsThis vignette illustrates how to use ssp.glm.rF() for
generalized linear models with rare binary features. A rare feature is a
binary covariate that takes the value 1 in only a small fraction of
observations. Ordinary subsampling can miss many of these expressed
rare-feature observations, making estimation of the corresponding
coefficients unstable.
ssp.glm.rF() uses rarity-aware sampling probabilities
for one-step balance-score sampling, two-step optimal subsampling,
optional response balancing for binary outcomes, automatic rare-feature
detection, and a combined estimator fitted on the union of the pilot and
second-step subsamples.
We first simulate a logistic regression dataset with two rare binary
features and two continuous covariates. In the formula
Y ~ ., the model matrix contains an intercept column.
Numeric rareFeature.index values follow the original
covariate order supplied by the user; the function internally shifts the
indices to account for the intercept column.
The default criterion is "BL-Uni". It draws one Poisson
subsample with probabilities proportional to the rare-feature balance
score. For "BL-Uni" and "Uni", the expected
sample size is n.plt + n.ssp.
For observation \(i\), the balance score is \[ b(Z_i) = \sum_{j=1}^{d_r} \frac{|Z_{ij} - \bar{Z}_j|}{\bar{Z}_j}, \] where \(d_r\) is the number of rare features and \(\bar{Z}_j\) is the prevalence of the \(j\)th rare feature in the full data. Observations containing expressed rare features receive larger scores and therefore larger sampling probabilities.
fit_bl <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "BL-Uni",
rareFeature.index = rareFeature.index
)
summary(fit_bl)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "BL-Uni", rareFeature.index = rareFeature.index)
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 1001
#> Expected Subsample Size 1000
#> Actual Subsample Size 1001
#> Unique Subsample Size 1001
#> Combined Union Size 1001
#> Overlap Count 1001
#> Overlap Rate in Pilot 100%
#> Overlap Rate in Subsample 100%
#> Expected Subsample Rate 20%
#> Actual Subsample Rate 20.02%
#> Unique Subsample Rate 20.02%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate Rare_row_rate
#> Pilot 1001 656 65.53% 49.15%
#> Subsample 1001 656 65.53% 49.15%
#> Combined union 1001 656 65.53% 49.15%
#> Full data 5000 3044 60.88% 11.32%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z1 212 21.18% 212 4.24% 100%
#> Z2 298 29.77% 372 7.44% 80.11%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 509 50.85%
#> 1_ones 474 47.35%
#> 2_ones 18 1.8%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4701 0.0951 4.9413 <0.0001
#> Z1 0.5197 0.1816 2.8620 0.0042
#> Z2 0.4508 0.1600 2.8169 0.0048
#> X1 0.3368 0.0891 3.7818 0.0002
#> X2 0.5454 0.0859 6.3484 <0.0001The summary reports the realized sample size, response composition, rare-feature coverage, and coefficient estimates. Because this is a one-step method, the pilot, subsample, and combined estimators are the same.
Two-step criteria such as "Lopt", "Aopt",
"R-Lopt", and "BL-Lopt" first draw a pilot
sample, fit a pilot estimator, compute second-step sampling
probabilities, and then refit the model on the second-step subsample.
The final combined estimator is fitted on the union of the pilot and
second-step samples.
For two-step methods, balance.X.plt = TRUE draws the
pilot sample using the balance score.
fit_rlopt <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "R-Lopt",
balance.X.plt = TRUE,
rareFeature.index = c("Z1", "Z2")
)
summary(fit_rlopt)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp,
#> family = "quasibinomial", criterion = "R-Lopt", balance.X.plt = TRUE,
#> rareFeature.index = c("Z1", "Z2"))
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 300
#> Expected Subsample Size 700
#> Actual Subsample Size 702
#> Unique Subsample Size 702
#> Combined Union Size 892
#> Overlap Count 110
#> Overlap Rate in Pilot 36.67%
#> Overlap Rate in Subsample 15.67%
#> Expected Subsample Rate 14%
#> Actual Subsample Rate 14.04%
#> Unique Subsample Rate 14.04%
#>
#> Sample Composition (Logistic Regression):
#> Sample Size Y_count Y_rate Rare_row_rate
#> Pilot 300 211 70.33% 55%
#> Subsample 702 332 47.29% 49%
#> Combined union 892 483 54.15% 45.74%
#> Full data 5000 3044 60.88% 11.32%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z1 150 21.37% 212 4.24% 70.75%
#> Z2 209 29.77% 372 7.44% 56.18%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 358 51%
#> 1_ones 329 46.87%
#> 2_ones 15 2.14%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.7663 0.1986 3.8579 0.0001
#> Z1 0.5044 0.3246 1.5541 0.1202
#> Z2 0.3788 0.3488 1.0860 0.2775
#> X1 0.5885 0.1784 3.2984 0.0010
#> X2 0.8452 0.1801 4.6936 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.3045 0.1166 2.6118 0.0090
#> Z1 0.7609 0.2004 3.7969 0.0001
#> Z2 0.6093 0.1799 3.3874 0.0007
#> X1 0.5275 0.0927 5.6909 <0.0001
#> X2 0.5109 0.0974 5.2464 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.4040 0.0943 4.2854 <0.0001
#> Z1 0.6489 0.1860 3.4887 0.0005
#> Z2 0.6148 0.1625 3.7833 0.0002
#> X1 0.5406 0.0803 6.7342 <0.0001
#> X2 0.5595 0.0767 7.2953 <0.0001rareFeature.index = NULLIf rareFeature.index = NULL, the function searches for
binary columns whose prevalence is below rareThreshold.
For binary outcomes, balance.Y.ssp = TRUE applies a
case-control style allocation for one-step "Uni" and
"BL-Uni" methods. The option
balance.Y.all = TRUE includes all observations with
Y = 1 and subsamples from observations with
Y = 0.
fit_y_balanced <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "BL-Uni",
balance.Y.ssp = TRUE,
rareFeature.index = c("Z1", "Z2")
)
c(
full_Y_rate = mean(data$Y),
subsample_Y_rate = fit_y_balanced$Y.proportion.ssp
)
#> full_Y_rate subsample_Y_rate
#> 0.6088000 0.5185557By default, sampled observations are fitted with inverse-probability
weights. For one-step methods whose sampling probabilities do not depend
on the response, objective.weight = "unweighted" can be
used. Two-step methods currently use a weighted second-step
objective.
The control argument accepts alpha,
b, and poi.method. The default
poi.method = "exact" computes Poisson probabilities using
full-data normalization. The alternative
poi.method = "estimated" uses the pilot sample to estimate
the normalizing quantity.
fit_estimated <- ssp.glm.rF(
formula = formula,
data = data,
n.plt = n.plt,
n.ssp = n.ssp,
family = "quasibinomial",
criterion = "R-Lopt",
balance.X.plt = TRUE,
rareFeature.index = c("Z1", "Z2"),
control = list(poi.method = "estimated", b = 2),
record.stage.time = TRUE
)
fit_estimated$stage.time
#> balance_score pilot.estimate subsampling subsample.estimate
#> 0.001 0.001 0.001 0.001
#> combining.union result.assembly
#> 0.001 0.000The rare-feature machinery can also be used with non-binomial GLMs. Response balancing options are ignored for non-binary outcomes.
set.seed(3)
N_g <- 3000
Z <- rbinom(N_g, 1, 0.05)
X <- rnorm(N_g)
y <- 1 + 0.6 * Z + 0.2 * X + rnorm(N_g)
gaussian_data <- data.frame(y, Z, X)
fit_gaussian <- ssp.glm.rF(
y ~ .,
data = gaussian_data,
n.plt = 200,
n.ssp = 500,
family = "gaussian",
criterion = "BL-Uni",
rareFeature.index = "Z"
)
summary(fit_gaussian)
#> Model Summary
#>
#> Call:
#>
#> ssp.glm.rF(formula = y ~ ., data = gaussian_data, n.plt = 200,
#> n.ssp = 500, family = "gaussian", criterion = "BL-Uni", rareFeature.index = "Z")
#>
#> Design Summary:
#> Variable Value
#> Pilot Sample Size 680
#> Expected Subsample Size 700
#> Actual Subsample Size 680
#> Unique Subsample Size 680
#> Combined Union Size 680
#> Overlap Count 680
#> Overlap Rate in Pilot 100%
#> Overlap Rate in Subsample 100%
#> Expected Subsample Rate 23.33%
#> Actual Subsample Rate 22.67%
#> Unique Subsample Rate 22.67%
#>
#> Sample Composition:
#> Sample Size Rare_row_rate
#> Pilot 680 22.21%
#> Subsample 680 22.21%
#> Combined union 680 22.21%
#> Full data 3000 5.03%
#>
#> Subsample Rare-Feature Summary:
#> Feature Subsample_count Subsample_rate Full_count Full_rate Coverage_of_full
#> Z 151 22.21% 151 5.0333% 100%
#>
#> Subsample Rare-Pattern Distribution:
#> Rare_features_in_row Count Rate
#> 0_ones 529 77.79%
#> 1_ones 151 22.21%
#>
#> Pilot Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001
#>
#> Second-Step Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001
#>
#> Combined-Union Coefficients:
#>
#> Estimate Std. Error z value Pr(>|z|)
#> Intercept 0.9475 0.0450 21.0475 <0.0001
#> Z 0.6655 0.1009 6.5962 <0.0001
#> X 0.2663 0.0417 6.3806 <0.0001These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.