Best subset selection with combss

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

combss solves the best subset selection problem in generalised linear models by reformulating it as a continuous optimisation over the hypercube $[0, 1]^p$ and applying a Frank-Wolfe homotopy algorithm. The inner ridge problem is solved with glmnet. Three families are supported:

Linear regression

A simulated dataset with $n = 300$ observations and $p = 30$ predictors, of which only the first five are truly active.

set.seed(102)
n <- 300; p <- 30
beta <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5))
x <- matrix(rnorm(n * p), n, p)
y <- as.numeric(x %*% beta + rnorm(n) * 0.5)

itr <- 1:200; iva <- 201:300
fit <- combss(x[itr, ], y[itr],
              x_val = x[iva, ], y_val = y[iva],
              family = "gaussian", q = 10)
fit
#> COMBSS fit
#>   family:    gaussian
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    5
#>   mse:       0.22222
#>   subset:    1, 2, 3, 4, 5

fit$subset_list contains the selected feature set for each k = 1, ..., q:

fit$subset_list[1:8]
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 1 2 3 4
#> 
#> [[5]]
#> [1] 1 2 3 4 5
#> 
#> [[6]]
#> [1] 1 2 3 4 5 6
#> 
#> [[7]]
#> [1]  1  2  3  4  5  6 21
#> 
#> [[8]]
#> [1]  1  2  3  4  5  6 10 21

fit$mse_path gives the validation MSE at each k, and fit$best_k the size with smallest MSE.

plot(seq_along(fit$mse_path), fit$mse_path, type = "b",
     xlab = "k", ylab = "Validation MSE")
abline(v = fit$best_k, lty = 2)

Binary logistic regression

ybin <- as.numeric(plogis(x %*% beta) > 0.5)
fit_bin <- combss(x[itr, ], ybin[itr],
                  x_val = x[iva, ], y_val = ybin[iva],
                  family = "binomial", q = 10)
fit_bin
#> COMBSS fit
#>   family:    binomial
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    5
#>   accuracy:  1
#>   subset:    1, 2, 3, 4, 5

Multinomial logistic regression

ymulti <- cut(as.numeric(x %*% beta),
              breaks = c(-Inf, -1, 1, Inf),
              labels = c("a", "b", "c"))
fit_mn <- combss(x[itr, ], ymulti[itr],
                 x_val = x[iva, ], y_val = ymulti[iva],
                 family = "multinomial", q = 10)
fit_mn
#> COMBSS fit
#>   family:    multinomial
#>   n, p:      200, 30
#>   q:         10
#>   lam_ridge: 0
#>   best k:    6
#>   accuracy:  0.87
#>   subset:    1, 2, 3, 4, 5, 12

LOOCV ridge selection

For each candidate ridge penalty lam_ridge, combss_cv() runs COMBSS and evaluates LOOCV error on the chosen subset.

cv <- combss_cv(x, y, family = "gaussian", q = 6)
cv$best_lambda

Predicting on new data

predict() refits on the chosen subset of the original training data and predicts on newx.

preds <- predict(fit, newx = x[iva, ],
                 x_train = x[itr, ], y_train = y[itr])
head(preds)
#> [1] -2.94159809 -6.56104048  1.71124589  4.53489027  0.11400216 -0.02461332

References

Moka, S., Liquet, B., Zhu, H. and Muller, S. (2024). COMBSS: best subset selection via continuous optimization. Statistics and Computing.
Mathur, A., Liquet, B., Muller, S. and Moka, S. (2026). Parsimonious Subset Selection for Generalized Linear Models with Biomedical Applications. arXiv:2603.21952.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.