The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

fbrglm

Safe formula-based wrapper around glmnet: bring glm()’s strict modeling-workflow semantics — formula + data in, predict-time design matrix reconstructed from a frozen recipe out — to L1 / L2 / elastic-net regularized GLMs.

Why

glmnet is the de facto standard for regularized GLMs in R, but its matrix-shaped API hands the full model recipe back to the caller: factor encoding, contrasts, complete-case filtering, rank deficiency, and predict-time design-matrix reconstruction. The recurring failure modes are:

fbrglm brings stats::glm()’s strict conventions to the glmnet engine: frozen terms / xlevels / contrasts on the fit, QR-pivot rank-deficient column drop with NA in coef() / summary(), glm-style error on novel test factor levels (with an opt-in on_new_levels = "na" for production scoring), explicit nobs_info complete-case counts, and an explicit refusal to print classical SE / z / p / CI under infer = "none". The underlying glmnet / cv.glmnet calls are unchanged — predictions are bit-identical to a hand-built raw-glmnet call across every glmnet-supported family — and reachable through as_glmnet() / as_cv_glmnet() for downstream tooling.

Status

infer = "none" only. Family coverage is:

Core supported (parity-checked against raw glmnet): gaussian, binomial, poisson, Gamma (via stats::Gamma(link = "log")), negative binomial (via MASS::negative.binomial(theta = ...)fixed θ only).

Experimental (basic fit / predict paths work, breadth of usage not exhaustively tested): native Cox (family = "cox" with Surv(time, status) ~ ...), multinomial, mgaussian.

Out of scope for the MVP: - Joint θ estimation in the style of MASS::glm.nb(). - Cox-specific extras such as strata, ties handling, and time-varying covariates have not been validated. - infer = "split" and infer = "selective" are planned but not implemented.

See TODO.md for the full backlog.

Installation

Recommended:

pak::pkg_install("dsc-chiba-u/fbrglm")

Alternative:

remotes::install_github("dsc-chiba-u/fbrglm")
# or
devtools::install_github("dsc-chiba-u/fbrglm")

Quick start

Gaussian

library(fbrglm)

set.seed(1)
n <- 100
df <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "gaussian",
              lambda = "cv_min")

coef(fit)
head(predict(fit, type = "response"))
nobs(fit)

Binomial

library(fbrglm)

set.seed(2)
n <- 200
df <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "binomial",
              lambda = "cv_1se")

head(predict(fit, type = "response"))   # probabilities in [0, 1]
head(predict(fit, type = "class"))      # 0/1

Factor predictors

Factors are auto-dummied by model.matrix(). The training factor levels are stored on the fit so predict(newdata = ...) still works when some levels are missing from the test data.

library(fbrglm)

set.seed(3)
n <- 200
train <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    g  = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit <- fbrglm(y ~ x1 + g, data = train,
              family = "gaussian",
              lambda = "fix", lambda_value = 0.05)

# newdata missing level "C" — still works
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
predict(fit, newdata = test, type = "response")

By default, predict() errors on a factor value the model has not seen — same as stats::predict.glm(). Production-style batch scoring can opt into a softer mode:

# `test_unseen$g` contains a new level "D" not in the training data;
# rows with that value get NA, the rest score normally, and a warning
# names how many rows were dropped.
predict(fit, newdata = test_unseen, type = "response",
        on_new_levels = "na")

API

lambda selection

fbrglm() exposes three rules through one argument:

lambda meaning backend
"cv_min" cv.glmnet()$lambda.min (default) glmnet::cv.glmnet()
"cv_1se" cv.glmnet()$lambda.1se glmnet::cv.glmnet()
"fix" uses lambda_value directly glmnet::glmnet(lambda = ...)

lambda = "fix" requires lambda_value (numeric). The numeric λ that the fit actually used is always available as fit$lambda_value.

predict(type = ...)

For the single-response GLM families (gaussian, binomial, poisson, Gamma, negative.binomial, cox):

type gaussian / Gamma / NB binomial poisson cox
"link" η η (logit) η (log) η (linear predictor)
"response" mean (link⁻¹η) probability ∈ [0, 1] rate, exp(η) relative risk, exp(η)
"class" error 0 / 1 (threshold 0.5) error error

Cox predictions are on the linear-predictor / relative-risk scale, not absolute hazards. fbrglm does not estimate the baseline hazard; absolute survival curves require a separate baseline-hazard step (e.g. via survival::basehaz() on a coxph fit, or an equivalent Breslow estimate computed against the coxnet linear predictors).

For the multi-response families:

See vignette("fbrglm-families") for worked examples.

Complete-case bookkeeping: nobs_info

Complete-case filtering happens automatically. The numbers are exposed at fit$nobs_info:

fit$nobs_info$n_total            # rows in the input data
fit$nobs_info$n_dropped_missing  # rows with any NA in the model.frame
fit$nobs_info$n_used             # rows actually fit
nobs(fit)                        # same as $n_used

If any rows are dropped, fbrglm() prints a one-line message.

Inference: only infer = "none" for now

Only infer = "none" is currently enabled, and summary() deliberately does not report classical SE, z, p-values, or confidence intervals: shrinkage bias, data-driven λ selection, and active-set conditioning all break the textbook interpretation of those quantities. The summary() output instead carries a permanent footer naming the three failure modes and the planned remediation paths (infer = "split" for sample-split refits, infer = "selective" for selective inference at the chosen λ). coef() returns the regularized point estimates with NA for any column dropped by the QR-pivot rank check; summary() adds the glm-style “(N not defined because of singularities: ...)” header when that happens, the complete-case nobs_info triple, and the inference policy footer.

Planned (not yet implemented)

Full list and rationale: TODO.md.

Vignettes

Reproducible experiments

Smoke tests and benchmarks live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments.

Currently it contains:

Comparison methods covered in the small benchmarks:

Headline observations from the prediction-failure benchmark, in two factor-level scenarios:

License

MIT — see LICENSE.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.