The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

fbrglm

Safe formula-based wrapper around glmnet: bring glm()’s strict modeling-workflow semantics — formula + data in, predict-time design matrix reconstructed from a frozen recipe out — to L1 / L2 / elastic-net regularized GLMs.

Why

glmnet is the de facto standard for regularized GLMs in R, but its matrix-shaped API hands the full model recipe back to the caller: factor encoding, contrasts, complete-case filtering, rank deficiency, and predict-time design-matrix reconstruction. The recurring failure modes are:

the test factor’s levels narrow (or an interaction with such a level is involved), so glmnet::predict() errors on a column-width mismatch;
a session’s options("contrasts") setting differs between fit and predict, so the design matrix silently changes meaning;
a column is linearly dependent and gets reported as a numeric 0 — visually identical to a coefficient the L1 penalty shrunk to zero;
production data carries a factor value the training set never saw, and there is no built-in way to keep the batch alive;
complete-case bookkeeping is not exposed on the fit object.

fbrglm brings stats::glm()’s strict conventions to the glmnet engine: frozen terms / xlevels / contrasts on the fit, QR-pivot rank-deficient column drop with NA in coef() / summary(), glm-style error on novel test factor levels (with an opt-in on_new_levels = "na" for production scoring), explicit nobs_info complete-case counts, and an explicit refusal to print classical SE / z / p / CI under infer = "none". The underlying glmnet / cv.glmnet calls are unchanged — predictions are bit-identical to a hand-built raw-glmnet call across every glmnet-supported family — and reachable through as_glmnet() / as_cv_glmnet() for downstream tooling.

Status

infer = "none" only. Family coverage is:

Core supported (parity-checked against raw glmnet): gaussian, binomial, poisson, Gamma (via stats::Gamma(link = "log")), negative binomial (via MASS::negative.binomial(theta = ...) — fixed θ only).

Experimental (basic fit / predict paths work, breadth of usage not exhaustively tested): native Cox (family = "cox" with Surv(time, status) ~ ...), multinomial, mgaussian.

Out of scope for the MVP: - Joint θ estimation in the style of MASS::glm.nb(). - Cox-specific extras such as strata, ties handling, and time-varying covariates have not been validated. - infer = "split" and infer = "selective" are planned but not implemented.

See TODO.md for the full backlog.

Installation

Recommended:

pak::pkg_install("dsc-chiba-u/fbrglm")

Alternative:

remotes::install_github("dsc-chiba-u/fbrglm")
# or
devtools::install_github("dsc-chiba-u/fbrglm")

Quick start

Gaussian

library(fbrglm)

set.seed(1)
n <- 100
df <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "gaussian",
              lambda = "cv_min")

coef(fit)
head(predict(fit, type = "response"))
nobs(fit)

Binomial

library(fbrglm)

set.seed(2)
n <- 200
df <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2, data = df,
              family = "binomial",
              lambda = "cv_1se")

head(predict(fit, type = "response"))   # probabilities in [0, 1]
head(predict(fit, type = "class"))      # 0/1

Factor predictors

Factors are auto-dummied by model.matrix(). The training factor levels are stored on the fit so predict(newdata = ...) still works when some levels are missing from the test data.

library(fbrglm)

set.seed(3)
n <- 200
train <- data.frame(
    y  = rnorm(n),
    x1 = rnorm(n),
    g  = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit <- fbrglm(y ~ x1 + g, data = train,
              family = "gaussian",
              lambda = "fix", lambda_value = 0.05)

# newdata missing level "C" — still works
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
predict(fit, newdata = test, type = "response")

By default, predict() errors on a factor value the model has not seen — same as stats::predict.glm(). Production-style batch scoring can opt into a softer mode:

# `test_unseen$g` contains a new level "D" not in the training data;
# rows with that value get NA, the rest score normally, and a warning
# names how many rows were dropped.
predict(fit, newdata = test_unseen, type = "response",
        on_new_levels = "na")

API

`lambda` selection

fbrglm() exposes three rules through one argument:

`lambda`	meaning	backend
`"cv_min"`	`cv.glmnet()$lambda.min` (default)	`glmnet::cv.glmnet()`
`"cv_1se"`	`cv.glmnet()$lambda.1se`	`glmnet::cv.glmnet()`
`"fix"`	uses `lambda_value` directly	`glmnet::glmnet(lambda = ...)`

lambda = "fix" requires lambda_value (numeric). The numeric λ that the fit actually used is always available as fit$lambda_value.

`predict(type = ...)`

For the single-response GLM families (gaussian, binomial, poisson, Gamma, negative.binomial, cox):

`type`	gaussian / Gamma / NB	binomial	poisson	cox
`"link"`	η	η (logit)	η (log)	η (linear predictor)
`"response"`	mean (link⁻¹η)	probability ∈ [0, 1]	rate, `exp(η)`	relative risk, `exp(η)`
`"class"`	error	0 / 1 (threshold 0.5)	error	error

Cox predictions are on the linear-predictor / relative-risk scale, not absolute hazards. fbrglm does not estimate the baseline hazard; absolute survival curves require a separate baseline-hazard step (e.g. via survival::basehaz() on a coxph fit, or an equivalent Breslow estimate computed against the coxnet linear predictors).

For the multi-response families:

multinomial: "link" / "response" return an (n × k) matrix (one column per class); "class" returns the argmax class label.
mgaussian: "link" / "response" return an (n × q) matrix (one column per response); "class" errors.

See vignette("fbrglm-families") for worked examples.

Complete-case bookkeeping: `nobs_info`

Complete-case filtering happens automatically. The numbers are exposed at fit$nobs_info:

fit$nobs_info$n_total            # rows in the input data
fit$nobs_info$n_dropped_missing  # rows with any NA in the model.frame
fit$nobs_info$n_used             # rows actually fit
nobs(fit)                        # same as $n_used

If any rows are dropped, fbrglm() prints a one-line message.

Inference: only `infer = "none"` for now

Only infer = "none" is currently enabled, and summary() deliberately does not report classical SE, z, p-values, or confidence intervals: shrinkage bias, data-driven λ selection, and active-set conditioning all break the textbook interpretation of those quantities. The summary() output instead carries a permanent footer naming the three failure modes and the planned remediation paths (infer = "split" for sample-split refits, infer = "selective" for selective inference at the chosen λ). coef() returns the regularized point estimates with NA for any column dropped by the QR-pivot rank check; summary() adds the glm-style “(N not defined because of singularities: ...)” header when that happens, the complete-case nobs_info triple, and the inference policy footer.

Planned (not yet implemented)

infer = "split" — data splitting with selection_frac for honest post-selection SEs / p-values / CIs via a base-R glm() refit.
infer = "selective" — selective inference at the chosen λ.
Broader Cox coverage (strata, ties handling, time-varying covariates) and corresponding tests.
Joint θ estimation for negative binomial (MASS::glm.nb()-style).

Full list and rationale: TODO.md.

Vignettes

Getting started (vignette("fbrglm")) — formula / data, λ selection, predict(type = ...), nobs_info, factor narrowing, offsets.
Families and model types (vignette("fbrglm-families")) — worked examples for linear, logistic, Poisson (with offset), Gamma, negative binomial (fixed θ), native Cox (Surv() LHS), plus the experimental multinomial and mgaussian paths. Two survival routes are shown explicitly: the piecewise exponential Poisson model on long-format data and the native Cox path via family = "cox"; these are different models and the vignette does not conflate them.

Reproducible experiments

Smoke tests and benchmarks live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments.

Currently it contains:

smoke tests for MVP behavior (gaussian / binomial / poisson basics, glmnet parity at fixed λ, cv.glmnet parity at cv_min / cv_1se, factor newdata, complete-case bookkeeping)
a prediction-failure benchmark for train/test factor-level mismatch
a small runtime benchmark
generated plots for both benchmarks

Comparison methods covered in the small benchmarks:

raw glmnet
glmnetUtils
parsnip / workflows with the glmnet engine

Headline observations from the prediction-failure benchmark, in two factor-level scenarios:

Narrowed test (train: A/B/C/D, test: A/B). fbrglm, the glmnet_raw_safe (manual re-level) path, glmnetUtils with use.model.frame = TRUE, and the parsnip / workflows pipeline all succeed. Default glmnet_raw_naive and default glmnetUtils fail with a column-width error.
Novel level test (train: A/B, test: A/B/C/D). fbrglm’s default reproduces stats::predict.glm()’s “factor has new levels” error verbatim, as does glmnetUtils with use.model.frame = TRUE and stats::glm() itself. The default fast paths get the column-width error instead. parsnip warns and silently coerces novel cells to the reference level; fbrglm’s opt-in on_new_levels = "na" is the only path that returns NA at novel-level rows and finite predictions elsewhere.

License

MIT — see LICENSE.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.