Getting started with fbrglm

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

What `fbrglm` is for

fbrglm is a formula-based front-end for regularized generalized linear models. Internally it delegates the fit to glmnet; the wrapper’s job is to make the user-facing experience look like base R’s glm() — a formula + data.frame, automatic factor handling, complete-case filtering, and the familiar S3 methods (print, summary, coef, predict, nobs, plot).

The MVP described here is infer = "none": it returns regularized point estimates and does not report classical standard errors, z values, p values, or confidence intervals. Honest post-selection inference (via data splitting or selective inference) is on the roadmap; see the package TODO.md.

library(fbrglm)

A small binomial example

n <- 150
dat <- data.frame(
    y  = rbinom(n, 1, 0.5),
    x1 = rnorm(n),
    x2 = rnorm(n),
    x3 = rnorm(n)
)

fit <- fbrglm(y ~ x1 + x2 + x3, data = dat,
              family = "binomial",
              lambda = "cv_min")

print() shows the call and the basics of the fit:

print(fit)
#> Formula-based regularized GLM (fbrglm)
#> 
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial", 
#>     lambda = "cv_min")
#> 
#> Family:        binomial
#> Alpha:         1
#> Lambda rule:   cv_min
#> Lambda value:  0.05370831
#> Inference:     none
#> Observations:  total = 150, dropped = 0, used = 150
#> Non-zero terms: 0 / 3

summary() returns a structured object that includes the call, family, chosen λ, complete-case bookkeeping, and the (regularized) coefficient vector with zeros included:

summary(fit)
#> fbrglm summary
#> ==============
#> 
#> Call:
#> fbrglm(formula = y ~ x1 + x2 + x3, data = dat, family = "binomial", 
#>     lambda = "cv_min")
#> 
#> Family:        binomial
#> Lambda method: cv_min
#> Lambda value:  0.05370831
#> Inference:     none
#> 
#> Observations:
#>   total = 150, dropped (missing) = 0, used = 150
#> 
#> Coefficients:
#>             Estimate
#> (Intercept)    -0.08
#> x1              0.00
#> x2              0.00
#> x3              0.00
#> 
#> Non-zero predictors (0): (none)
#> 
#> Note: no standard errors, z-values, or p-values are reported under
#> infer = "none". Classical inference does not account for the
#> shrinkage bias from L1/L2 penalisation or for the data-driven
#> selection of lambda. Use infer = "split" / "selective" (planned)
#> for valid post-selection inference; see vignette("fbrglm").

Coefficients and predictions follow the same shapes you’d expect from glm():

coef(fit)
#> (Intercept)          x1          x2          x3 
#> -0.08004271  0.00000000  0.00000000  0.00000000

head(predict(fit, newdata = dat[1:5, ], type = "response"))
#> [1] 0.48 0.48 0.48 0.48 0.48

A plot() method is registered; it delegates to plot.cv.glmnet() when λ was chosen by cross-validation, and to plot.glmnet() otherwise.

plot(fit)

Choosing `lambda`

There are three rules, exposed through a single argument:

fit_min <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_min")
fit_1se <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial", lambda = "cv_1se")
fit_fix <- fbrglm(y ~ x1 + x2 + x3, data = dat,
                  family = "binomial",
                  lambda = "fix", lambda_value = 0.05)

c(cv_min = fit_min$lambda_value,
  cv_1se = fit_1se$lambda_value,
  fix    = fit_fix$lambda_value)
#>     cv_min     cv_1se        fix 
#> 0.05370831 0.05370831 0.05000000

"cv_min" and "cv_1se" go through cv.glmnet(); "fix" skips CV and goes straight to glmnet() at the supplied lambda_value. The numeric λ actually used is always available at fit$lambda_value.

Factor predictors

Factor columns are auto-dummied via model.matrix(), and the training factor levels are stored on the fit object so predict(newdata = ...) can rebuild a design matrix that matches the training column structure — even when some training levels are missing from newdata.

n_train <- 200
train <- data.frame(
    y  = rbinom(n_train, 1, 0.5),
    x1 = rnorm(n_train),
    g  = factor(sample(c("A", "B", "C"), n_train, replace = TRUE),
                levels = c("A", "B", "C"))
)
fit_f <- fbrglm(y ~ x1 + g, data = train,
                family = "binomial",
                lambda = "fix", lambda_value = 0.05)

## newdata is missing level "C"
test <- data.frame(
    x1 = rnorm(10),
    g  = factor(rep(c("A", "B"), 5), levels = c("A", "B", "C"))
)
head(predict(fit_f, newdata = test, type = "response"))
#> [1] 0.56 0.56 0.56 0.56 0.56 0.56

fbrglm also tolerates the narrower case where newdata’s factor has its levels narrowed (not just its values): missing one-hot columns are padded with zeros before being handed to glmnet.

Missing values

fbrglm() drops rows with any NA from the modelling frame, prints a one-line note, and records the counts on the fit object under fit$nobs_info.

dat_na <- dat
dat_na$y[1:5] <- NA
fit_na <- fbrglm(y ~ x1 + x2 + x3, data = dat_na,
                 family = "binomial",
                 lambda = "fix", lambda_value = 0.05)
#> fbrglm: dropped 5 row(s) with missing values (145 / 150 kept).

fit_na$nobs_info
#> $n_total
#> [1] 150
#> 
#> $n_dropped_missing
#> [1] 5
#> 
#> $n_used
#> [1] 145
nobs(fit_na)
#> [1] 145

Offsets

offset at fit time goes through to glmnet(); at predict time, pass newoffset of matching length. With newdata = NULL the stored training offset is reused; with newdata supplied, an explicit newoffset is required.

n_off <- 80
dat_off <- data.frame(
    y  = rbinom(n_off, 1, 0.5),
    x1 = rnorm(n_off),
    x2 = rnorm(n_off)
)
fit_off <- fbrglm(y ~ x1 + x2, data = dat_off, family = "binomial",
                  offset = rep(0.2, n_off),
                  lambda = "fix", lambda_value = 0.05)

head(predict(fit_off, type = "response"))                  # reuses training offset
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625 0.5625
head(predict(fit_off, newdata = dat_off[1:5, ],
             newoffset = rep(0.2, 5), type = "response"))
#> [1] 0.5625 0.5625 0.5625 0.5625 0.5625

Reaching the underlying `glmnet` objects

If you need to use a glmnet-specific tool, two accessors get you out of the wrapper:

class(as_glmnet(fit_min))
#> [1] "lognet" "glmnet"
class(as_cv_glmnet(fit_min))
#> [1] "cv.glmnet"

class(as_glmnet(fit_fix))
#> [1] "lognet" "glmnet"
as_cv_glmnet(fit_fix)        # NULL — no CV was run
#> NULL

as_glmnet() returns the underlying glmnet object (the $glmnet.fit slot when the wrapper used CV). as_cv_glmnet() returns the cv.glmnet object, or NULL for the "fix" λ path.

Limitations (intentional)

The MVP is deliberately narrow:

only infer = "none" is implemented; "split" and "selective" are planned but not in this release.
families: gaussian, binomial, poisson only. multinomial and cox will land later.
the x / y direct-matrix entry point is reserved but not yet supported — supply formula + data instead.
classical glm()-style standard errors, z / p values, and confidence intervals are intentionally not shown for infer = "none". Doing so naively for regularized estimators would be misleading; honest inference is the next milestone.

Reproducible benchmarks against raw glmnet, glmnetUtils, and a parsnip / workflows pipeline with the glmnet engine live in a separate repository: https://github.com/dsc-chiba-u/fbrglm-experiments. In the prediction-failure case (narrowed test factor levels), raw glmnet built naively can fail; parsnip / workflows succeeds but with higher runtime overhead than fbrglm in the tested small-data setting. See the experiments repo for the CSVs and figures.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.

Getting started with fbrglm

What fbrglm is for

A small binomial example

Choosing lambda

Factor predictors

Missing values

Offsets

Reaching the underlying glmnet objects

Limitations (intentional)

What `fbrglm` is for

Choosing `lambda`

Reaching the underlying `glmnet` objects