The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
The fastglm package is intended as a fast
and stable alternative to
stats::glm() for fitting generalized linear models. It is
compatible with R’s family objects (see
?family) and produces fitted objects whose downstream
methods — summary(), vcov(),
predict(), coef(), residuals(),
logLik() — behave exactly like those for a
glm. This vignette walks through the main functionality of
the package at a higher level than the other vignettes; it is intended
as a single entry point for a new user.
The main package function is fastglm(). It takes a
numeric design matrix x, a response y, and an
R family object:
data(esoph)
x <- model.matrix(~ agegp + unclass(tobgp) + unclass(alcgp),
data = esoph)
y <- cbind(esoph$ncases, esoph$ncontrols)
fit <- fastglm(x, y, family = binomial(link = "cloglog"))
summary(fit)
#>
#> Call:
#> fastglm.default(x = x, y = y, family = binomial(link = "cloglog"))
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -4.29199 0.29777 -14.414 < 2e-16 ***
#> agegp.L 3.30677 0.63454 5.211 1.88e-07 ***
#> agegp.Q -1.35525 0.57310 -2.365 0.018 *
#> agegp.C 0.20296 0.43333 0.468 0.640
#> agegp^4 0.15056 0.29318 0.514 0.608
#> agegp^5 -0.23425 0.17855 -1.312 0.190
#> unclass(tobgp) 0.33276 0.07285 4.568 4.93e-06 ***
#> unclass(alcgp) 0.80384 0.07538 10.664 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 61.609 on 87 degrees of freedom
#> Residual deviance: 96.950 on 80 degrees of freedom
#> AIC: 228
#>
#> Number of Fisher Scoring iterations: 6fastglm() does not accept a formula directly, it
operates on a pre-built design matrix. To use a formula and a data
frame, pass fastglm_fit as the fitting function to base
glm():
The IRLS algorithm reduces every iteration to a weighted least-squares problem. fastglm supports six different matrix decompositions for solving that WLS step, all from RcppEigen (Bates and Eddelbuettel, 2013); the choice trades off speed against numerical stability and rank-revealing behaviour:
method |
decomposition |
|---|---|
0 |
column-pivoted Householder QR (default; rank-revealing) |
1 |
unpivoted Householder QR |
2 |
LLT Cholesky |
3 |
LDLT Cholesky |
4 |
full-pivoted Householder QR |
5 |
bidiagonal divide-and-conquer SVD |
The default (method = 0) is the safe choice: it is
rank-revealing, so it handles aliased / collinear columns gracefully.
The Cholesky methods (2 and 3) are roughly
3–4× faster but assume full column rank.
set.seed(123)
n <- 5000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x %*% rnorm(p) * 0.05))
system.time(f0 <- fastglm(x, y, family = binomial())) # default
#> user system elapsed
#> 0.005 0.000 0.004
system.time(f2 <- fastglm(x, y, family = binomial(), method = 2)) # LLT
#> user system elapsed
#> 0.002 0.000 0.002
system.time(f3 <- fastglm(x, y, family = binomial(), method = 3)) # LDLT
#> user system elapsed
#> 0.003 0.000 0.003fastglm does not compromise computational stability for
speed. It uses a step-halving safeguard following Marschner (2011) and
starts from better-initialised values than glm() or
glm2::glm2(), so it tends to converge in cases where the
standard IRLS algorithm fails. See
vignette("fastglm", package = "fastglm") for a
Gamma-with-sqrt-link example where glm() does
not converge but fastglm() does.
vcov(), robust SE, predictionsThe fitted object stores the unscaled covariance directly (no
refitting), and the standard vcov() /
summary() machinery works as expected:
fit <- fastglm(x, y, family = binomial(), method = 2)
V <- vcov(fit)
dim(V)
#> [1] 30 30
## standard errors agree with the diagonal of vcov()
all.equal(sqrt(diag(V)), unname(coef(summary(fit))[, "Std. Error"]))
#> [1] "names for target but not for current"Heteroskedasticity-consistent and cluster-robust covariance matrices
are available via sandwich::vcovHC() and
sandwich::vcovCL() — fastglm registers methods
on those generics, so loading sandwich is all that is
required:
library(sandwich)
V_hc <- vcovHC(fit, type = "HC0")
V_hc[1:3, 1:3]
#> X1 X2 X3
#> X1 8.546454e-04 7.509395e-06 1.483571e-06
#> X2 7.509395e-06 8.321108e-04 7.951566e-07
#> X3 1.483571e-06 7.951566e-07 8.351855e-04
cluster <- rep(seq_len(20), length.out = n)
V_cl <- vcovCL(fit, cluster = cluster, type = "HC1")
V_cl[1:3, 1:3]
#> X1 X2 X3
#> X1 0.0007676800 -3.021035e-04 1.517204e-04
#> X2 -0.0003021035 8.060696e-04 8.492543e-05
#> X3 0.0001517204 8.492543e-05 7.823621e-04Results are numerically identical to sandwich::vcovHC()
and sandwich::vcovCL() applied to a glm fit
(Zeileis, Köll, and Graham, 2020) to floating-point precision. They work
for sparse and big.matrix fits as well, since the full
design matrix is preserved on the fitted object (as a reference, not
copied).
predict() supports se.fit = TRUE:
For designs that are sparse, that live on disc, or that have to be built from a parquet / arrow / DuckDB source, fastglm provides three large-data paths that share a common streaming kernel and produce identical results:
Matrix::dgCMatrix: pass directly to
fastglm(). Useful for one-hot encoded categoricals and
high-dimensional sparse designs.bigmemory::big.matrix: pass directly to
fastglm(). The matrix is read in row-blocks and never fully
materialised in memory.fastglm_streaming(chunk_callback, n_chunks, family): a
user-supplied closure yields one row-block per call. The right path for
fitting on a parquet dataset, DuckDB query, or any external
columnar store.See vignette("large-data-fastglm", package = "fastglm")
for a detailed walk through all three paths. A short example:
n <- 4000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
y <- rbinom(n, 1, plogis(X %*% c(0.2, 0.4, -0.2, 0.3)))
chunk_size <- 1000
chunks <- function(k) {
idx <- ((k - 1) * chunk_size + 1):(k * chunk_size)
list(X = X[idx, , drop = FALSE], y = y[idx])
}
fit_stream <- fastglm_streaming(chunks, n_chunks = 4, family = binomial())
fit_full <- fastglm(X, y, family = binomial(), method = 2)
max(abs(coef(fit_stream) - coef(fit_full)))
#> [1] 1.111683e-07For the most commonly-used family/link
combinations, fastglm dispatches variance(),
mu.eta(), linkinv(), and
dev.resids() to inline C++ implementations rather than
calling back into R once per IRLS iteration. The covered
combinations are:
Detection is automatic, if the family object matches one
of the above, the native path is used; otherwise fastglm falls
back to the R-callback path used in earlier versions. The
native path is meaningfully faster on large n because it
eliminates the per-iteration round-trip to R for each of
the four family functions:
library(microbenchmark)
set.seed(7)
n <- 50000; p <- 30
x <- matrix(rnorm(n * p), n, p)
y <- rpois(n, exp(x %*% rnorm(p) * 0.05 + 1))
## force the R-callback path by giving the family object an unrecognised name
disguise <- function(fam) { fam$family <- paste0(fam$family, "*"); fam }
mb <- microbenchmark(
native = fastglm(x, y, family = poisson(), method = 2),
callback = fastglm(x, y, family = disguise(poisson()), method = 2),
times = 5L
)
print(mb)
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> native 17.62336 17.78781 18.29158 17.92971 18.33762 19.77942 5 a
#> callback 25.22529 25.24013 25.66594 25.33501 26.04398 26.48530 5 bThe fastglm package switches transparently between the two paths; user code does not change.
A handful of regressions that come up often in practice are not standard GLMs but build on the same IRLS machinery. fastglm provides dedicated entry points for these:
fastglm_nb() — negative-binomial
regression with the dispersion theta estimated jointly with
beta. Plays the same role as MASS::glm.nb()
but runs the joint (beta, theta) MLE entirely in C++ (IRLS
for beta, Brent for theta).firth = TRUE — a flag on
fastglm() / fastglm_fit() that activates
Firth’s bias-reducing penalty on the score. Useful for binary logistic
regression under separation or in small samples; analogous to
logistf::logistf().fastglm_hurdle() — a two-part count
model: a binary regression for whether y > 0, plus a
zero-truncated Poisson or NB regression on the positive subset. The two
parts factorize, and both are fit by the same C++ IRLS solver. Same
model as pscl::hurdle().fastglm_zi() — a zero-inflated Poisson
or NB regression: a binary inflation component overlaid on the original
count distribution, fit by an EM algorithm in C++ with closed-form
posterior responsibilities and an analytical observed-information
vcov. Same model as pscl::zeroinfl().All four take the standard fastglm tuning options
(method, tol, maxit) and return
objects with the usual coef(), vcov(),
predict(), logLik() methods.
fastglm_hurdle() and fastglm_zi() use the
Formula package’s two-RHS syntax
y ~ x1 + x2 | z1 + z2 to specify different designs for the
count and zero parts. See
vignette("count-firth-fastglm", package = "fastglm") for
examples and timing comparisons against the canonical reference
packages.
There are three R-side functions to fit a GLM with
fastglm. They all call into the same C++ solver, but differ in
how much of base R’s machinery they wrap around it:
fastglm() is the top-level S3
function. It returns an object of class "fastglm" with
deviance, AIC, null deviance, residuals, etc. populated — the right
entry point for interactive use.fastglm_fit() is designed to be passed
as method = fastglm_fit to base glm(). The
returned object inherits from "fastglmFit" so that
glm()’s formula/data handling, update(),
predict(), etc. all work on top.fastglmPure() is the lowest-level
wrapper: no dispersion, no AIC, no null-deviance computation. Use this
when calling fastglm from another package and you only need the
coefficients.Marschner, I. C. (2011). glm2: Fitting generalized linear models with convergence problems. The R Journal, 3(2), 12–15. doi:10.32614/RJ-2011-012
Bates, D. and Eddelbuettel, D. (2013). Fast and elegant numerical linear algebra using the RcppEigen package. Journal of Statistical Software, 52(5), 1–24. doi:10.18637/jss.v052.i05
Zeileis, A., Köll, S., and Graham, N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95(1), 1–36. doi:10.18637/jss.v095.i01
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.