Scaling KRLS to larger n: the Nystrom approximation

The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Jens Hainmueller and Chad Hazlett

July 2026

Motivation

Exact KRLS solves a Tikhonov problem on the full \(n \times n\) Gaussian kernel. Time is \(O(n^3)\) and memory is \(O(n^2)\), so the practical ceiling on a laptop is around \(n \approx 5{,}000\). For larger samples the kernel allocation alone becomes prohibitive.

The Nystrom approximation replaces the full kernel with a low-rank representation anchored at \(m\) landmark points. The fitted model is \(\hat f(x) = K(x, Z)\, \alpha\) where \(Z\) are the landmarks and \(\alpha\) is a length-\(m\) coefficient vector solved in \(m\)-dimensional space. Time becomes \(O(n m^2 + m^3)\) and memory \(O(n m)\), so KRLS can now handle sample sizes well past the exact ceiling.

KRLS exposes this as an explicit approximation:

krls(X, y, approx = "nystrom", nystrom_m = 100)

Default landmark count is min(500, n). Inference (when vcov = TRUE) is conditional approximate – standard errors condition on the selected landmarks, fixed \(\lambda\), and the low-rank feature map; they are not equivalent to exact KRLS inference.

library(KRLS)
#> ## KRLS Package for Kernel-based Regularized Least Squares.
#> ## See Hainmueller and Hazlett (2014) for details.

A scaling comparison

We simulate a smooth function in three predictors plus a binary indicator, and compare exact KRLS to Nystrom across modest \(n\).

make_data <- function(n) {
  X <- cbind(rnorm(n), rnorm(n), rnorm(n), rbinom(n, 1, 0.4))
  colnames(X) <- c("x1", "x2", "x3", "x4")
  f <- sin(X[, 1]) + 0.3 * X[, 2] + 0.05 * X[, 3]^2 + 0.5 * X[, 4]
  list(X = X, y = f + rnorm(n, sd = 0.3), truth = f)
}

For each \(n\) we fit both paths, record wall-clock time, in-sample \(R^2\), and prediction RMSE on a held-out test set drawn from the same distribution. Predictions are evaluated against the true mean function (no noise on the test target) to focus on approximation error.

sizes <- c(500, 1000)
res <- vector("list", length(sizes))

for (i in seq_along(sizes)) {
  n <- sizes[i]
  tr <- make_data(n)
  te <- make_data(200)
  # Use the package default landmark count, min(500, n).
  m  <- min(500L, n)

  t_e <- system.time({
    fit_e <- krls(tr$X, tr$y, approx = "none",
                  derivative = FALSE, vcov = FALSE,
                  print.level = 0)
  })["elapsed"]
  t_n <- system.time({
    fit_n <- krls(tr$X, tr$y, approx = "nystrom",
                  derivative = FALSE, print.level = 0)
  })["elapsed"]

  pred_e <- predict(fit_e, newdata = te$X)$fit
  pred_n <- predict(fit_n, newdata = te$X)$fit

  res[[i]] <- data.frame(
    n          = n,
    m          = m,
    time_exact = round(as.numeric(t_e), 2),
    time_nys   = round(as.numeric(t_n), 3),
    speedup    = round(as.numeric(t_e) / as.numeric(t_n), 1),
    rmse_exact = round(sqrt(mean((pred_e - te$truth)^2)), 4),
    rmse_nys   = round(sqrt(mean((pred_n - te$truth)^2)), 4)
  )
}
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
do.call(rbind, res)
#>      n   m time_exact time_nys speedup rmse_exact rmse_nys
#> 1  500 500       0.87    0.587     1.5     0.1119   0.1119
#> 2 1000 500       6.37    1.228     5.2     0.0885   0.0885

The Nystrom path stays in milliseconds while exact KRLS grows cubically. Prediction RMSE on held-out data is comparable – Nystrom trades a small amount of approximation error for a large amount of runtime.

Scaling beyond what exact KRLS can do

Past \(n \approx 2{,}000\) the exact path becomes uncomfortable on a typical laptop. Nystrom keeps going:

n  <- 2000
tr <- make_data(n)
te <- make_data(200)
# Default is min(500, n) -- here 500 landmarks at n = 2000.
m  <- min(500L, n)

t_n <- system.time({
  fit_big <- krls(tr$X, tr$y, approx = "nystrom",
                  derivative = FALSE, print.level = 0)
})["elapsed"]
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

pred_big <- predict(fit_big, newdata = te$X)$fit
sprintf("n = %d, m = %d, time = %.3fs, R2 = %.3f, RMSE = %.3f",
        n, m, t_n, fit_big$R2,
        sqrt(mean((pred_big - te$truth)^2)))
#> [1] "n = 2000, m = 500, time = 2.499s, R2 = 0.862, RMSE = 0.075"

The fit still recovers the truth at held-out RMSE comparable to the exact path on the smaller datasets, in a fraction of a second.

Reusing landmarks across fits

When running sensitivity checks – refitting on the same predictors with several outcome variants, or perturbing \(y\) – it is convenient to fix the landmark set so the approximation geometry is held constant. The fitted object stores landmarks in standardized \(X\)-space; to pass them back through krls() use get_landmarks(), which un-standardizes to the original units the function expects:

tr <- make_data(400)
fit <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 30,
            derivative = FALSE, print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

Z <- get_landmarks(fit)              # original X-scale matrix
y_perturbed <- tr$y + rnorm(400, sd = 0.05)

fit2 <- krls(tr$X, y_perturbed, approx = "nystrom",
             landmarks = Z, derivative = FALSE, print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.

# Same landmarks under the hood
identical(unname(fit$landmarks), unname(fit2$landmarks))
#> [1] TRUE

The landmark coordinates are bit-identical, so any difference between the two fits is attributable to the change in \(y\) alone – not to re-selecting anchor points.

Choosing the lambda criterion

KRLS picks the regularization parameter by minimizing a closed-form objective. Two are available:

lambda_method = "loo" (the default) – leave-one-out squared error.
lambda_method = "gcv" – generalized cross-validation, \(RSS(\lambda) / (1 - \mathrm{tr}(S(\lambda))/n)^2\).

For the Nystrom path both are evaluated in \(O(nm)\) per candidate \(\lambda\) from the cached SVD, so the choice is essentially free.

tr <- make_data(500)
fit_loo <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25,
                derivative = FALSE, lambda_method = "loo",
                print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
fit_gcv <- krls(tr$X, tr$y, approx = "nystrom", nystrom_m = 25,
                derivative = FALSE, lambda_method = "gcv",
                print.level = 0)
#> Warning: krls: the following column(s) look categorical but were not listed in cat_columns:
#>   x4 (2 unique values)
#> KRLS does NOT autodetect categoricals. Pass these in cat_columns to get the sqrt(0.5)-scaled one-hot encoding (matching kbal / GPSS), or pass cat_columns = character(0) / integer(0) to silence this warning.
data.frame(
  method = c("loo", "gcv"),
  lambda = c(fit_loo$lambda, fit_gcv$lambda),
  R2     = c(fit_loo$R2, fit_gcv$R2)
)
#>   method    lambda        R2
#> 1    loo 0.1983744 0.8341407
#> 2    gcv 0.3202462 0.8134584

LOO and GCV typically pick lambdas within a small multiplicative factor of each other. GCV is sometimes preferred when one or two observations have disproportionate leverage; LOO has more historical use in KRLS applications.

What carries over from the exact path

A Nystrom fit is still a krls object, so the standard accessors work:

predict(fit, newdata = ...) – predictions; pass se.fit = TRUE for conditional approximate standard errors when the fit was built with vcov = TRUE.
summary(fit) – now prints an Approximation block reporting m, the landmark method, inference type, the landmark-kernel condition number, and the count of eigenvalues at the relative-ridge floor.
tidy(fit), glance(fit), augment(fit) – broom methods are Nystrom-aware. glance() reports approx, nystrom_m, inference, and the effective degrees of freedom.

Limitations

Standard errors are conditional on the selected landmarks, fixed \(\lambda\), and the low-rank feature approximation. They are not equivalent to exact KRLS standard errors and should not be used in contexts that assume exact inference.
Only the Gaussian kernel is supported under approx = "nystrom".
kmeans landmark selection is approximately reproducible under set.seed() but not bit-stable across R versions; for strict reproducibility use the default random sampling or supply landmarks explicitly.

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.