The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
Adds the two convention items deferred from Chad Hazlett’s review of
the 1.6 line: a cat_columns argument for kbal/GPSS-style
one-hot rescaling of categorical inputs, and a b_maxvarK
bandwidth selector that becomes the new default for
sigma.
krls(..., cat_columns = NULL) — character vector of
column names (or integer indices) flagging categorical inputs. When
supplied, those columns are one-hot encoded with all levels (no
reference cell) and multiplied by sqrt(0.5), matching
kbal::one_hot and gpss:::one_hot. The
sqrt(0.5) factor compensates for the doubled
squared-Euclidean distance from carrying both “in level k” and “not in
level k” indicators, restoring a Hamming-like distance between two
observations differing in one category. Continuous columns are still
standardized to sd = 1.
When cat_columns is left at NULL,
krls() scans the input and emits a single warning if it
finds columns that look categorical (factor / character / logical, or
numeric with <= 10 unique values) but were not declared.
There is no autodetection; the warning is a friendly nudge that mirrors
how kbal and gpss handle the same ambiguity.
Pass cat_columns = integer(0) (or
character(0)) to acknowledge the situation and silence the
warning.
sigma = NULL (the default) now triggers
b_maxvarK() — bandwidth chosen by maximizing the variance
of the off-diagonal kernel entries. The pre-1.7 default
sigma = ncol(X) was a dimensional heuristic; the maxvarK
choice picks the bandwidth that makes the columns of K most informative
and matches the convention in kbal and gpss. A
one-line message() notes the change at first call and tells
users how to pin pre-1.7 behavior
(sigma = ncol(X_processed)).
Bit-identical reproduction of v1.5-2 and v1.6-x fits remains
available by pinning sigma and approx = "none"
explicitly.
b_maxvarK(X_processed) — exact-path bandwidth selector.
Returns the sigma that maximizes off-diagonal var(K) over
the search interval.b_maxvarK_nystrom(X_processed, Z_processed) —
Nystrom-path bandwidth selector that operates on the n × m cross-kernel
C = K(X, Z).$X_proc — the kernel-space (preprocessed) matrix used
internally. Identical column count to $X when
cat_columns is not used.$prep — preprocessing metadata (centers, scales, factor
levels) for predict() to re-apply the same transformation
to newdata.$prep continue to work
in predict() via a legacy code path.Bug-fix patch on top of 1.6-0.
krls(derivative = TRUE, vcov = FALSE) under the default
approx = "auto" no longer crashes with the cryptic
object 'vcovmatc' not found. The derivative+vcov guard was
previously only enforced under approx = "none"
(pre-dispatch); it now re-runs after the auto-dispatch resolves and
produces a clear error when the exact path is chosen. The same
combination remains supported under approx = "nystrom",
which returns point-estimate derivatives without SEs.nystrom_m before using it
as the dispatch threshold or printing the switchover message. Previously
nystrom_m = Inf produced an opaque “missing value where
TRUE/FALSE needed”, and nystrom_m = 1.5 announced
nystrom_m = 1 before the real validator caught it. Both now
error cleanly with the same “must be a finite integer” message.README.md and
vignette("krls-nystrom-scaling") now describe the default
nystrom_m as min(500, N) (was still listed as
the pre-1.6-0 min(500, ceiling(sqrt(N))) heuristic). The
scaling vignette’s benchmark code now uses the actual default rather
than ceiling(sqrt(n)).Refinements from Chad Hazlett’s review of the Nystrom path. Three user-visible additions; one default-behavior change at large samples.
approx = "auto" is now the default)krls() now defaults to approx = "auto",
which uses the exact path when N <= nystrom_m and
switches to approx = "nystrom" with a one-line
message() when N > nystrom_m. The new
behavior benefits casual users who do not realize the Nystrom path
exists, while preserving the call-site honesty of an explicit
approximation flag.approx = "none" forces the exact path regardless of
N and reproduces every KRLS <= 1.5-2 fit bit-for-bit.
Users who need the historical default for reproducibility should set
this explicitly.eigtrunc, derivative = TRUE with
vcov = FALSE) fall through to the exact path silently under
approx = "auto".nystrom_mnystrom_m is now min(500, N) (was
min(500, ceiling(sqrt(N)))). The old sqrt(N)
floor gave overly coarse approximations at moderate N
(e.g. m = 100 at N = 10000) given how cheap
the cached-SVD Nystrom path is. The new default caps at 500 regardless
and uses every row as a landmark for small samples.landmark_seed argument with RNG-hygienekrls(..., landmark_seed = NULL) lets users pin Nystrom
landmark selection without touching their global RNG state. When
non-NULL, .Random.seed is saved before landmark selection
and restored on exit, so the seed is consumed locally and the caller’s
downstream draws are unaffected. Two krls() calls with the same
landmark_seed produce bit-identical landmarks regardless of
the surrounding set.seed() context.man/krls.Rd now states the Nystrom approximation
explicitly as K ≈ C W_reg^{-1} C' after the relative-ridge
floor, matching the internal implementation.sigma for the data scale, or landmarks placed far from the
observations) now signal a clear error pointing at sigma/landmark scale,
instead of failing later inside the lambda-bound search with a cryptic
missing value where TRUE/FALSE needed. The guard runs
whether lambda is supplied directly or auto-selected.Bug-fix patch for the Nystrom path that landed in 1.4-0 / 1.4-1.
nystrom_m (notably
m = 1) no longer collapses. The previous bound heuristic
could only satisfy EDF >= 1 at U = 0,
leaving the golden-section search running over a near-zero interval and
selecting an essentially unregularized lambda. The bound logic now
anchors at the dominant landmark eigenvalue and grows U as needed; the
search runs over a usable window for every supported m.landmark_method = "kmeans" now handles
nystrom_m == nrow(X) by short-circuiting to one cluster per
row, instead of letting stats::kmeans() error on
centers >= n.lambda_method = "gcv" and
print.level > 1, both backends now correctly label the
printed scalar as the GCV minimum (was hard-coded as “Loo-Loss”).man/krls.Rd no longer claims
landmark_method = "kmeans" is reserved for a future
release; it documents the actual behavior.dev/03_nystrom_bootstrap_validation.R assigns column
names to its simulated X and prefers pkgload::load_all()
when available so the script validates the source tree rather than a
stale installed package.lambda_method = "gcv"krls(..., lambda_method = c("loo", "gcv")) adds a
generalized cross-validation alternative to the historical leave-one-out
criterion. GCV minimizes
RSS(lambda) / (1 - tr(S(lambda))/n)^2 and is computed in
closed form from the kernel eigendecomposition (exact path) or from the
cached SVD of Phi (Nystrom path). Both methods evaluate at the same
per-iteration cost, so the choice is essentially free. Default is
"loo" – backward-compatible with all earlier KRLS
releases.lambda_method.vignette("krls-nystrom-scaling", package = "KRLS")
walks through the exact-vs-Nystrom runtime comparison, the
landmark-reuse pattern via get_landmarks(), and the
LOO-vs-GCV trade-off. Covers what to expect when moving past the exact
path’s ~5,000-row ceiling.Polish patch for the Nystrom path that landed in 1.4-0. All changes are additive and non-breaking; existing code paths are unaffected.
landmark_method = "kmeans"krls(..., approx = "nystrom", landmark_method = "kmeans")
selects landmarks as the cluster centers of a
stats::kmeans() run on the standardized predictors. More
robust than random sampling when the predictor distribution is
imbalanced. Random sampling remains the default because it is simpler
and bit-reproducible under set.seed(); k-means
initialization is reproducible within an R version but not bit-stable
across versions.get_landmarks()
accessorget_landmarks(fit, scale = c("original", "standardized"))
returns the landmark coordinates from a Nystrom fit. The default
"original" scale un-standardizes the stored matrix so it
can be passed back through krls(..., landmarks = ...)
without the standardize-twice bug. Useful for sensitivity-check refits
and for inspecting which points the approximation is anchored at.summary() on a Nystrom fit now prints the approximation
configuration (m, landmark method, inference type) and the
landmark-kernel condition number alongside the count of eigenvalues that
landed at the relative-ridge floor. When more than half of the landmark
eigenvalues are floored, an explanatory note is emitted pointing at
sigma / m / landmark distinctness as the likely cause.glance() gains approx,
nystrom_m, and inference columns. Under
Nystrom, eff_df is now computed correctly from the cached
Phi spectrum (Sigma2) rather than returning
NA.krls() now rejects user-supplied lambda-search windows
where L >= U up front rather than producing a silently
invalid golden-section sweep.landmarks matrices are checked for NAs,
non-finite entries, and duplicate rows (which would make the landmark
kernel rank-deficient).landmark_method: "random",
"kmeans", "user_indices", or
"user_matrix" recording which path produced the
landmarks.Sigma2, floored_count,
D_min_raw, D_max_raw: diagnostic scalars from
the W eigendecomposition; consumed by summary() and
glance() and available to downstream code.krls(..., approx = "nystrom") adds an explicitly
approximate low-rank fit path for larger data. The approximation uses
landmark points, a stabilized Nystrom feature map, and an
m-space ridge solve. Time becomes O(N m^2 + m^3) and memory
O(N m), making KRLS feasible at sample sizes well beyond the exact
path’s ~5000-row ceiling.approx, nystrom_m
(default min(500, ceiling(sqrt(N)))),
landmarks, landmark_method
("random" implemented; "kmeans" reserved for a
future release), nystrom_eps (relative-ridge stabilization
of W).landmarks accepts three forms: NULL
(auto-select), an integer vector of row indices into X, or
an m by D numeric matrix in the original
X units (standardized internally).vcov = TRUE, Nystrom fits report conditional
approximate standard errors for coefficient weights, predictions, and
average derivatives. These standard errors condition on the selected
landmarks, fixed lambda, and the low-rank feature
approximation – they are not equivalent to exact KRLS standard errors.
The full N x N fitted-value covariance matrix is not
stored, preserving the memory benefit of the approximation. A new
inference field on the fit object reports
"conditional_nystrom" or "none".predict(fit, ..., se.fit = TRUE) works under Nystrom
when the fit was built with vcov = TRUE.summary(), tidy(), and
fdskrls() (binary first-differences) handle Nystrom fits
cleanly, with or without standard errors.dev/03_nystrom_bootstrap_validation.R (developer-side
check, not shipped to CRAN).sum(L' V L) = (L 1)' V (L 1), reducing the per-predictor
work from O(n^3) to O(n^2) without materializing the full
L' V L product. The default krls() call is
roughly 1.2x to 3x faster on moderate-to-large fits (n in the few
hundred to a few thousand range), with the largest gains where the
variance computation dominated the runtime.var.avgderivatives, which differs by ~1e-15
(machine precision; an irreducible side effect of the change in
summation order). All other quantities — coeffs,
fitted, lambda, R2,
Looe, derivatives,
avgderivatives, vcov.c — are bit-for-bit
identical.krls() now correctly propagates the user-supplied
tol argument to lambdasearch(). Previously,
krls(X, y, tol = ...) silently ignored tol and
used the default convergence tolerance; calling
lambdasearch() directly worked as documented.solveforc() and the variance-covariance construction in
krls() no longer error when eigtrunc is
aggressive enough to keep only a single eigenvector. The single-column
subset of Eigenobject$vectors is now extracted with
drop = FALSE, preserving its matrix shape for downstream
multdiag() / tcrossprod().krls(y ~ x1 + x2, data = df) is now supported alongside
the matrix interface. The original krls(X, y) call
continues to work unchanged. Drops the intercept column automatically;
rejects NAs in either side with the same error messages the matrix
interface uses.tidy(fit) returns a per-predictor data frame:
estimate (the average marginal effect),
std.error, statistic, p.value,
conf.low/conf.high, plus pointwise quartiles
(q25, median, q75) so users can
see the heterogeneity that the AME averages over.
glance(fit) is a one-row fit summary:
nobs, n_predictors, r.squared,
leave-one-out MSE, lambda, sigma, and
effective degrees of freedom from the kernel ridge.
augment(fit, data = ...) joins .fitted,
.resid, and one .dy_d_<predictor> column
per predictor (the pointwise derivatives) back to the data.
Methods are registered against the generics package
generics, so library(broom) makes them
discoverable.
ggplot2 autoplotautoplot(fit) returns a faceted ggplot of the pointwise
marginal-effect distribution, one panel per predictor, with the AME
overlaid in blue. Discoverable via library(ggplot2);
ggplot2 is in Suggests:.vignette("krls-quickstart", package = "KRLS") walks
through a small simulated example showing the AME-vs-pointwise
heterogeneity story end-to-end.lambdasearch() now returns a plain scalar instead of
a 1x1 matrix (it previously used ifelse() which inherits
the shape of its test argument). This eliminates the R 4.4+ deprecation
warning “Recycling array of length 1 in vector-array arithmetic is
deprecated” that fired during Eigenobject$values + lambda
in both solveforc() and krls()’s vcov
calculation. The numerical results are identical to 1.0-0 within
rounding.
plot.krls() no longer dispatches
UseMethod("summary") on a wrong- class input — that was a
copy-paste mistake. All three S3 methods (predict,
summary, plot) now stop() cleanly
when the input is not a krls object, with a clear message.
The class check itself is now inherits(x, "krls") rather
than class(x) != "krls" (the old pattern misbehaves when
the object has multiple classes).
krls(): removed dead code — pre-allocation of
Eigenobject$values and $vectors immediately
overwritten by eigen(), and several large commented-out
alternative implementations.solveforc(): trimmed dead alternatives, modernized
formatting.predict.krls(): removed dead commented
if(is.vector(newdata)) branch; fixed typo “standart errors”
-> “standard errors”.sd(y) == 0 check in krls()
removed (var(y) == 0 already covers it).Authors@R (Jens Hainmueller as cre+aut,
Chad Hazlett as aut) per current CRAN style.Imports: grDevices, graphics, stats added explicitly
(these were already used via the NAMESPACE; making the dependency
declaration explicit).Suggests: testthat (>= 3.0.0) added;
Config/testthat/edition: 3 added.URL field gains the GitHub repo and corrects the
Stanford URL. BugReports field added.Encoding: UTF-8 declared explicitly.These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.