crossfit: Cross-Fitting Engine for Double/Debiased ML

crossfit

A small cross-fitting engine for double / debiased machine learning and other meta-learners.

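The package lets you define nuisance models with fit and predict functions (create_nuisance()), target functions that consume nuisance predictions, and methods that set the folds, repetitions, and fold allocation (create_method()). It then runs a cross-fitting schedule with configurable aggregation over panels and repetitions.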
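
To make the idea of a cross-fitting schedule concrete, here is a minimal base-R sketch that does not use the package at all. It assumes plain K-fold splitting, a linear-model nuisance, and mean aggregation over folds and then over repetitions; the helper name manual_crossfit_mse() is hypothetical, and the package's own panels and evaluation windows may be organised differently.

```r
# Manual cross-fitting sketch in base R (does not call the crossfit package).
# manual_crossfit_mse() is a hypothetical helper written for this illustration.
manual_crossfit_mse <- function(data, folds = 4, repeats = 3) {
  n <- nrow(data)
  per_repeat <- numeric(repeats)
  for (r in seq_len(repeats)) {
    # Randomly assign each row to one of `folds` folds
    fold_id <- sample(rep(seq_len(folds), length.out = n))
    per_fold <- numeric(folds)
    for (k in seq_len(folds)) {
      train <- data[fold_id != k, , drop = FALSE]
      test  <- data[fold_id == k, , drop = FALSE]
      model <- lm(y ~ x, data = train)        # nuisance fitted on the training folds
      pred  <- predict(model, newdata = test) # nuisance predicted on the held-out fold
      per_fold[k] <- mean((test$y - pred)^2)  # target evaluated on the held-out fold
    }
    per_repeat[r] <- mean(per_fold)           # aggregate over folds
  }
  mean(per_repeat)                            # aggregate over repetitions
}

set.seed(1)
n <- 200
x <- rnorm(n)
toy <- data.frame(x = x, y = x + rnorm(n))
mse_hat <- manual_crossfit_mse(toy)
mse_hat
```

Because the noise in this toy problem has unit variance, the result should land near 1.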

Installation

You can install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("EtiennePeyrot/crossfit-R")

Then load it as usual:

library(crossfit)

Overview

crossfit is designed for settings where:

The engine:

Internally, the graph is normalized into a set of instances with structural signatures, so that identical models can share fits and be cached efficiently.
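
The caching idea can be illustrated with a small base-R sketch. This is not the package's internal implementation: the signature used here (the deparsed formula plus the sorted training row indices) and the helper make_fit_cache() are assumptions made for the example.

```r
# Signature-based fit cache sketch (illustrative only; not the package's internals).
# make_fit_cache() is a hypothetical helper.
make_fit_cache <- function() {
  cache  <- new.env(parent = emptyenv())
  n_fits <- 0L
  fit <- function(formula, data, train_idx) {
    # Signature = model structure + exact set of training rows
    key <- paste(paste(deparse(formula), collapse = ""),
                 paste(sort(train_idx), collapse = ","), sep = "|")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      n_fits <<- n_fits + 1L  # count only genuinely new fits
      assign(key, lm(formula, data = data[train_idx, , drop = FALSE]),
             envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
  list(fit = fit, n_fits = function() n_fits)
}

set.seed(1)
toy <- data.frame(x = rnorm(100))
toy$y <- toy$x + rnorm(100)
train_idx <- 1:75

fc <- make_fit_cache()
m1 <- fc$fit(y ~ x, toy, train_idx)          # new signature: fitted
m2 <- fc$fit(y ~ x, toy, train_idx)          # same signature: reused from the cache
m3 <- fc$fit(y ~ poly(x, 2), toy, train_idx) # different structure: fitted
fc$n_fits()                                  # 2 fits for 3 requests
```

Keying on both the model structure and the training rows is what makes reuse safe: a fit is shared only when it would have been identical anyway.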

Quick example: cross-fitted MSE

Here is a minimal example on a simple regression problem.
We define a nuisance \(m(x) = E[Y \mid X]\) and use the cross-fitted mean squared error of this nuisance as our target.

library(crossfit)

set.seed(1)
n <- 200
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

# 1) Nuisance: regression m(x) = E[Y | X]
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  }
)

# 2) Target: cross-fitted MSE of m(x)
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}

# 3) Method: use 4 folds, 3 repetitions, DML-style "independence" allocation
method <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

res <- crossfit(data, method)

str(res$estimates)
res$estimates[[1]]

The crossfit() call:

Multiple methods and shared nuisances

You can run several methods together, sharing some or all nuisances. For example, we can estimate both the cross-fitted MSE of m(x) and the mean of its cross-fitted predictions in a single call:

target_mean <- function(data, nuis_y, ...) {
  mean(nuis_y)
}

m_mse <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

m_mean <- create_method(
  target = target_mean,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 4,
  repeats = 3,
  eval_fold = 1,
  mode = "estimate",
  fold_allocation = "overlap",
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi <- crossfit_multi(
  data    = data,
  methods = list(mse = m_mse, mean = m_mean),
  aggregate_panels  = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi$estimates

The two methods share the fitted nuisance models whenever their structure and training folds coincide, which can save a lot of computation when you compare multiple learners or targets.

Predict mode: cross-fitted predictor

In "predict" mode, the engine returns a prediction function instead of a numeric estimate. This is useful if you want a cross-fitted predictor you can re-use on new data.
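
One way to picture a cross-fitted predictor is as an average of models that were each trained with a different fold held out. The base-R sketch below follows that picture under stated assumptions: plain K-fold splitting, a quadratic regression, and a simple mean over the fold-specific models. The helper manual_crossfit_predictor() is hypothetical, and the package's own fold allocation and aggregation may differ.

```r
# Cross-fitted predictor sketch in base R (does not call the crossfit package).
# manual_crossfit_predictor() is a hypothetical helper written for this illustration.
manual_crossfit_predictor <- function(data, folds = 4) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
  # One model per fold, each fitted with that fold held out
  models <- lapply(seq_len(folds), function(k) {
    lm(y ~ poly(x, 2), data = data[fold_id != k, , drop = FALSE])
  })
  # Return a closure that averages the fold-specific predictions
  function(newdata) {
    preds <- vapply(
      models,
      function(m) as.numeric(predict(m, newdata = newdata)),
      numeric(nrow(newdata))
    )
    # Force a rows-by-models matrix so single-row input also works
    rowMeans(matrix(preds, nrow = nrow(newdata)))
  }
}

set.seed(1)
n <- 200
x <- runif(n, -2, 2)
toy <- data.frame(x = x, y = sin(x) + rnorm(n, sd = 0.3))

f_manual <- manual_crossfit_predictor(toy)
f_manual(data.frame(x = c(-1, 0, 1)))
```

The returned object is an ordinary R function, so it can be stored and applied to any data frame that has an x column.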

Here we build a cross-fitted regression function:

library(crossfit)

set.seed(1)

# Toy nonlinear regression problem
n  <- 200
x  <- runif(n, -2, 2)
y  <- sin(x) + rnorm(n, sd = 0.3)
data <- data.frame(x = x, y = y)

# Two simple nuisances: linear and quadratic regressions
nuis_lin <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

nuis_quad <- create_nuisance(
  fit = function(data, ...) lm(y ~ poly(x, 2), data = data),
  predict = function(model, data, ...) {
    as.numeric(predict(model, newdata = data))
  },
  train_fold = 2
)

# Target in "predict" mode: ensemble of the two nuisances
target_ensemble <- function(data, m_lin, m_quad, ...) {
  0.5 * m_lin + 0.5 * m_quad
}

method_ens <- create_method(
  target        = target_ensemble,
  list_nuisance = list(m_lin  = nuis_lin,
                       m_quad = nuis_quad),
  folds         = 4,
  repeats       = 3,
  eval_fold     = 0, # no eval window in predict mode
  mode          = "predict",
  fold_allocation = "independence"
)

res <- crossfit_multi(
  data    = data,
  methods = list(ensemble = method_ens),
  aggregate_panels  = mean_predictor,
  aggregate_repeats = mean_predictor
)

# Cross-fitted ensemble predictor on new data
f_hat <- res$estimates$ensemble
newdata <- data.frame(x = seq(-2, 2, length.out = 5))
cbind(x = newdata$x, y_hat = f_hat(newdata))

Here:

Key functions

Further documentation

See:

?crossfit
?crossfit_multi
?create_method
?create_nuisance

You can find a more detailed introduction in the package vignette:

browseVignettes("crossfit")
# or directly:
vignette("crossfit-intro", package = "crossfit")

If you encounter a bug or have a feature request, please open an issue at: https://github.com/EtiennePeyrot/crossfit-R/issues.

License

crossfit is free software released under the GPL-3 license.
