
Basic Examples

library(nadir)

Let’s start with an extremely simple example: a prediction problem on a continuous outcome, where we want to use cross-validation to minimize the expected risk/loss on held-out data across a few different models.

We’ll use the iris dataset to do this.

nadir::super_learner() strives to keep the syntax simple, so the simplest call to super_learner() might look something like this:

super_learner(
  data = iris,
  formula = Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
  learners = list(lnr_lm, lnr_rf, lnr_earth, lnr_mean))
#> $predict
#> function (newdata) 
#> {
#>     Reduce(`+`, x = future_lapply(1:length(fit_learners), function(i) {
#>         fit_learners[[i]](newdata) * learner_weights[[i]]
#>     }, future.seed = TRUE))
#> }
#> <bytecode: 0x1492f1408>
#> <environment: 0x1492f38d8>
#> 
#> $y_variable
#> [1] "Petal.Width"
#> 
#> $outcome_type
#> [1] "continuous"
#> 
#> $learner_weights
#>        lm        rf     earth      mean 
#> 0.5769071 0.4230929 0.0000000 0.0000000 
#> 
#> $holdout_predictions
#> # A tibble: 150 × 6
#>    .sl_fold      lm    rf earth  mean Petal.Width
#>       <int>   <dbl> <dbl> <dbl> <dbl>       <dbl>
#>  1        1  0.425  0.329 1.83   1.20         0.4
#>  2        1  0.143  0.250 1.43   1.20         0.1
#>  3        1  0.0913 0.227 0.959  1.20         0.1
#>  4        1  0.0996 0.599 1.68   1.20         0.2
#>  5        1  0.319  0.220 1.64   1.20         0.4
#>  6        1  0.417  0.293 1.71   1.20         0.1
#>  7        1  0.203  0.191 1.55   1.20         0.2
#>  8        1 -0.0406 0.467 1.20   1.20         0.3
#>  9        1  0.232  0.213 1.02   1.20         0.2
#> 10        1  0.346  0.221 1.63   1.20         0.6
#> # ℹ 140 more rows
#> 
#> attr(,"class")
#> [1] "nadir_sl_model"

Notice what it returns: a nadir_sl_model object whose $predict element is a function of newdata that runs each fit learner on the new data, combines their predictions according to the learned weights, and returns the ensemble predictions.
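The weighted-combination step can be sketched in a few lines of base R. This is an illustration with made-up numbers, not nadir’s internals: `preds` stands in for per-learner predictions and `weights` for the learned learner weights.

```r
# Hypothetical per-learner predictions on two new observations
preds <- list(lm = c(0.30, 1.20), rf = c(0.40, 1.10))
# Hypothetical learned weights (sum to 1)
weights <- c(lm = 0.58, rf = 0.42)

# The ensemble prediction is the weight-scaled sum of learner predictions
ensemble <- Reduce(`+`, lapply(names(preds), function(nm) preds[[nm]] * weights[[nm]]))
ensemble
#> [1] 0.342 1.158
```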

We can store that learned predictor function and use it:

# We recommend storing more complicated arguments used repeatedly to simplify 
# the call to super_learner()
petal_formula <- Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width
learners <- list(lnr_lm, lnr_rf, lnr_earth, lnr_mean)

sl_model <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)

In particular, we can use it to predict on the same dataset,

predict(sl_model, iris) |> head()
#>         1         2         3         4         5         6 
#> 0.2274732 0.1725615 0.1903219 0.2566321 0.2482329 0.3803236

On a random sample of it,

predict(sl_model, iris[sample.int(size = 10, n = nrow(iris)), ]) |> 
  head()
#>       128        45        25         2        97       106 
#> 1.7339317 0.4793721 0.4237248 0.1725615 1.3776838 2.2293237

Or on completely new data.

fake_iris_data <- data.frame(
  Sepal.Length = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Length),
    sd = sd(iris$Sepal.Length)
  ),
  Sepal.Width = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Width),
    sd = sd(iris$Sepal.Width)
  ),
  Petal.Length = rnorm(
    n = 6,
    mean = mean(iris$Petal.Length),
    sd = sd(iris$Petal.Length)
  )
)

predict(sl_model, fake_iris_data) |> 
  head()
#>         1         2         3         4         5         6 
#> 1.1015928 1.7899799 1.1575923 0.8751806 0.8710849 0.5649973

Getting More Information Out

If we want to know more about the super_learner() process — how it weighted the candidate learners, what the candidate learners predicted on the held-out data, etc. — we can look at the other metadata contained in the nadir_sl_model object it produces.

sl_model_iris <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)

str(sl_model_iris, max.level = 2)
#> List of 5
#>  $ predict            :function (newdata)  
#>  $ y_variable         : chr "Petal.Width"
#>  $ outcome_type       : chr "continuous"
#>  $ learner_weights    : Named num [1:4] 0.521 0.479 0 0
#>   ..- attr(*, "names")= chr [1:4] "lm" "rf" "earth" "mean"
#>  $ holdout_predictions: tibble [150 × 6] (S3: tbl_df/tbl/data.frame)
#>  - attr(*, "class")= chr "nadir_sl_model"

To put some description to what’s contained in the output from super_learner():

- $predict: the learned ensemble prediction function of newdata.
- $y_variable: the name of the outcome variable.
- $outcome_type: the type of outcome ("continuous" here).
- $learner_weights: the weight assigned to each candidate learner.
- $holdout_predictions: each candidate learner’s predictions on held-out data, organized by cross-validation fold.

We can call compare_learners() on the fitted nadir_sl_model if we want to assess how the different candidate learners performed. We can also call cv_super_learner() with the same arguments as super_learner() to wrap the super_learner() call in another layer of cross-validation, which assesses how super_learner() itself performs on held-out data.

compare_learners(sl_model_iris)
#> Inferring the loss metric for learner comparison based on the outcome type:
#> outcome_type=continuous -> using mean squared error
#> # A tibble: 1 × 4
#>       lm     rf earth  mean
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1 0.0373 0.0391  1.69 0.590
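In spirit, those per-learner losses are just the mean squared error of each learner’s held-out predictions against the observed outcome, as stored in $holdout_predictions. A minimal sketch with made-up values (the real tibble has 150 rows and all four learners):

```r
# Hypothetical slice of holdout predictions: two learners plus the outcome
holdout <- data.frame(
  lm          = c(0.42, 0.14, 0.09),
  rf          = c(0.33, 0.25, 0.23),
  Petal.Width = c(0.40, 0.10, 0.10)
)

# MSE of each learner's held-out predictions against the observed outcome
mse <- sapply(holdout[c("lm", "rf")], function(p) mean((p - holdout$Petal.Width)^2))
mse
```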

cv_super_learner(
  data = iris, 
  formula = petal_formula,
  learners = learners)$cv_loss
#> The loss_metric is being inferred based on the outcome_type=continuous -> using CV-MSE
#> [1] 0.03374206

We can, of course, do anything with a super learned model that we would do with a conventional prediction model, such as computing performance statistics like \(R^2\).

var_residuals <- var(iris$Petal.Width - predict(sl_model_iris, iris))
total_variance <- var(iris$Petal.Width)
variance_explained <- total_variance - var_residuals 

rsquared <- variance_explained / total_variance
print(rsquared)
