Getting Started

Mark Rieke

2022-03-07

When creating models, the range of expected outcomes is often just as important as the most likely outcome. For example, a prediction that a house will have a price of $250,000 +/- $10,000 has a vastly different interpretation from a prediction of $250,000 +/- $50,000! Some models (like linear models) can output both point predictions and confidence intervals around each prediction (N.B. a confidence interval is not the same thing as a prediction interval), but other, often more powerful, models can only output point predictions.
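To make the distinction between the two kinds of interval concrete, here is a minimal base R sketch. It assumes the built-in mtcars dataset and a simple `mpg ~ wt` linear model purely for illustration; neither is part of the penguins example in this article.

```r
# Illustrative only: mtcars and the mpg ~ wt formula are assumptions,
# chosen because they ship with base R.
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)

# confidence interval: uncertainty about the *average* mpg of cars with wt = 3
conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)

# prediction interval: uncertainty about the mpg of *one new car* with wt = 3
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

# both share the same point prediction, but the prediction interval is wider
conf_width <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
pred_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])

conf_int
pred_int
```

The prediction interval is wider because it accounts for the scatter of individual observations around the fitted line, not just the uncertainty in the line itself.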

This is where bootstrap resampling can help! Creating n resamples of the original dataset (each drawn with replacement) allows us to fit n models, one for each resample. These models can then be used to predict on new data, giving a distribution of expected outcomes for each prediction.
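The idea can be sketched by hand in base R. This is a simplified illustration under stated assumptions (the built-in mtcars data, a plain `lm()` model, and 100 resamples), not a description of how {workboots} is implemented internally.

```r
# Illustrative only: mtcars, lm(), and n_boots = 100 are assumptions.
set.seed(123)

n_boots <- 100
new_car <- data.frame(wt = 3)

# for each resample: draw rows with replacement, refit the model,
# and predict on the same new observation
boot_preds <- vapply(
  seq_len(n_boots),
  function(i) {
    rows <- sample(nrow(mtcars), replace = TRUE)
    boot_fit <- lm(mpg ~ wt, data = mtcars[rows, ])
    unname(predict(boot_fit, newdata = new_car))
  },
  numeric(1)
)

# the n predictions form a distribution for this one new observation
boot_summary <- quantile(boot_preds, probs = c(0.025, 0.5, 0.975))
boot_summary
```

Each refit sees a slightly different dataset, so the spread of `boot_preds` reflects how sensitive the prediction is to the training data.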

{workboots} is a tidy implementation of this solution written around the core function predict_boots(). Pass an untrained workflow object to predict_boots() to return a tibble of nested predictions for each observation.

Generating point predictions

Let’s work through a motivating example of predicting a penguin’s weight (body_mass_g) from other characteristics using the Palmer Penguins dataset.

library(tidymodels)

data("penguins")

penguins <- 
  penguins %>%
  drop_na()

penguins
#> # A tibble: 333 x 7
#>    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#>    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
#>  1 Adelie  Torgersen           39.1          18.7               181        3750
#>  2 Adelie  Torgersen           39.5          17.4               186        3800
#>  3 Adelie  Torgersen           40.3          18                 195        3250
#>  4 Adelie  Torgersen           36.7          19.3               193        3450
#>  5 Adelie  Torgersen           39.3          20.6               190        3650
#>  6 Adelie  Torgersen           38.9          17.8               181        3625
#>  7 Adelie  Torgersen           39.2          19.6               195        4675
#>  8 Adelie  Torgersen           41.1          17.6               182        3200
#>  9 Adelie  Torgersen           38.6          21.2               191        3800
#> 10 Adelie  Torgersen           34.6          21.1               198        4400
#> # ... with 323 more rows, and 1 more variable: sex <fct>

XGBoost is a powerful model architecture, but it can only generate point predictions. To generate single estimates for each penguin’s weight, we can create and fit a workflow (some useful resources for how to use {tidymodels} include the tidymodels package site and the book Tidy Modeling with R).

# split data into training and testing sets
set.seed(123)
penguins_split <- initial_split(penguins)
penguins_test <- testing(penguins_split)
penguins_train <- training(penguins_split)

# create a workflow
penguins_wf <-
  workflow() %>%
  
  # add preprocessing steps
  add_recipe(
    recipe(body_mass_g ~ ., data = penguins_train) %>%
      step_dummy(all_nominal_predictors())
  ) %>%
  
  # add xgboost model specification
  add_model(
    boost_tree(mode = "regression") %>% set_engine("xgboost")
  )

# fit to training data & predict on test data
set.seed(234)
penguins_preds <-
  penguins_wf %>%
  fit(penguins_train) %>%
  predict(penguins_test)

As mentioned above, XGBoost models only generate point predictions.

penguins_preds %>%
  bind_cols(penguins_test) %>%
  ggplot(aes(x = body_mass_g,
             y = .pred)) +
  geom_point() +
  # draw the y = x reference segment once, rather than once per row via aes()
  annotate("segment",
           x = 3000, xend = 6000,
           y = 3000, yend = 6000,
           linetype = "dashed",
           color = "gray") +
  labs(title = "Single XGBoost Model Predictions")

Using {workboots} to generate prediction intervals

With {workboots}, however, we can generate a prediction distribution for each penguin’s weight in the test set! To do so, we’ll pass our workflow to predict_boots(), which will return a nested tibble with a set of predictions for each penguin in the penguins_test set.

library(workboots)

# create 100 models from bootstrap resamples and make predictions on the test set
set.seed(345)
penguins_preds_boot <- 
  penguins_wf %>%
  predict_boots(
    n = 100,
    training_data = penguins_train,
    new_data = penguins_test
  )

penguins_preds_boot
#> # A tibble: 84 x 2
#>    rowid .preds            
#>    <int> <list>            
#>  1     1 <tibble [100 x 2]>
#>  2     2 <tibble [100 x 2]>
#>  3     3 <tibble [100 x 2]>
#>  4     4 <tibble [100 x 2]>
#>  5     5 <tibble [100 x 2]>
#>  6     6 <tibble [100 x 2]>
#>  7     7 <tibble [100 x 2]>
#>  8     8 <tibble [100 x 2]>
#>  9     9 <tibble [100 x 2]>
#> 10    10 <tibble [100 x 2]>
#> # ... with 74 more rows

Each set of nested predictions can be summarised into a point prediction along with the lower and upper bounds of a 95% prediction interval (this uses the quantile() function under the hood).

penguins_preds_boot %>%
  summarise_predictions()
#> # A tibble: 84 x 5
#>    rowid .preds             .pred_lower .pred .pred_upper
#>    <int> <list>                   <dbl> <dbl>       <dbl>
#>  1     1 <tibble [100 x 2]>       3296. 3469.       3799.
#>  2     2 <tibble [100 x 2]>       3307. 3528.       3825.
#>  3     3 <tibble [100 x 2]>       3369. 3617.       3913.
#>  4     4 <tibble [100 x 2]>       3799. 4129.       4492.
#>  5     5 <tibble [100 x 2]>       3662. 3899.       4102.
#>  6     6 <tibble [100 x 2]>       3258. 3522.       3819.
#>  7     7 <tibble [100 x 2]>       3281. 3450.       3582.
#>  8     8 <tibble [100 x 2]>       3736. 4073.       4340.
#>  9     9 <tibble [100 x 2]>       3221. 3453.       3616.
#> 10    10 <tibble [100 x 2]>       3195. 3388.       3611.
#> # ... with 74 more rows
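The quantile-based summary can be illustrated on its own with base R. The sketch below is an assumption-laden stand-in rather than the {workboots} source: the bootstrap predictions are simulated, `summarise_one()` and its `interval_width` argument are hypothetical names, and using the median as the point prediction is an assumption.

```r
# Illustrative only: simulated predictions and a hypothetical helper.
set.seed(1)

# stand-in for 100 bootstrap predictions for each of two observations
boot_preds <- list(
  rnorm(100, mean = 3500, sd = 120),
  rnorm(100, mean = 4100, sd = 150)
)

# hypothetical helper: lower bound, median, and upper bound from quantile()
summarise_one <- function(preds, interval_width = 0.95) {
  tail_prob <- (1 - interval_width) / 2
  bounds <- quantile(preds, probs = c(tail_prob, 0.5, 1 - tail_prob), names = FALSE)
  data.frame(.pred_lower = bounds[1], .pred = bounds[2], .pred_upper = bounds[3])
}

# one summary row per observation
summary_tbl <- cbind(
  rowid = seq_along(boot_preds),
  do.call(rbind, lapply(boot_preds, summarise_one))
)
summary_tbl
```

With `interval_width = 0.95` the bounds are the 2.5% and 97.5% quantiles of each observation's bootstrap predictions.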

This allows us to include a prediction interval along with our point predictions!

penguins_preds_boot %>%
  summarise_predictions() %>%
  bind_cols(penguins_test) %>%
  ggplot(aes(x = body_mass_g,
             y = .pred,
             ymin = .pred_lower,
             ymax = .pred_upper)) +
  geom_abline(linetype = "dashed",
              color = "gray") + 
  geom_errorbar(alpha = 0.5) +
  geom_point(alpha = 0.5) +
  labs(title = "XGBoost Model Prediction Intervals from Bootstrap Resampling",
       subtitle = "Error bars represent the 2.5/97.5% quantiles")