The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.
This vignette elaborates and demonstrates the regression paradigm explained in Olsen et al. (2024). We describe how to specify the regression model, how to enable automatic cross-validation of the model’s hyperparameters, and applying pre-processing steps to the data before fitting the regression models. We refer to Olsen et al. (2024) for when one should use the different paradigms, method classes, and methods.
Olsen et al. (2024) divides the regression paradigm into the separate and surrogate regression method classes. In this vignette, we briefly introduce the two method classes. For an in-depth explanation, we refer the reader to Sections 3.5 and 3.6 in Olsen et al. (2024).
Briefly stated, the regression paradigm uses regression models to directly estimate the contribution function \(v(S) = E[f(\boldsymbol{x})|\boldsymbol{x}_S = \boldsymbol{x}_S^*]\). The separate regression method class fits a separate regression model for each coalition \(S\), while the surrogate regression method class fits a single regression model to simultaneously predict the contribution function for all coalitions.
The shapr
package supports any regression model from the
popular tidymodels
package developed by Kuhn and Wickham (2020). The tidymodels
framework
is a collection of packages for modeling and machine learning using tidyverse
principles.
Some packages included in the tidymodels
framework are
parsnip
, recipes
, workflows
tune
, and rsample
; see the setup section below for more examples. Furthermore,
click here to
access the complete list of supported regression models in the
tidymodels
package. There are currently 80 supported
models, but the framework also allows adding regression models not
already implemented in tidymodels
. It is also possible to
apply a wide range of pre-processing data steps. For instance, we can
either apply the linear regression model directly to the data or
pre-process the data to compute principal components (principal
component regression) before fitting the linear regression to the first
few eigenvectors (processed features), see the pre-process section for an example. In the
add new regression methods section, we demonstrate
how to incorporate the projection pursuit regression model into the
tidymodels
framework.
Note that our framework does not currently support model formulas
with special terms. For example, we do not support
parsnip::gen_additive_mod
(i.e., mgcv::gam()
)
as it uses a non-standard notion in its formulas (in this case, the
s(feature, k = 2)
function). See
?parsnip::model_formula()
for more information. However,
this hurdle is overcome by pre-processing data steps containing spline
functions, which we showcase in the pre-process section for the separate
regression method class.
In the mixed data section, we demonstrate that the regression-based methods work on mixed data, too. However, we must add a pre-processing step for the regression models that do not natively support categorical data to encode the categorical features.
We use the same data and predictive models in this vignette as in the general usage.
See the end of the continious data and mixed data sections for summary figures of all the methods used in this vignette to compute the Shapley value explanations.
In the regression_separate
methods, we train a new
regression model \(g_S(\boldsymbol{x}s)\) to estimate the
conditional expectation for each coalition of features.
The idea is to estimate \(v(S) = E[f(\boldsymbol{x})|\boldsymbol{x}_S = \boldsymbol{x}_S^*] = E[f(\boldsymbol{x}_{\bar{S}},\boldsymbol{x}_S)|\boldsymbol{x}_S=\boldsymbol{x}_S^*]\) separately for each coalition \(S\) using regression. Let \(\mathcal{D} = \{ \boldsymbol{x}^{[i]}, y^{[i]} \}_{i=1}^{N_{\text{train}}}\) denote the training data, where \(\boldsymbol{x}^{[i]}\) is the \(i\)th \(M\)-dimensional input and \(y^{[i]}\) is the associated response. For each coalition \(S \subseteq \{1,2,\dots,M\}\), the corresponding training data set is \[\begin{align*} \mathcal{D}_S = \{\boldsymbol{x}_S^{[i]}, f(\underbrace{\boldsymbol{x}_\bar{S}^{[i]}, \boldsymbol{x}_S^{[i]}}_{\boldsymbol{x}^{[i]}})\}_{i=1}^{N_{\text{train}}} = \{\boldsymbol{x}_S^{[i]}, \underbrace{f(\boldsymbol{x}^{[i]})}_{z^{[i]}}\}_{i=1}^{N_{\text{train}}} = \{\boldsymbol{x}_S^{[i]}, z^{[i]}\}_{i=1}^{N_{\text{train}}}. \end{align*}\]
For each data set \(\mathcal{D}_S\), we train a regression model \(g_S(\boldsymbol{x}s)\) with respect to the mean squared error loss function. That is, we fit a regression model where the prediction \(f(\boldsymbol{x})\) is acting as the response and the feature subset of coalition \(S\), \(\boldsymbol{x}_S\), is acting as the available features. The optimal model, with respect to the loss function, is \(g^*_S(\boldsymbol{x}_S) = E[z|\boldsymbol{x}_S] = E[f(\boldsymbol{x}_\bar{S}, \boldsymbol{x}_S)|\boldsymbol{x}_S]\), which corresponds to the contribution function \(v(S)\). The regression model \(g_S\) aims for the optimal, hence, it resembles/estimates the contribution function, i.e., \(g_S(\boldsymbol{x}_S) = \hat{v}(S) \approx v(S) = E[f(\boldsymbol{x}_\bar{S}, \boldsymbol{x}_S) | \boldsymbol{x}_S = \boldsymbol{x}_S^*]\).
In this supplementary vignette, we use the same data and explain the
same model type as in the general usage. We train a simple
xgboost
model on the airquality
dataset and
demonstrate how to use the shapr
and the separate
regression method class to explain the individual predictions.
First, we set up the airquality
dataset and train an
xgboost
model, whose predictions we want to explain using
the Shapley value explanation framework. We import all packages in the
tidymodels
framework in the code chunk below, but we could
have specified them directly, too. In this vignette, we use the
following packages in the tidymodels
framework:
parsnip
, recipes
, workflows
,
dials
, hardhat
, tibble
,
rlang
, and ggplot2
. We include the
package::function()
notation throughout this vignette to
indicate which package the functions originate from in the
tidymodels
framework.
# Either use `library(tidymodels)` or separately specify the libraries indicated above
library(tidymodels)
library(shapr)
# Ensure that shapr's functions are prioritzed, otherwise we need to use the `shapr::`
# prefix when calling explain(). The `conflicted` package is imported by `tidymodels`.
conflicted::conflicts_prefer(shapr::explain, shapr::prepare_data)
# Other libraries
library(xgboost)
library(data.table)
data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
ind_x_explain <- 1:20
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]
# Fitting a basic xgboost model to the training data
set.seed(123) # Set seed for reproducibility
model <- xgboost::xgboost(
data = as.matrix(x_train),
label = y_train,
nround = 20,
verbose = FALSE
)
# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)
# List to store all the explanation objects
explanation_list <- list()
To make the rest of the vignette easier to follow, we create some helper functions that plot and summarize the results of the explanation methods. This code block is optional to understand and can be skipped.
# Plot the MSEv criterion scores as horizontal bars and add dashed line of one method's score
plot_MSEv_scores <- function(explanation_list, method_line = NULL) {
fig <- plot_MSEv_eval_crit(explanation_list) +
ggplot2::theme(legend.position = "none") +
ggplot2::coord_flip() +
ggplot2::theme(plot.title = ggplot2::element_text(size = rel(0.95)))
fig <- fig + ggplot2::scale_x_discrete(limits = rev(levels(fig$data$Method)))
if (!is.null(method_line) && method_line %in% fig$data$Method) {
fig <- fig + ggplot2::geom_hline(
yintercept = fig$data$MSEv[fig$data$Method == method_line],
linetype = "dashed",
color = "black"
)
}
return(fig)
}
# Extract the MSEv criterion scores and elapsed times
print_MSEv_scores_and_time <- function(explanation_list) {
res <- as.data.frame(t(sapply(
explanation_list,
function(explanation) {
round(c(explanation$MSEv$MSEv$MSEv, explanation$timing$total_time_secs), 2)
}
)))
colnames(res) <- c("MSEv", "Time")
return(res)
}
# Extract the k best methods in decreasing order
get_k_best_methods <- function(explanation_list, k_best) {
res <- print_MSEv_scores_and_time(explanation_list)
return(rownames(res)[order(res$MSEv)[seq(k_best)]])
}
To establish a baseline against which to compare the regression
methods, we will compare them with the Monte Carlo-based
empirical
approach with default hyperparameters. In the
last section, we include all Monte Carlo-based methods implemented in
shapr
to make an extensive comparison.
# Compute the Shapley value explanations using the empirical method
explanation_list$MC_empirical <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:36 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: empirical
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7864f7b70f7.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Then we compute the Shapley value explanations using a linear regression model and the separate regression method class.
explanation_list$sep_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::linear_reg()
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:40 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78629466589.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
A linear model is often not flexible enough to properly model the
contribution function. Thus, it can produce inaccurate Shapley value
explanations. The figure below shows that the empirical
approach outperforms the linear regression model approach quite
significantly concerning the \(\operatorname{MSE}_v\) evaluation
criterion.
This section describes how to pre-process the data before fitting the separate regression models. We demonstrate this for the linear regression model, but we can apply this pre-processing to other regression methods.
The recipe
package in the tidymodels
framework contains many functions to pre-process the data before fitting
the model, for example, normalization, interaction, encodings, and
transformations (e.g., log, splines, pls, pca). Click here to
access a complete list of all available functions. The list also
contains functions for helping us select which features to apply the
functions to, e.g., recipes::all_predictors()
,
recipes::all_numeric_predictors()
, and
recipes::all_factor_predictors()
apply the functions to all
features, only the numerical features, and only the factor features,
respectively. We can also specify the names of the features to which the
functions are applied. However, as the included features change in each
coalition, we need to check that the feature we want to apply the
function to is present in the dataset. We give an example of this
below.
First, we demonstrate how to compute the principal components and use (up to) the first two components for each separate linear regression model. We write “up to” as we can only compute a single principal component for the singleton coalitions, i.e., the feature itself. This regression model is called principal component regression.
explanation_list$sep_pcr <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = function(regression_recipe) {
return(recipes::step_pca(regression_recipe, recipes::all_numeric_predictors(), num_comp = 2))
}
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:41 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78667333f80.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Second, we apply a pre-processing step that computes the basis expansions of the features using natural splines with two degrees of freedom. This is similar to fitting a generalized additive model.
explanation_list$sep_splines <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = function(regression_recipe) {
return(recipes::step_ns(regression_recipe, recipes::all_numeric_predictors(), deg_free = 2))
}
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:42 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78652874079.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Finally, we provide an example where we include interactions between
the features Solar.R
and Wind
, log-transform
Solar.R
, convert Wind
to be between 0 and 1
and then take the square root, include polynomials of the third degree
for Temp
, and apply the Box-Cox transformation to
Month
. These transformations are only applied when the
features are present for the different separate models.
Furthermore, we stress that the purpose of this example is to highlight the framework’s flexibility, not that the transformations below are reasonable.
# Example function of how to apply step functions from the recipes package to specific features
regression.recipe_func <- function(recipe) {
# Get the names of the present features
feature_names <- recipe$var_info$variable[recipe$var_info$role == "predictor"]
# If Solar.R and Wind is present, then we add the interaction between them
if (all(c("Solar.R", "Wind") %in% feature_names)) {
recipe <- recipes::step_interact(recipe, terms = ~ Solar.R:Wind)
}
# If Solar.R is present, then log transform it
if ("Solar.R" %in% feature_names) recipe <- recipes::step_log(recipe, Solar.R)
# If Wind is present, then scale it to be between 0 and 1 and then sqrt transform it
if ("Wind" %in% feature_names) recipe <- recipes::step_sqrt(recipes::step_range(recipe, Wind))
# If Temp is present, then expand it using orthogonal polynomials of degree 3
if ("Temp" %in% feature_names) recipe <- recipes::step_poly(recipe, Temp, degree = 3)
# If Month is present, then Box-Cox transform it
if ("Month" %in% feature_names) recipe <- recipes::step_BoxCox(recipe, Month)
# Finally we normalize all features (not needed as LM does this internally)
recipe <- recipes::step_normalize(recipe, recipes::all_numeric_predictors())
return(recipe)
}
# Compute the Shapley values using the pre-processing steps defined above
explanation_list$sep_reicpe_example <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = regression.recipe_func
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:43 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7862f1e5e2d.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
We can examine the \(\operatorname{MSE}_v\) evaluation scores, and we see that the method using natural splines significantly outperforms the other methods.
In the following example, we use a decision tree model instead of the simple linear regression model.
The tidymodels
framework supports several
implementations of the decision tree model. We use
set_engine("rpart")
to specify that we want to use the
implementation in the rpart
package, and we use
set_mode("regression")
to specify that we are doing
regression. The tidymodels
framework uses the default
hyperparameter values set in rpart
when we do not specify
them. By searching for “decision tree” in the list of tidymodels,
we see that the default hyperparameter values for the decision_tree_rpart
model are tree_depth = 30
, min_n = 2
, and
cost_complexity = 0.01
.
# Decision tree with specified parameters (stumps)
explanation_list$sep_tree_stump <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::decision_tree(
tree_depth = 1,
min_n = 2,
cost_complexity = 0.01,
engine = "rpart",
mode = "regression"
)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:44 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786b5a978d.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Decision tree with default parameters
explanation_list$sep_tree_default <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::decision_tree(engine = "rpart", mode = "regression")
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:45 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7867004239c.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
We can also set
regression.model = parsnip::decision_tree(tree_depth = 1, min_n = 2, cost_complexity = 0.01) %>% parsnip::set_engine("rpart") %>% parsnip::set_mode("regression")
if we want to use the pipe function (%>%
).
We can now compare the two new methods. The decision tree with default parameters outperforms the linear model approach concerning the \(\operatorname{MSE}_v\) criterion and is on the same level as the empirical approach. We obtained a worse method by using stumps, i.e., trees with depth one.
# Compare the MSEv criterion of the different explanation methods
plot_MSEv_scores(explanation_list, method_line = "MC_empirical")
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_empirical 179.43 3.35
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
Another option is to use cross-validation to tune the hyperparameters. To do this, we need to specify three things:
regression.model
, we need to specify which
parameters to tune in the model. We do this by setting the parameter
equal to hardhat::tune()
. For example., if we want to tune
the tree_depth
parameter in the
parsnip::decision_tree
model while using default parameters
for the other parameters, then we set
parsnip::decision_tree(tree_depth = hardhat::tune())
.regression.tune_values
, we must provide either a
data.frame (can also be a data.table or tibble) containing the possible
hyperparameter values or a function that takes in the training data for
each combination/coalition and outputs a data.frame containing the
possible hyperparameter values. The latter allows us to use different
hyperparameter values for different coalition sizes, which is essential
if a hyperparameter’s domain changes with the coalition size. For
example, see the example below where we want to tune the
mtry
parameter in ranger
(random forest). The
column names of regression.tune_values
(or the output if it
is a function) must match the tunable hyperparameters specified in
regression.model
. For the example above,
regression.tune_values
must be a one-column data.frame with
the column name tree_depth
. We can either manually specify
the hyperparameter values or use the dials
package, e.g.,
dials::grid_regular(dials::tree_depth(), levels = 5)
. Or it
can be a function that outputs a data.frame on the same form.regression.vfold_cv_para
parameter is
optional. If used, then regression.vfold_cv_para
must be a
list specifying the parameters to send to the cross-validation function
rsample::vfold_cv()
. Use ?rsample::vfold_cv
to
see the default parameters. The names of the objects in the
regression.vfold_cv_para
list must match the parameter
names in rsample::vfold_cv()
. For example, if we want
5-fold cross-validation, we set
regression.vfold_cv_para = list(v = 5)
.First, let us look at some ways to specify
regression.tune_values
. Note that dials
have
several other grid functions, e.g., dials::grid_random()
and dials::grid_latin_hypercube()
.
# Possible ways to define the `regression.tune_values` object.
# function(x) dials::grid_regular(dials::tree_depth(), levels = 4)
dials::grid_regular(dials::tree_depth(), levels = 4)
data.table(tree_depth = c(1, 5, 10, 15)) # Can also use data.frame or tibble
# For several features
# function(x) dials::grid_regular(dials::tree_depth(), dials::cost_complexity(), levels = 3)
dials::grid_regular(dials::tree_depth(), dials::cost_complexity(), levels = 3)
expand.grid(tree_depth = c(1, 3, 5), cost_complexity = c(0.001, 0.05, 0.01))
We will now demonstrate how to use cross-validation to fine-tune the
separate decision tree regression method. In the following examples, we
consider two versions. In the first example, we use cross-validation to
tune the tree_depth
parameter using the
dials::grid_regular()
function. In the second example, we
tune both the tree_depth
and cost_complexity
parameters, but we will manually specify the possible hyperparameter
values this time.
# Decision tree with cross validated depth (default values other parameters)
explanation_list$sep_tree_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::decision_tree(
tree_depth = hardhat::tune(), engine = "rpart", mode = "regression"
),
regression.tune_values = dials::grid_regular(dials::tree_depth(), levels = 4),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:46 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7862e01b595.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Use trees with cross-validation on the depth and cost complexity. Manually set the values.
explanation_list$sep_tree_cv_2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::decision_tree(
tree_depth = hardhat::tune(),
cost_complexity = hardhat::tune(),
engine = "rpart",
mode = "regression"
),
regression.tune_values =
expand.grid(tree_depth = c(1, 3, 5), cost_complexity = c(0.001, 0.01, 0.1)),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:43:59 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78615cc7432.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
We also include one example with a random forest model where the
tunable hyperparameter mtry
depends on the coalition size.
Thus, regression.tune_values
must be a function that
returns a data.frame where the hyperparameter values for
mtry
will change based on the coalition size. If we do not
let regression.tune_values
be a function, then
tidymodels
will crash for any mtry
higher than
1. Furthermore, by setting letting
"vS_details" %in% verbose
, we receive get messages with the
results of the cross-validation procedure ran within shapr
.
Note that the tested hyperparameter value combinations change based on
the coalition size.
# Using random forest with default parameters
explanation_list$sep_rf <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::rand_forest(engine = "ranger", mode = "regression")
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:44:24 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7863390470d.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using random forest with parameters tuned by cross-validation
explanation_list$sep_rf_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
verbose = c("basic","vS_details"), # To get printouts
approach = "regression_separate",
regression.model = parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = "ranger", mode = "regression"
),
regression.tune_values =
function(x) {
dials::grid_regular(dials::mtry(c(1, ncol(x))), dials::trees(c(50, 750)), levels = 3)
},
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:44:26 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7865f901637.rds'
#>
#> ── Additional details about the regression model
#> Random Forest Model Specification (regression)
#>
#> Main Arguments: mtry = hardhat::tune() trees = hardhat::tune()
#>
#> Computational engine: ranger
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
#>
#> ── Extra info about the tuning of the regression model ──
#>
#> ── Top 6 best configs for v(1 4) (using 5-fold CV)
#> #1: mtry = 1 trees = 50 rmse = 28.43 rmse_std_err = 3.02
#> #2: mtry = 1 trees = 750 rmse = 28.76 rmse_std_err = 2.57
#> #3: mtry = 1 trees = 400 rmse = 28.80 rmse_std_err = 2.64
#> #4: mtry = 2 trees = 50 rmse = 29.27 rmse_std_err = 2.29
#> #5: mtry = 2 trees = 400 rmse = 29.42 rmse_std_err = 2.40
#> #6: mtry = 2 trees = 750 rmse = 29.46 rmse_std_err = 2.20
#>
#> ── Top 6 best configs for v(2 4) (using 5-fold CV)
#> #1: mtry = 1 trees = 50 rmse = 21.12 rmse_std_err = 0.73
#> #2: mtry = 1 trees = 750 rmse = 21.21 rmse_std_err = 0.66
#> #3: mtry = 2 trees = 400 rmse = 21.27 rmse_std_err = 1.02
#> #4: mtry = 2 trees = 750 rmse = 21.31 rmse_std_err = 1.01
#> #5: mtry = 1 trees = 400 rmse = 21.34 rmse_std_err = 0.69
#> #6: mtry = 2 trees = 50 rmse = 21.65 rmse_std_err = 0.94
#>
#> ── Top 6 best configs for v(1 3) (using 5-fold CV)
#> #1: mtry = 1 trees = 50 rmse = 21.34 rmse_std_err = 3.18
#> #2: mtry = 1 trees = 400 rmse = 21.56 rmse_std_err = 3.13
#> #3: mtry = 1 trees = 750 rmse = 21.68 rmse_std_err = 3.13
#> #4: mtry = 2 trees = 50 rmse = 21.79 rmse_std_err = 3.10
#> #5: mtry = 2 trees = 750 rmse = 21.85 rmse_std_err = 2.98
#> #6: mtry = 2 trees = 400 rmse = 21.89 rmse_std_err = 2.97
#>
#> ── Top 6 best configs for v(3 4) (using 5-fold CV)
#> #1: mtry = 1 trees = 750 rmse = 22.94 rmse_std_err = 4.33
#> #2: mtry = 1 trees = 400 rmse = 23.13 rmse_std_err = 4.23
#> #3: mtry = 1 trees = 50 rmse = 23.43 rmse_std_err = 4.13
#> #4: mtry = 2 trees = 400 rmse = 23.86 rmse_std_err = 3.77
#> #5: mtry = 2 trees = 750 rmse = 24.00 rmse_std_err = 3.78
#> #6: mtry = 2 trees = 50 rmse = 24.57 rmse_std_err = 4.08
#>
#> ── Top 6 best configs for v(2 3) (using 5-fold CV)
#> #1: mtry = 2 trees = 50 rmse = 17.46 rmse_std_err = 2.26
#> #2: mtry = 2 trees = 750 rmse = 17.53 rmse_std_err = 2.43
#> #3: mtry = 2 trees = 400 rmse = 17.64 rmse_std_err = 2.38
#> #4: mtry = 1 trees = 750 rmse = 17.80 rmse_std_err = 2.09
#> #5: mtry = 1 trees = 50 rmse = 17.81 rmse_std_err = 1.79
#> #6: mtry = 1 trees = 400 rmse = 17.89 rmse_std_err = 2.13
#>
#> ── Top 3 best configs for v(3) (using 5-fold CV)
#> #1: mtry = 1 trees = 50 rmse = 22.55 rmse_std_err = 4.68
#> #2: mtry = 1 trees = 400 rmse = 22.59 rmse_std_err = 4.63
#> #3: mtry = 1 trees = 750 rmse = 22.64 rmse_std_err = 4.65
#>
#> ── Top 6 best configs for v(1 2) (using 5-fold CV)
#> #1: mtry = 1 trees = 400 rmse = 21.57 rmse_std_err = 2.25
#> #2: mtry = 1 trees = 750 rmse = 21.59 rmse_std_err = 2.29
#> #3: mtry = 1 trees = 50 rmse = 22.38 rmse_std_err = 2.10
#> #4: mtry = 2 trees = 400 rmse = 22.54 rmse_std_err = 2.09
#> #5: mtry = 2 trees = 750 rmse = 22.65 rmse_std_err = 2.09
#> #6: mtry = 2 trees = 50 rmse = 23.12 rmse_std_err = 2.23
#>
#> ── Top 3 best configs for v(4) (using 5-fold CV)
#> #1: mtry = 1 trees = 750 rmse = 32.14 rmse_std_err = 4.32
#> #2: mtry = 1 trees = 400 rmse = 32.21 rmse_std_err = 4.31
#> #3: mtry = 1 trees = 50 rmse = 32.21 rmse_std_err = 4.25
#>
#> ── Top 3 best configs for v(1) (using 5-fold CV)
#> #1: mtry = 1 trees = 50 rmse = 30.34 rmse_std_err = 3.40
#> #2: mtry = 1 trees = 750 rmse = 30.53 rmse_std_err = 3.31
#> #3: mtry = 1 trees = 400 rmse = 30.63 rmse_std_err = 3.32
#>
#> ── Top 3 best configs for v(2) (using 5-fold CV)
#> #1: mtry = 1 trees = 750 rmse = 26.62 rmse_std_err = 2.33
#> #2: mtry = 1 trees = 400 rmse = 26.72 rmse_std_err = 2.29
#> #3: mtry = 1 trees = 50 rmse = 26.97 rmse_std_err = 2.24
#>
#> ── Top 9 best configs for v(1 2 4) (using 5-fold CV)
#> #1: mtry = 2 trees = 750 rmse = 19.81 rmse_std_err = 1.53
#> #2: mtry = 2 trees = 400 rmse = 19.85 rmse_std_err = 1.64
#> #3: mtry = 1 trees = 750 rmse = 19.93 rmse_std_err = 1.93
#> #4: mtry = 1 trees = 400 rmse = 20.18 rmse_std_err = 1.90
#> #5: mtry = 2 trees = 50 rmse = 20.41 rmse_std_err = 1.56
#> #6: mtry = 3 trees = 50 rmse = 20.69 rmse_std_err = 1.54
#> #7: mtry = 3 trees = 750 rmse = 20.74 rmse_std_err = 1.69
#> #8: mtry = 3 trees = 400 rmse = 20.77 rmse_std_err = 1.76
#> #9: mtry = 1 trees = 50 rmse = 20.79 rmse_std_err = 1.89
#>
#> ── Top 9 best configs for v(1 2 3) (using 5-fold CV)
#> #1: mtry = 2 trees = 400 rmse = 16.16 rmse_std_err = 2.75
#> #2: mtry = 3 trees = 400 rmse = 16.30 rmse_std_err = 2.80
#> #3: mtry = 2 trees = 750 rmse = 16.41 rmse_std_err = 2.79
#> #4: mtry = 3 trees = 750 rmse = 16.43 rmse_std_err = 2.82
#> #5: mtry = 3 trees = 50 rmse = 16.52 rmse_std_err = 2.52
#> #6: mtry = 1 trees = 750 rmse = 16.69 rmse_std_err = 3.15
#> #7: mtry = 2 trees = 50 rmse = 16.89 rmse_std_err = 2.76
#> #8: mtry = 1 trees = 400 rmse = 16.98 rmse_std_err = 2.93
#> #9: mtry = 1 trees = 50 rmse = 17.69 rmse_std_err = 3.16
#>
#> ── Top 9 best configs for v(1 3 4) (using 5-fold CV)
#> #1: mtry = 1 trees = 400 rmse = 21.88 rmse_std_err = 4.33
#> #2: mtry = 1 trees = 750 rmse = 21.96 rmse_std_err = 4.38
#> #3: mtry = 1 trees = 50 rmse = 22.03 rmse_std_err = 4.07
#> #4: mtry = 2 trees = 400 rmse = 22.65 rmse_std_err = 4.11
#> #5: mtry = 2 trees = 750 rmse = 22.72 rmse_std_err = 4.09
#> #6: mtry = 2 trees = 50 rmse = 22.89 rmse_std_err = 3.97
#> #7: mtry = 3 trees = 400 rmse = 23.38 rmse_std_err = 3.80
#> #8: mtry = 3 trees = 750 rmse = 23.50 rmse_std_err = 3.77
#> #9: mtry = 3 trees = 50 rmse = 23.88 rmse_std_err = 3.64
#>
#> ── Top 9 best configs for v(2 3 4) (using 5-fold CV)
#> #1: mtry = 3 trees = 50 rmse = 17.96 rmse_std_err = 1.34
#> #2: mtry = 1 trees = 50 rmse = 17.97 rmse_std_err = 2.40
#> #3: mtry = 1 trees = 750 rmse = 18.63 rmse_std_err = 1.99
#> #4: mtry = 2 trees = 400 rmse = 18.76 rmse_std_err = 1.42
#> #5: mtry = 1 trees = 400 rmse = 18.79 rmse_std_err = 2.14
#> #6: mtry = 2 trees = 750 rmse = 18.80 rmse_std_err = 1.49
#> #7: mtry = 3 trees = 750 rmse = 19.12 rmse_std_err = 1.68
#> #8: mtry = 3 trees = 400 rmse = 19.14 rmse_std_err = 1.65
#> #9: mtry = 2 trees = 50 rmse = 19.33 rmse_std_err = 1.67
We can look at the \(\operatorname{MSE}_v\) evaluation
criterion, and we see that cross-validation improves both the decision
tree and the random forest methods. The two cross-validated decision
tree methods are comparable, but the second version outperforms the
first version by a small margin. This comparison is somewhat unfair for
the empirical
approach, which also has hyperparameters we
could potentially tune. However, shapr
does not currently
provide a function to do this automatically. In the figure below, we
include a vertical line at the \(\operatorname{MSE}_v\) score of the
empirical
method for easier comparison.
Furthermore, we must consider that cross-validation drastically increases the elapsed time (seconds) and determine if the increased precision is worth the extra computational time. We also see that the complex random forest method performs significantly worse than the simple decision tree method. This result indicates that even though we do hyperparameter tuning, we still overfit the data.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_empirical 179.43 3.35
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
#> sep_tree_cv 222.71 12.35
#> sep_tree_cv_2 219.45 25.59
#> sep_rf 217.00 1.23
#> sep_rf_cv 212.64 30.98
The future
package can train the separate regression
models in parallel. More specifically, we parallelize both the training
step (when we fit the models) and the prediction step (when we compute
\(v(S)\)). In the general usage, we
also explain how to enable progress bars.
In the code chunk below, we consider four regression-based methods.
The first method uses xgboost
models with default
hyperparameter values, while the remaining three use cross-validation to
tune the number of trees. The second and third methods specify the same
potential hyperparameter values, but we run the former sequentially
while the latter is run in parallel to speed up the computations. The
fourth model is run in parallel but also tunes the depth of the trees
and not only the number of trees.
A small side note: If we let "vS_details" %in% verbose
,
we can see which tree
value shapr
chooses for
each coalition. We would then see that the values 25, 50, 100, and 500
are never chosen. Thus, we can remove these values without influencing
the result and instead do a finer grid search among the lower values. We
do this in the fourth method.
# Regular xgboost with default parameters
explanation_list$sep_xgboost <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::boost_tree(engine = "xgboost", mode = "regression")
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:44:57 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78636583f25.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Cross validate the number of trees
explanation_list$sep_xgboost_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model =
parsnip::boost_tree(trees = hardhat::tune(), engine = "xgboost", mode = "regression"),
regression.tune_values = expand.grid(trees = c(10, 15, 25, 50, 100, 500)),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:44:58 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786931fb90.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Cross validate the number of trees in parallel on two threads
future::plan(future::multisession, workers = 2)
explanation_list$sep_xgboost_cv_par <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model =
parsnip::boost_tree(trees = hardhat::tune(), engine = "xgboost", mode = "regression"),
regression.tune_values = expand.grid(trees = c(10, 15, 25, 50, 100, 500)),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:13 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786491f190d.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Use a finer grid of low values for `trees` and also tune `tree_depth`
future::plan(future::multisession, workers = 4) # Change to 4 threads due to more complex CV
explanation_list$sep_xgboost_cv_2_par <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::boost_tree(
trees = hardhat::tune(),
tree_depth = hardhat::tune(),
engine = "xgboost",
mode = "regression"
),
regression.tune_values = expand.grid(trees = c(8, 10, 12, 15), tree_depth = c(4, 6, 8)),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:26 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78652904b76.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
future::plan(future::sequential) # To return to non-parallel computation
Looking at the elapsed time, we see that the parallel version with two workers is faster than the sequential version. Note that the elapsed time of the parallel version is not reduced by a factor of two as the creation of the parallel processes creates some additional overhead, which is significant in this small example. However, parallelization will yield considerable relative time improvements in more complex situations. E.g., in settings with (more) training observations with more features (i.e., more coalitions to compute) and situations with more time-consuming cross-validation (i.e., more folds, hyperparameters to tune, or hyperparameter values to consider). Furthermore, we see that conducting the cross-validation has lowered the \(\operatorname{MSE}_v\)criterion drastically. Finally, note that we obtain the same value whether we run the cross-validation in parallel or sequentially.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_empirical 179.43 3.35
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
#> sep_tree_cv 222.71 12.35
#> sep_tree_cv_2 219.45 25.59
#> sep_rf 217.00 1.23
#> sep_rf_cv 212.64 30.98
#> sep_xgboost 197.72 0.86
#> sep_xgboost_cv 169.83 14.60
#> sep_xgboost_cv_par 169.83 12.33
#> sep_xgboost_cv_2_par 153.13 13.99
Since the regression_separate
methods train a new
regression model \(g_S(\boldsymbol{x}_S)\) for each coalition
\(S \subseteq \{1,2,\dots,M\}\), a
total of \(2^M-2\) models has to be
trained, which can be time-consuming for slowly fitted models. The minus
two corresponds to the empty and grand coalitions.
The regression_surrogate
method class builds on the
ideas from the regression_separate
class, but instead of
fitting a new regression model for each coalition, we train a single
regression model \(g(\tilde{\boldsymbol{x}}_S)\) for all
coalitions \(S \subseteq
\{1,2,\dots,M\}\) (except the empty and grand coalitions), where
\(\tilde{\boldsymbol{x}}_S\) is an
augmented version of \(\boldsymbol{x}_S\). See Section 3.6.1 in
Olsen et al. (2024) for more details and
examples.
We can also apply all the examples above for the separate regression method class to the surrogate regression method class.
We demonstrate the surrogate method class using several regression
models below. More specifically, we use linear regression, random forest
(with and without (some) cross-validation), and xgboost
(with and without (some) cross-validation).
# Compute the Shapley value explanations using a surrogate linear regression model
explanation_list$sur_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg()
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:40 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7867d8c7486.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using xgboost with default parameters as the surrogate model
explanation_list$sur_xgboost <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::boost_tree(engine = "xgboost", mode = "regression")
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:41 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78660457178.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using xgboost with parameters tuned by cross-validation as the surrogate model
explanation_list$sur_xgboost_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::boost_tree(
trees = hardhat::tune(),
tree_depth = hardhat::tune(),
engine = "xgboost",
mode = "regression"
),
regression.tune_values = expand.grid(trees = c(5, 15, 25), tree_depth = c(2, 6, 10)),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:41 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7865d9b9594.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using random forest with default parameters as the surrogate model
explanation_list$sur_rf <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::rand_forest(engine = "ranger", mode = "regression")
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:43 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78658adae63.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using random forest with parameters tuned by cross-validation as the surrogate model
explanation_list$sur_rf_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = "ranger", mode = "regression"
),
regression.tune_values = dials::grid_regular(
dials::mtry(c(1, ncol(x_explain))),
dials::trees(c(50, 750)),
levels = 6
),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:45:44 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78642356367.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
The code chunk below demonstrates how to run the surrogate regression
method class in parallel using the future
package. The
setup procedure is identical to the one we specified for separate regression method class.
The training step of the surrogate regression model can be run in
parallel if we tune some of its hyperparameters. We parallelize the
cross-validation procedure in the training step; hence, we apply no
parallelization in the training step of a surrogate model with specified
hyperparameters. Furthermore, we parallelize the prediction step (when
we compute \(v(S)\)) in the same way as
for the separate regression method class. Note that parallelization will
introduce some overhead, which can cause it to be slower than running
the code sequentially for smaller problems.
# Cross validate the number of trees in parallel on four threads
future::plan(future::multisession, workers = 4)
explanation_list$sur_rf_cv_par <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = "ranger", mode = "regression"
),
regression.tune_values = dials::grid_regular(
dials::mtry(c(1, ncol(x_explain))),
dials::trees(c(50, 750)),
levels = 6
),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:12 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7866c6710c6.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
future::plan(future::sequential) # To return to non-parallel computation
# Check that we get identical Shapley value explanations
all.equal(
explanation_list$sur_rf_cv$shapley_values_est,
explanation_list$sur_rf_cv_par$shapley_values_est
)
#> [1] TRUE
By looking at the \(\operatorname{MSE}_v\) evaluation criterion
and the elapsed time, we see that the surrogate methods (except the
linear regression model) outperform empirical
but are not
on the same level as the best separate regression methods. Furthermore,
parallelization (4 cores) decreased the elapsed time while obtaining the
same \(\operatorname{MSE}_v\) score.
The identical scores mean that the separate models are identical and
independent of whether they were run sequentially or in parallel.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_empirical 179.43 3.35
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
#> sep_tree_cv 222.71 12.35
#> sep_tree_cv_2 219.45 25.59
#> sep_rf 217.00 1.23
#> sep_rf_cv 212.64 30.98
#> sep_xgboost 197.72 0.86
#> sep_xgboost_cv 169.83 14.60
#> sep_xgboost_cv_par 169.83 12.33
#> sep_xgboost_cv_2_par 153.13 13.99
#> sur_lm 649.61 0.42
#> sur_xgboost 169.92 0.44
#> sur_xgboost_cv 169.87 2.01
#> sur_rf 201.23 0.69
#> sur_rf_cv 172.09 27.44
#> sur_rf_cv_par 172.09 20.81
# Compare the MSEv criterion of the different explanation methods.
# Include vertical line corresponding to the MSEv of the empirical method.
plot_MSEv_scores(explanation_list, method_line = "MC_empirical")
Even though the tidymodels
framework contains many models, we might
want to add additional methods. In the following section, we demonstrate
how to add the projection pursuit regression (PPR) model as a new method
that can be used by shapr
to compute the Shapley value
explanations, both as a separate and surrogate method.
We use the ppr()
implementation in the
stats
package to fit the PPR model. The model has several
hyperparameters that can be tuned, but the main hyperparameter is the
number of terms nterms
. The following is based on the tidymodels
guide on adding new regression models. We refer to that guide for
more details and explanations of the code below.
# Step 1: register the model, modes, and arguments
parsnip::set_new_model(model = "ppr_reg")
parsnip::set_model_mode(model = "ppr_reg", mode = "regression")
parsnip::set_model_engine(model = "ppr_reg", mode = "regression", eng = "ppr")
parsnip::set_dependency("ppr_reg", eng = "ppr", pkg = "stats")
# If your function has several parameters, then we add one of these functions for each parameter
parsnip::set_model_arg(
model = "ppr_reg",
eng = "ppr",
original = "nterms", # The original parameter name used in stats::ppr
parsnip = "num_terms", # Change parameter name to match tidymodels' name convention
func = list(pkg = "dials", fun = "num_terms"), # list(pkg = "stats", fun = "ppr"),
has_submodel = FALSE
)
# Step 2: create the model function
ppr_reg <- function(mode = "regression", engine = "ppr", num_terms = NULL) {
# Check for correct mode
if (mode != "regression") rlang::abort("`mode` should be 'regression'")
# Check for correct engine
if (engine != "ppr") rlang::abort("`engine` should be 'ppr'")
# Capture the arguments in quosures
args <- list(num_terms = rlang::enquo(num_terms))
# Save some empty slots for future parts of the specification
parsnip::new_model_spec(
"ppr_reg",
args = args,
eng_args = NULL,
mode = mode,
method = NULL,
engine = engine
)
}
# Step 3: add a fit module
parsnip::set_fit(
model = "ppr_reg",
eng = "ppr",
mode = "regression",
value = list(
interface = "formula",
protect = c("formula", "data", "weights"),
func = c(pkg = "stats", fun = "ppr"),
defaults = list()
)
)
parsnip::set_encoding(
model = "ppr_reg",
eng = "ppr",
mode = "regression",
options = list(
predictor_indicators = "traditional",
compute_intercept = TRUE,
remove_intercept = TRUE,
allow_sparse_x = FALSE
)
)
# Step 4: add modules for prediction
parsnip::set_pred(
model = "ppr_reg",
eng = "ppr",
mode = "regression",
type = "numeric",
value = list(
pre = NULL,
post = NULL,
func = c(fun = "predict"),
args = list(
object = quote(object$fit),
newdata = quote(new_data),
type = "numeric"
)
)
)
# Step 5: add tuning function (used by tune::tune_grid())
tunable.ppr_reg <- function(x, ...) {
tibble::tibble(
name = c("num_terms"),
call_info = list(list(pkg = NULL, fun = "num_terms")),
source = "model_spec",
component = "ppr_reg",
component_id = "main"
)
}
# Step 6: add updating function (used by tune::finalize_workflow())
update.ppr_reg <- function(object, parameters = NULL, num_terms = NULL, ...) {
rlang::check_installed("parsnip")
eng_args <- parsnip::update_engine_parameters(object$eng_args, fresh = TRUE, ...)
args <- list(num_terms = rlang::enquo(num_terms))
args <- parsnip::update_main_parameters(args, parameters)
parsnip::new_model_spec(
"ppr_reg",
args = args,
eng_args = eng_args,
mode = object$mode,
method = NULL,
engine = object$engine
)
}
We can now use the PPR model to compute the Shapley value
explanations. We can use it as a separate and surrogate regression
method, and we can either set the number of terms num_terms
to a specific value or use cross-validation to tune the hyperparameter.
We do all four combinations below.
# PPR separate with specified number of terms
explanation_list$sep_ppr <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = ppr_reg(num_terms = 2)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:33 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7863200c7e2.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# PPR separate with cross-validated number of terms
explanation_list$sep_ppr_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = ppr_reg(num_terms = hardhat::tune()),
regression.tune_values = dials::grid_regular(dials::num_terms(c(1, 4)), levels = 3),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:34 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7865e9cc51b.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# PPR surrogate with specified number of terms
explanation_list$sur_ppr <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = ppr_reg(num_terms = 3)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:44 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786796408ed.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# PPR surrogate with cross-validated number of terms
explanation_list$sur_ppr_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = ppr_reg(num_terms = hardhat::tune()),
regression.tune_values = dials::grid_regular(dials::num_terms(c(1, 8)), levels = 4),
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:45 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7863ebffc8.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
We can then compare the \(\operatorname{MSE}_v\) and some of the Shapley value explanations. We see that conducting cross-validation improves the evaluation criterion, but also increase the running time.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_empirical 179.43 3.35
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
#> sep_tree_cv 222.71 12.35
#> sep_tree_cv_2 219.45 25.59
#> sep_rf 217.00 1.23
#> sep_rf_cv 212.64 30.98
#> sep_xgboost 197.72 0.86
#> sep_xgboost_cv 169.83 14.60
#> sep_xgboost_cv_par 169.83 12.33
#> sep_xgboost_cv_2_par 153.13 13.99
#> sur_lm 649.61 0.42
#> sur_xgboost 169.92 0.44
#> sur_xgboost_cv 169.87 2.01
#> sur_rf 201.23 0.69
#> sur_rf_cv 172.09 27.44
#> sur_rf_cv_par 172.09 20.81
#> sep_ppr 327.23 0.76
#> sep_ppr_cv 246.28 10.24
#> sur_ppr 395.42 0.39
#> sur_ppr_cv 415.62 1.57
# Compare the MSEv criterion of the different explanation methods
plot_MSEv_scores(explanation_list, method_line = "MC_empirical")
In this section, we compute the Shapley value explanations for the
Monte Carlo-based methods in the shapr
package and compare
the results with all the regression-based methods above. The purpose of
this vignette is to demonstrate the rich possibilities that the
regression paradigm and the tidymodels
framework adds to
the shapr
package.
In the code chunk below, we compute the Shapley value explanations using the different Monte Carlo-based methods.
explanation_list_MC <- list()
# Compute the Shapley value explanations using the independence method
explanation_list_MC$MC_independence <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "independence",
phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:47 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: independence
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7862f2f3856.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Copy the Shapley value explanations for the empirical method
explanation_list_MC$MC_empirical <- explanation_list$MC_empirical
# Compute the Shapley value explanations using the gaussian method
explanation_list_MC$MC_gaussian <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:48 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: gaussian
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78626579ea5.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Compute the Shapley value explanations using the copula method
explanation_list_MC$MC_copula <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "copula",
phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:48 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: copula
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78623a24f70.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Compute the Shapley value explanations using the ctree method
explanation_list_MC$MC_ctree <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
phi0 = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:49 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: ctree
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78622c601b8.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Compute the Shapley value explanations using the vaeac method
explanation_list_MC$MC_vaeac <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "vaeac",
phi0 = p0,
vaeac.epochs = 10
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:46:50 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: vaeac
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7863c7421dd.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Combine the two explanations lists
explanation_list$MC_empirical <- NULL
explanation_list <- c(explanation_list_MC, explanation_list)
We then compare the compare the regression and Monte Carlo-based
methods by plotting the \(\operatorname{MSE}_v\) evaluation
criterion. We continue with include a vertical line corresponding to the
\(\operatorname{MSE}_v\) of the
MC_empirical
method to make the comparison easier.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list)
#> MSEv Time
#> MC_independence 206.92 0.63
#> MC_empirical 179.43 3.35
#> MC_gaussian 235.15 0.47
#> MC_copula 237.35 0.53
#> MC_ctree 190.82 1.90
#> MC_vaeac 145.06 125.93
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_reicpe_example 687.45 1.25
#> sep_tree_stump 218.05 1.00
#> sep_tree_default 177.68 0.73
#> sep_tree_cv 222.71 12.35
#> sep_tree_cv_2 219.45 25.59
#> sep_rf 217.00 1.23
#> sep_rf_cv 212.64 30.98
#> sep_xgboost 197.72 0.86
#> sep_xgboost_cv 169.83 14.60
#> sep_xgboost_cv_par 169.83 12.33
#> sep_xgboost_cv_2_par 153.13 13.99
#> sur_lm 649.61 0.42
#> sur_xgboost 169.92 0.44
#> sur_xgboost_cv 169.87 2.01
#> sur_rf 201.23 0.69
#> sur_rf_cv 172.09 27.44
#> sur_rf_cv_par 172.09 20.81
#> sep_ppr 327.23 0.76
#> sep_ppr_cv 246.28 10.24
#> sur_ppr 395.42 0.39
#> sur_ppr_cv 415.62 1.57
# Compare the MSEv criterion of the different explanation methods
# Include vertical line corresponding to the MSEv of the MC_empirical method
plot_MSEv_scores(explanation_list, method_line = "MC_empirical")
The vaeac
approach is the best-performing method
according to the \(\operatorname{MSE}_v\) evaluation
criterion, while the sep_xgboost_cv_2_par
is the
best-performing regression-based method. However, we should note that
the vaeac
method is much slower and that the difference
between the \(\operatorname{MSE}_v\)
values is minuscule and inside the confidence intervals.
We can also order the methods to more easily look at the order of the methods according to the \(\operatorname{MSE}_v\) criterion.
order <- get_k_best_methods(explanation_list, k = length(explanation_list))
plot_MSEv_scores(explanation_list[order], method_line = "MC_empirical")
We can also examine the different Shapley value explanations for the first six explicands (two at a time), and we still sort the methods from best to worst. Most methods agree in the general directions, especially for the most important features (the features with the largest absolute Shapley values), but there are some differences for the less important features. These tendencies/discrepancies are often more visible for the methods with poor/larger \(\operatorname{MSE}_v\) values.
Here, we focus on the five best methods (and
MC_empricial
) to make it easier to analyze the individual
Shapley value explanations, and we see a quite strong agreement between
the different methods.
In this section, we replicate and extend the mixed data example from
the general usage by demonstrating the separate and surrogate regression
methods. Of the Monte Carlo-based methods, only the
independence
(not recommended), ctree
, and
vaeac
methods support mixed data. We can divide the
regression models into two groups based on whether the model can handle
categorical features by default or if we need to apply pre-processing of
the categorical features. By pre-processing, we mean that we need to
convert the categorical features into numerical values using, for
example, dummy features. We demonstrate this below using the
regression.recipe_func
function.
First, we copy the setup from the general usage.
# convert the month variable to a factor
data_cat <- copy(data)[, Month_factor := as.factor(Month)]
data_train_cat <- data_cat[-ind_x_explain, ]
data_explain_cat <- data_cat[ind_x_explain, ]
x_var_cat <- c("Solar.R", "Wind", "Temp", "Month_factor")
x_train_cat <- data_train_cat[, ..x_var_cat]
x_explain_cat <- data_explain_cat[, ..x_var_cat]
p0_cat <- mean(y_train)
# Fitting an lm model here as xgboost does not handle categorical features directly
formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var_cat, collapse = " + ")))
model_cat <- lm(formula, data_train_cat)
# We could also consider other models such as random forest which supports mixed data
# model_cat <- ranger(formula, data_train_cat)
# List to store the explanations for this mixed data setup
explanation_list_mixed <- list()
Second, we compute the explanations using the Monte Carlo-based methods.
explanation_list_mixed$MC_independence <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "independence"
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:49:00 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: independence
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786590644b4.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
explanation_list_mixed$MC_ctree <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "ctree"
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:49:01 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: ctree
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786424cf108.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
explanation_list_mixed$MC_vaeac <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "vaeac"
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:49:03 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: vaeac
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7865c8607f1.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Third, we compute the Shapley value explanations using separate regression methods. We use many of the same regression models as we did above for the continuous data examples.
# Standard linear regression
explanation_list_mixed$sep_lm <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::linear_reg()
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:26 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7865611779c.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Linear regression where we have added splines to the numerical features
explanation_list_mixed$sep_splines <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = function(regression_recipe) {
return(step_ns(regression_recipe, all_numeric_predictors(), deg_free = 2))
}
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:27 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78640377130.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Decision tree with default parameters
explanation_list_mixed$sep_tree <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::decision_tree(engine = "rpart", mode = "regression")
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:28 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7864f9416f9.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Use trees with cross-validation on the depth and cost complexity. Manually set the values.
explanation_list_mixed$sep_tree_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::decision_tree(
tree_depth = hardhat::tune(),
cost_complexity = hardhat::tune(),
engine = "rpart",
mode = "regression"
),
regression.tune_values =
expand.grid(tree_depth = c(1, 3, 5), cost_complexity = c(0.001, 0.01, 0.1)),
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:29 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78668483482.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Random forest with default hyperparameters. Do NOT need to use dummy features.
explanation_list_mixed$sep_rf <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::rand_forest(engine = "ranger", mode = "regression")
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:55 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786c6b8fb0.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Random forest with cross validated hyperparameters.
explanation_list_mixed$sep_rf_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = "ranger", mode = "regression"
),
regression.tune_values =
function(x) {
dials::grid_regular(dials::mtry(c(1, ncol(x))), dials::trees(c(50, 750)), levels = 4)
},
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:52:56 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786296095ce.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Xgboost with default hyperparameters, but we have to dummy encode the factors
explanation_list_mixed$sep_xgboost <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::boost_tree(engine = "xgboost", mode = "regression"),
regression.recipe_func = function(regression_recipe) {
return(step_dummy(regression_recipe, all_factor_predictors()))
}
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:53:36 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78654cd396d.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Xgboost with cross validated hyperparameters and we dummy encode the factors
explanation_list_mixed$sep_xgboost_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_separate",
regression.model = parsnip::boost_tree(
trees = hardhat::tune(),
tree_depth = hardhat::tune(),
engine = "xgboost",
mode = "regression"
),
regression.recipe_func = function(regression_recipe) {
return(step_dummy(regression_recipe, all_factor_predictors()))
},
regression.tune_values = expand.grid(trees = c(5, 15, 25), tree_depth = c(2, 6, 10)),
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:53:37 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78626ab5001.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Fourth, we compute the Shapley value explanations using surrogate regression methods. We use the same regression models as we did above for separate regression method class.
# Standard linear regression
explanation_list_mixed$sur_lm <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg()
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:49 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f66caf9e9.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Linear regression where we have added splines to the numerical features
# NOTE, that we remove the augmented mask variables to avoid a rank-deficient fit
explanation_list_mixed$sur_splines <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = function(recipe) {
return(step_ns(recipe, all_numeric_predictors(), -starts_with("mask_"), deg_free = 2))
}
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:50 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f28c0612c.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Decision tree with default parameters
explanation_list_mixed$sur_tree <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::decision_tree(engine = "rpart", mode = "regression")
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:51 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345ff402e18.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Use trees with cross-validation on the depth and cost complexity. Manually set the values.
explanation_list_mixed$sur_tree_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::decision_tree(
tree_depth = hardhat::tune(),
cost_complexity = hardhat::tune(),
engine = "rpart",
mode = "regression"
),
regression.tune_values =
expand.grid(tree_depth = c(1, 3, 5), cost_complexity = c(0.001, 0.01, 0.1)),
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:51 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f159364cb.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Random forest with default hyperparameters. Do NOT need to use dummy features.
explanation_list_mixed$sur_rf <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::rand_forest(engine = "ranger", mode = "regression")
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:53 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f15d81edf.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Random forest with cross validated hyperparameters.
explanation_list_mixed$sur_rf_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = "ranger", mode = "regression"
),
regression.tune_values = expand.grid(mtry = c(1, 2, 4), trees = c(50, 250, 500, 750)),
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:03:54 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f5bb38b48.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Xgboost with default hyperparameters, but we have to dummy encode the factors
explanation_list_mixed$sur_xgboost <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::boost_tree(engine = "xgboost", mode = "regression"),
regression.recipe_func = function(regression_recipe) {
return(step_dummy(regression_recipe, all_factor_predictors()))
}
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:04:09 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f7f5cabf0.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Xgboost with cross validated hyperparameters and we dummy encode the factors
explanation_list_mixed$sur_xgboost_cv <- explain(
model = model_cat,
x_explain = x_explain_cat,
x_train = x_train_cat,
phi0 = p0_cat,
approach = "regression_surrogate",
regression.model = parsnip::boost_tree(
trees = hardhat::tune(),
tree_depth = hardhat::tune(),
engine = "xgboost",
mode = "regression"
),
regression.recipe_func = function(regression_recipe) {
return(step_dummy(regression_recipe, all_factor_predictors()))
},
regression.tune_values = expand.grid(trees = c(5, 15, 25), tree_depth = c(2, 6, 10)),
regression.vfold_cv_para = list(v = 5)
)
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 23:04:10 ──────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <lm>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmppO00aE/shapr_obj_1c345f339560be.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
Fifth, and finally, we compare the results. The surrogate random forest model performs well and outperforms the cross-validated version, but note the wide confidence interval. We see that several of the regression-based methods outperform the Monte Carlo-based methods. More specifically, three separate regression methods and three surrogate regression methods.
# Print the MSEv scores and the elapsed time (in seconds) for the different methods
print_MSEv_scores_and_time(explanation_list_mixed)
#> MSEv Time
#> MC_independence 641.82 0.86
#> MC_ctree 555.58 1.74
#> MC_vaeac 629.56 202.40
#> sep_lm 550.06 1.19
#> sep_splines 541.36 0.91
#> sep_tree 753.84 0.78
#> sep_tree_cv 756.27 26.22
#> sep_rf 518.27 1.18
#> sep_rf_cv 619.81 40.17
#> sep_xgboost 792.17 1.05
#> sep_xgboost_cv 595.98 18.70
#> sur_lm 610.61 0.45
#> sur_splines 596.86 0.43
#> sur_tree 677.04 0.47
#> sur_tree_cv 789.37 2.40
#> sur_rf 407.76 0.76
#> sur_rf_cv 520.63 15.18
#> sur_xgboost 606.92 0.47
#> sur_xgboost_cv 429.06 2.08
# Compare the MSEv criterion of the different explanation methods
# Include vertical line corresponding to the MSEv of the empirical method.
plot_MSEv_scores(explanation_list_mixed, method_line = "MC_ctree")
#> Error in MSEv_check_explanation_list(explanation_list): The object/objects 'sur_lm', 'sur_splines', 'sur_tree', 'sur_tree_cv', 'sur_rf', 'sur_rf_cv', 'sur_xgboost', 'sur_xgboost_cv' in `explanation_list` has/have a different `x_explain` than 'MC_independence'. Cannot compare them.
The best-performing methods are the surrogate random forest and
xgboost with cross-validation methods. The Monte Carlo-based methods
perform worse, with ctree
being the best, with a
seventh-place overall ranking.
We can also order the methods to more easily look at the order of the methods according to the \(\operatorname{MSE}_v\) criterion.
order <- get_k_best_methods(explanation_list_mixed, k = length(explanation_list_mixed))
plot_MSEv_scores(explanation_list_mixed[order], method_line = "MC_ctree")
#> Error in MSEv_check_explanation_list(explanation_list): The object/objects 'sep_rf', 'sep_splines', 'sep_lm', 'MC_ctree', 'sep_xgboost_cv', 'sep_rf_cv', 'MC_vaeac', 'MC_independence', 'sep_tree', 'sep_tree_cv', 'sep_xgboost' in `explanation_list` has/have a different `x_explain` than 'sur_rf'. Cannot compare them.
We also look at some of the Shapley value explanations and see that many methods produce similar explanations.
plot_SV_several_approaches(explanation_list_mixed[order], index_explicands = c(1, 2), facet_ncol = 1)
#> Error in plot_SV_several_approaches(explanation_list_mixed[order], index_explicands = c(1, : The object/objects 'sep_rf', 'sep_splines', 'sep_lm', 'MC_ctree', 'sep_xgboost_cv', 'sep_rf_cv', 'MC_vaeac', 'MC_independence', 'sep_tree', 'sep_tree_cv', 'sep_xgboost' in `explanation_list` has/have a different `x_explain` than 'sur_rf'. Cannot compare them.
We can also focus on the Shapley value explanations for the best five
methods according to the \(\operatorname{MSE}_v\) criterion. We also
include the ctree
method, the best-performing Monte
Carlo-based method.
best_methods <- get_k_best_methods(explanation_list_mixed, k = 5)
if (!"MC_ctree" %in% best_methods) best_methods <- c(best_methods, "MC_ctree")
plot_SV_several_approaches(explanation_list_mixed[best_methods], index_explicands = 1:4)
#> Error in plot_SV_several_approaches(explanation_list_mixed[best_methods], : The object/objects 'sep_rf', 'sep_splines', 'MC_ctree' in `explanation_list` has/have a different `x_explain` than 'sur_rf'. Cannot compare them.
In this section, we demonstrate that the
regression.model
, regression.tune_values
, and
regression.recipe_func
parameters can be provided as
strings. This is a property which is convenient if the
explain()
function is called from Python through the
associated shaprpy
Python library. That is, the user only
has to specify strings containing R code instead of having to deal with
creating the R objects in Python. In the code chunk below, we see that
we obtain identical \(\operatorname{MSE}_v\) scores for the
string and non-string versions.
explanation_list_str <- list()
explanation_list_str$sep_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = "parsnip::linear_reg()"
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:54:19 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7868fc9a14.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
explanation_list_str$sep_pcr <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = "parsnip::linear_reg()",
regression.recipe_func = "function(regression_recipe) {
return(recipes::step_pca(regression_recipe, recipes::all_numeric_predictors(), num_comp = 2))
}"
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:54:20 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa786726ad1cd.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
explanation_list_str$sep_splines <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = parsnip::linear_reg(),
regression.recipe_func = "function(regression_recipe) {
return(recipes::step_ns(regression_recipe, recipes::all_numeric_predictors(), deg_free = 2))
}"
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:54:20 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78619d633f2.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
explanation_list_str$sep_tree_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = "parsnip::decision_tree(
tree_depth = hardhat::tune(), engine = 'rpart', mode = 'regression'
)",
regression.tune_values = "dials::grid_regular(dials::tree_depth(), levels = 4)",
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:54:21 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa78654748ae9.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using random forest with parameters tuned by cross-validation
explanation_list_str$sep_rf_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_separate",
regression.model = "parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = 'ranger', mode = 'regression'
)",
regression.tune_values =
"function(x) {
dials::grid_regular(dials::mtry(c(1, ncol(x))), dials::trees(c(50, 750)), levels = 3)
}",
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:54:34 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_separate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7864275697e.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# Using random forest with parameters tuned by cross-validation as the surrogate model
explanation_list_str$sur_rf_cv <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
phi0 = p0,
approach = "regression_surrogate",
regression.model = "parsnip::rand_forest(
mtry = hardhat::tune(), trees = hardhat::tune(), engine = 'ranger', mode = 'regression'
)",
regression.tune_values = "dials::grid_regular(
dials::mtry(c(1, ncol(x_explain))),
dials::trees(c(50, 750)),
levels = 6
)",
regression.vfold_cv_para = list(v = 5)
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Success with message:
#> max_n_coalitions is NULL or larger than or 2^n_features = 16,
#> and is therefore set to 2^n_features = 16.
#>
#> ── Starting `shapr::explain()` at 2024-11-21 13:55:05 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> • Model class: <xgb.Booster>
#> • Approach: regression_surrogate
#> • Iterative estimation: FALSE
#> • Number of feature-wise Shapley values: 4
#> • Number of observations to explain: 20
#> • Computations (temporary) saved at: '/tmp/RtmpgQii3E/shapr_obj_1aa7867814a029.rds'
#>
#> ── Main computation started ──
#>
#> ℹ Using 16 of 16 coalitions.
# See that the evaluation scores match the non-string versions.
print_MSEv_scores_and_time(explanation_list_str)
#> MSEv Time
#> sep_lm 745.21 0.70
#> sep_pcr 784.91 0.92
#> sep_splines 165.13 0.92
#> sep_tree_cv 222.71 12.96
#> sep_rf_cv 212.64 30.79
#> sur_rf_cv 172.09 26.96
print_MSEv_scores_and_time(explanation_list[names(explanation_list_str)])
#> MSEv Time
#> sep_lm 745.21 0.75
#> sep_pcr 784.91 0.88
#> sep_splines 165.13 0.89
#> sep_tree_cv 222.71 12.35
#> sep_rf_cv 212.64 30.98
#> sur_rf_cv 172.09 27.44
This vignette demonstrates the rich possibilities that the regression
paradigm and the tidymodels
framework add to the
shapr
package. We have seen that regression-based methods
are on par with or outperform the Monte Carlo-based methods regarding
the \(\operatorname{MSE}_v\) evaluation
criterion. Furthermore, we have seen that the regression-based methods
are relatively computationally fast and that parallelization can be used
to speed up the computations.
These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.