{effectplots} is an R package for calculating and plotting feature effects of any model. It is very fast thanks to {collapse}.
The main function feature_effects() crunches these statistics per feature X over values/bins:

- Average observed y values: descriptive associations between response and feature.
- Average predicted values: combined effect of X and other features.
- Partial dependence: effect of X with the values of other features held fixed.
- Accumulated local effects (ALE): an alternative to partial dependence for continuous features.

Furthermore, it calculates counts, weight sums, average residuals, and standard deviations of observed y and residuals. All statistics respect optional case weights.
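For illustration, a minimal sketch of passing case weights, assuming the helpers accept a w argument with one weight per observation (see the package documentation for the exact signature):

library(effectplots)

# Toy example: average observed mpg per cylinder count, weighting each
# car by its weight column (the w argument is an assumption here)
average_observed(mtcars["cyl"], y = mtcars$mpg, w = mtcars$wt) |>
  plot()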
We highly recommend Christoph Molnar’s book [3] for more info on feature effects.
On a normal laptop, it takes about one second to compute all statistics for 10 features on 10 million rows (plus prediction time).
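To check such timings on your own machine, here is a rough sketch on simulated data that times just the descriptive part (no model involved; figures vary with hardware and bin settings):

library(effectplots)

# 10 numeric features; increase n to 1e7 to mimic the 10 million row claim
n <- 1e6
X <- data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- rnorm(n)

system.time(
  average_observed(X, y = y)
)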
Workflow
1. Crunch the statistics via feature_effects() or the little helpers average_observed(), partial_dependence(), etc.
2. update(): Combine rare levels of categorical features, sort results by importance, turn values of discrete features into factors, etc.
3. plot(): Choose between ggplot2/patchwork and plotly.

Installation

You can install the development version of {effectplots} from GitHub with:
# install.packages("pak")
::pak("mayer79/effectplots", dependencies = TRUE) pak
We use a 1 million row dataset on Motor TPL (third-party liability) insurance. The aim is to model claim frequency. Before modeling, we want to study the association between features and response.
library(effectplots)
library(OpenML)
library(lightgbm)
set.seed(1)
df <- getOMLDataSet(data.id = 45106L)$data

xvars <- c("year", "town", "driver_age", "car_weight", "car_power", "car_age")
# 0.1s on laptop
average_observed(df[xvars], y = df$claim_nb) |>
  update(to_factor = TRUE) |>
  plot(share_y = "all")
A shared y axis helps to compare the strength of the association across features.
Next, let’s fit a boosted trees model.
ix <- sample(nrow(df), 0.8 * nrow(df))
train <- df[ix, ]
test <- df[-ix, ]
X_train <- data.matrix(train[xvars])
X_test <- data.matrix(test[xvars])
# Training, using slightly optimized parameters found via cross-validation
params <- list(
  learning_rate = 0.05,
  objective = "poisson",
  num_leaves = 7,
  min_data_in_leaf = 50,
  min_sum_hessian_in_leaf = 0.001,
  colsample_bynode = 0.8,
  bagging_fraction = 0.8,
  lambda_l1 = 3,
  lambda_l2 = 5,
  num_threads = 7
)
fit <- lgb.train(
  params = params,
  data = lgb.Dataset(X_train, label = train$claim_nb),
  nrounds = 300
)
Let’s crunch all statistics on the test data. Sorting is done by weighted variance of partial dependence, a main-effect importance measure related to [4].
The average predictions closely follow the average observed values, i.e., the model seems to do a good job. Comparing partial dependence/ALE with average predictions gives insight into whether an effect mainly comes from the feature on the x axis or from other, correlated features.
# 0.1s + 0.15s prediction time
feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb) |>
  update(sort_by = "pd") |>
  plot()
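For intuition, the sort_by = "pd" criterion is presumably the variance of the partial dependence values, weighted by bin exposure. A rough sketch of the idea (illustrative only, not the package's internal code):

# Weighted variance of PD values as a main-effect importance measure
weighted_variance <- function(pd, w) {
  m <- weighted.mean(pd, w)
  sum(w * (pd - m)^2) / sum(w)
}

# Hypothetical PD values per bin, with bin counts as weights
weighted_variance(pd = c(0.10, 0.12, 0.20), w = c(500, 300, 200))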
What about combining training and test results? Or comparing different models or subgroups? No problem:
m_train <- feature_effects(fit, v = xvars, data = X_train, y = train$claim_nb)
m_test <- feature_effects(fit, v = xvars, data = X_test, y = test$claim_nb)

# Pick top 3 based on train
m_train <- m_train |>
  update(sort_by = "pd") |>
  head(3)
m_test <- m_test[names(m_train)]
# Concatenate train and test results and plot them
c(m_train, m_test) |>
  plot(
    share_y = "rows",
    ncol = 2,
    byrow = FALSE,
    stats = c("y_mean", "pred_mean"),
    subplot_titles = FALSE,
    # plotly = TRUE,
    title = "Left: Train - Right: Test"
  )
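The same mechanics cover subgroup comparisons. A hedged sketch, assuming the town feature is binary (0/1) so it can split the test data:

# Compare effects between subgroups via c() (town assumed binary)
g <- X_test[, "town"] == 1
v_sub <- setdiff(xvars, "town")

m_town <- feature_effects(fit, v = v_sub, data = X_test[g, ], y = test$claim_nb[g])
m_rest <- feature_effects(fit, v = v_sub, data = X_test[!g, ], y = test$claim_nb[!g])

c(m_town, m_rest) |>
  plot(ncol = 2, byrow = FALSE, subplot_titles = FALSE, title = "Left: town - Right: rest")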
To look closer at bias, let’s select the statistic “resid_mean” along with pointwise 95% confidence intervals for the true conditional bias.
c(m_train, m_test) |>
  update(drop_below_n = 50) |>
  plot(
    ylim = c(-0.07, 0.08),
    ncol = 2,
    byrow = FALSE,
    stats = "resid_mean",
    subplot_titles = FALSE,
    title = "Left: Train - Right: Test",
    # plotly = TRUE,
    interval = "ci"
  )
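For reference, a pointwise interval like this is presumably the usual normal approximation per bin, built from the bin's average residual, residual standard deviation, and count:

# Normal-approximation 95% CI for the average residual in one bin (sketch)
ci_resid <- function(mean_resid, sd_resid, n, alpha = 0.05) {
  half <- qnorm(1 - alpha / 2) * sd_resid / sqrt(n)
  c(lower = mean_resid - half, upper = mean_resid + half)
}

ci_resid(mean_resid = 0.01, sd_resid = 0.3, n = 400)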
Most models work out of the box, including DALEX explainers and Tidymodels models. If not, a tailored prediction function can be specified.
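For instance, ranger's predict() returns a list rather than a numeric vector. A hedged sketch, assuming a pred_fun argument that takes the model and the data (check ?feature_effects for the exact interface):

library(effectplots)
library(ranger)

fit <- ranger(Sepal.Length ~ ., data = iris)

# Extract the numeric predictions from ranger's list output
feature_effects(
  fit,
  v = c("Sepal.Width", "Petal.Length"),
  data = iris,
  y = iris$Sepal.Length,
  pred_fun = function(m, X) predict(m, X)$predictions
) |>
  plot()

Alternatively, wrap the model in a DALEX explainer, as in the following example: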
library(effectplots)
library(DALEX)
library(ranger)
set.seed(1)
fit <- ranger(Sepal.Length ~ ., data = iris)
ex <- DALEX::explain(fit, data = iris[, -1], y = iris[, 1])

feature_effects(ex, breaks = 5) |>
  plot(share_y = "all")
Note that ALE plots are only available for continuous variables.
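To show only the ALE curves, you could presumably select the corresponding statistic when plotting; the statistic name "ale" is an assumption here:

# Hypothetical: restrict the plot to ALE curves (statistic name assumed)
feature_effects(ex, breaks = 5) |>
  plot(stats = "ale", share_y = "all")

Next, let's look at a Tidymodels workflow.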
library(effectplots)
library(tidymodels)
set.seed(1)
<- c("carat", "color", "clarity", "cut")
xvars
<- initial_split(diamonds)
split <- training(split)
train <- testing(split)
test
<- train |>
dia_recipe recipe(reformulate(xvars, "price"))
<- rand_forest(trees = 100) |>
mod set_engine("ranger") |>
set_mode("regression")
<- workflow() |>
dia_wf add_recipe(dia_recipe) |>
add_model(mod)
<- dia_wf |>
fit fit(train)
<- feature_effects(fit, v = xvars, data = train, y = "price")
M_train <- feature_effects(fit, v = xvars, data = test, y = "price")
M_test
plot(
  M_train + M_test,
  byrow = FALSE,
  ncol = 2,
  share_y = "rows",
  # 45-degree x labels for the factor-heavy features; 0/45 per feature,
  # doubled because each feature appears twice (train and test panels)
  rotate_x = rep(45 * xvars %in% c("clarity", "cut"), each = 2),
  subplot_titles = FALSE,
  # plotly = TRUE,
  title = "Left: train - Right: test"
)
Probabilistic classifiers return one prediction column per class. Here, we focus on a single class.
library(effectplots)
library(ranger)
set.seed(1)
fit <- ranger(Species ~ ., data = iris, probability = TRUE)

M <- partial_dependence(
  fit,
  v = colnames(iris[1:4]),
  data = iris,
  which_pred = 1  # "setosa" is the first class
)

plot(M, bar_height = 0.33, ylim = c(0, 0.7))