The iml package can now handle bigger datasets. Earlier problems with exploding memory have been fixed for Partial, FeatureImp and Interaction. It's also possible now to compute FeatureImp and Interaction in parallel. This document describes how.
First we load some data, fit a random forest and create a Predictor object.
set.seed(42)
library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf = randomForest(medv ~ ., data = Boston, ntree = 500)
X = Boston[which(names(Boston) != "medv")]
predictor = Predictor$new(rf, data = X, y = Boston$medv)
You need to install the doParallel package or a similar parallelization framework to compute in parallel. Before you can use parallelization to compute, for example, the feature importance on multiple CPU cores, you have to set up a cluster. Fortunately, the doParallel package makes it easy to set up and register a cluster:
library("doParallel")
#> Loading required package: iterators
#> Loading required package: parallel
# Creates a cluster with 2 cores
cl = makePSOCKcluster(2)
# Registers cluster
registerDoParallel(cl)
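The choice of 2 cores above is arbitrary. A common pattern is to size the cluster from parallel::detectCores(), leaving one core free for the rest of the system; this is a sketch, and the names n_cores and n_workers are introduced here for illustration:

```r
library("parallel")
# Detect the number of available cores on this machine
n_cores = detectCores()
# Rule of thumb: leave one core free for the operating system and other processes
n_workers = max(1, n_cores - 1)
# The cluster above could then be created with:
# cl = makePSOCKcluster(n_workers)
```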
Now we can easily compute the feature importance in parallel. This means that the computation per feature is distributed among the 2 cores we registered earlier.
imp = FeatureImp$new(predictor, loss = "mae", parallel = TRUE)
plot(imp)
That wasn't very impressive, so let's measure how much speedup we actually get from parallelization.
system.time(FeatureImp$new(predictor, loss = "mae", parallel = FALSE))
#> user system elapsed
#> 2.690 0.064 1.525
system.time(FeatureImp$new(predictor, loss = "mae", parallel = TRUE))
#> user system elapsed
#> 0.361 0.023 1.683
A slight improvement, but not too impressive. Parallelization is more useful when the model uses many features or when the feature importance computation is repeated more often (via n.repetitions) to get more stable results.
system.time(FeatureImp$new(predictor, loss = "mae", parallel = FALSE, n.repetitions = 200))
#> user system elapsed
#> 92.704 1.904 96.500
system.time(FeatureImp$new(predictor, loss = "mae", parallel = TRUE, n.repetitions = 200))
#> user system elapsed
#> 0.242 0.023 53.974
Here the parallel computation is almost twice as fast as the sequential computation of the feature importance.
Parallelization also speeds up the computation of the interaction statistics:
system.time(Interaction$new(predictor, parallel = FALSE))
#> user system elapsed
#> 31.672 1.317 29.234
system.time(Interaction$new(predictor, parallel = TRUE))
#> user system elapsed
#> 0.271 0.034 16.594
Remember to stop the cluster when you are done:
stopCluster(cl)
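After stopping the cluster, you can also de-register the parallel backend so that later foreach-based computations fall back to sequential execution instead of pointing at the stopped cluster; foreach provides registerDoSEQ() for this:

```r
library("foreach")
# Restore the sequential backend; subsequent foreach loops run in the main process
registerDoSEQ()
```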