The `iml` package can now handle bigger datasets. Earlier problems with exploding memory have been fixed for `FeatureEffect`, `FeatureImp` and `Interaction`. It is also possible now to compute `FeatureImp` and `Interaction` in parallel. This document describes how.
First we load some data, fit a random forest and create a Predictor object.
set.seed(42)
library("iml")
library("randomForest")
data("Boston", package = "MASS")
rf <- randomForest(medv ~ ., data = Boston, ntree = 10)
X <- Boston[which(names(Boston) != "medv")]
predictor <- Predictor$new(rf, data = X, y = Boston$medv)
Parallelization is supported via the {future} package. All you need to do is choose a parallel backend via `future::plan()`.
library("future")
library("future.callr")
#> Warning: package 'future.callr' was built under R version 4.3.3
# Run computations in 2 background R sessions started via callr
plan("callr", workers = 2)
Now we can compute the feature importance in parallel: the per-feature computations are distributed across the two workers specified above.
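As a minimal example (reusing the `predictor` object created above; plotting the result is standard iml usage):

```r
# With the callr plan active, the per-feature permutation work
# is distributed across the two background workers.
imp <- FeatureImp$new(predictor, loss = "mae")

# Inspect and visualize the importance values
imp$results
plot(imp)
```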
Let's measure how much speed-up we actually get from parallelization.
bench::system_time({
plan(sequential)
FeatureImp$new(predictor, loss = "mae")
})
#> Warning: package 'processx' was built under R version 4.3.2
#> Warning: package 'lattice' was built under R version 4.3.2
#> Warning: package 'callr' was built under R version 4.3.2
#> Warning: package 'ps' was built under R version 4.3.2
#> Warning: package 'rpart' was built under R version 4.3.2
#> Warning: package 'patchwork' was built under R version 4.3.2
#> Warning: package 'survival' was built under R version 4.3.2
#> Warning: package 'Rcpp' was built under R version 4.3.2
#> process real
#> 1.94s 3.31s
bench::system_time({
plan("callr", workers = 2)
FeatureImp$new(predictor, loss = "mae")
})
#> process real
#> 125ms 6.78s
Not much improvement here: the overhead of starting the workers eats up the gains for such a small problem. Parallelization pays off when the model uses many features, or when the feature importance computation is repeated many times to get more stable results.
bench::system_time({
plan(sequential)
FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#> process real
#> 2.94s 5.49s
bench::system_time({
plan("callr", workers = 2)
FeatureImp$new(predictor, loss = "mae", n.repetitions = 10)
})
#> process real
#> 296.88ms 6.79s
With more repetitions, the parallel version spends far less process time on the main session, although the fixed worker start-up cost still dominates the wall-clock time in this small example. The relative benefit grows with the number of features and repetitions, as the start-up cost is amortized.
The parallelization also speeds up the computation of the interaction statistics:
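The benchmarking pattern from above carries over directly. A sketch (exact timings depend on your machine and on the worker start-up overhead):

```r
# Sequential computation of the H-statistic interactions
bench::system_time({
  plan(sequential)
  Interaction$new(predictor)
})

# Same computation, distributed across 2 callr workers
bench::system_time({
  plan("callr", workers = 2)
  Interaction$new(predictor)
})
```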