In this vignette, we present a global variable importance measure based on Partial Dependence Profiles (PDP) for the random forest regression model.
We work with the apartments dataset from the DALEX package. Its first rows, as printed by head(apartments), are:
#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
#> 4     3517              1995      93     7        3      Ochota
#> 5     3013              1992     144     6        5     Mokotow
#> 6     5795              1926      61     6        2 Srodmiescie
Now we fit a random forest regression model and wrap it with the explain() function from DALEX.
library("DALEX")         # provides apartments, apartmentsTest and explain()
library("randomForest")
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[,2:5], y = apartmentsTest$m2.price)
#> Preparation of a new explainer is initiated
#> -> model label : randomForest ( default )
#> -> data : 9000 rows 4 cols
#> -> target variable : 9000 values
#> -> predict function : yhat.randomForest will be used ( default )
#> -> predicted values : numerical, min = 2121.14 , mean = 3515.047 , max = 5261.62
#> -> model_info : package randomForest , ver. 4.6.14 , task regression ( default )
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -1227.352 , mean = -3.523581 , max = 2186.873
#> A new explainer has been created!
Let us look at the Partial Dependence Profiles calculated with the DALEX::model_profile() function. PDPs can also be calculated with DALEX::variable_profile() or ingredients::partial_dependence().
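To make concrete what a Partial Dependence Profile is, here is a minimal base R sketch of the computation on synthetic data. It is not the DALEX implementation. With the explainer above, the profiles would presumably come from a call such as model_profile(explainer_rf). The helper name partial_dependence_sketch, the synthetic columns, and the grid are assumptions made for illustration.

```r
# Synthetic data; the column names only mimic the vignette's variables.
set.seed(1)
n <- 200
df <- data.frame(surface = runif(n, 20, 150),
                 floor   = sample(1:10, n, replace = TRUE))
df$price <- 5000 - 10 * df$surface + 20 * df$floor + rnorm(n, sd = 50)

model <- lm(price ~ surface + floor, data = df)

# Partial dependence on one variable (hypothetical helper):
# for each grid value, overwrite that variable in every row,
# predict, and average the predictions.
partial_dependence_sketch <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  })
}

grid <- seq(20, 150, length.out = 14)
pdp_surface <- partial_dependence_sketch(model, df, "surface", grid)
```

For a linear model the resulting profile is a straight line whose slope equals the fitted coefficient of the variable. A random forest would typically give a non-linear profile.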
Next, we calculate a measure of global variable importance based on the oscillation of the PDP.
The most important variable is surface, followed by no.rooms, floor, and construction.year.
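The vignette does not spell out the formula behind the oscillation measure. As a rough sketch, assume it summarizes how far a profile moves around its own average, taken here as the mean absolute deviation of the profile values. This definition and the function name oscillation_sketch are assumptions, and the package may use a different summary such as the standard deviation.

```r
# Assumed definition for this sketch: mean absolute deviation of the
# profile values from their average. A flat profile scores 0.
oscillation_sketch <- function(profile_values) {
  mean(abs(profile_values - mean(profile_values)))
}

# Two hand-made profiles on a common grid (illustrative values only,
# not taken from the apartments models).
flat_profile  <- rep(3500, 5)
steep_profile <- c(3000, 3250, 3500, 3750, 4000)

oscillation_sketch(flat_profile)   # 0
oscillation_sketch(steep_profile)  # 300
```

Under this assumption, a variable whose profile is flat has no influence on the average prediction and gets an importance of zero. A steeper or more variable profile gets a larger value.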
Let us create a linear regression model and its explainer object.
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                            no.rooms, data = apartments)
explainer_lm <- explain(apartments_lm_model,
                        data = apartmentsTest[,2:5], y = apartmentsTest$m2.price)
#> Preparation of a new explainer is initiated
#> -> model label : lm ( default )
#> -> data : 9000 rows 4 cols
#> -> target variable : 9000 values
#> -> predict function : yhat.lm will be used ( default )
#> -> predicted values : numerical, min = 2231.8 , mean = 3507.346 , max = 4769.053
#> -> model_info : package stats , ver. 3.6.3 , task regression ( default )
#> -> residual function : difference between y and yhat ( default )
#> -> residuals : numerical, min = -733.2516 , mean = 4.177813 , max = 2107.979
#> A new explainer has been created!
We calculate the Partial Dependence Profiles and the importance measure for the linear model as well.
Now we can compare the order of variable importance across the two models.
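The comparison across models can be sketched in base R as follows. Everything here is illustrative: the data are synthetic, the two lm fits (labelled linear and quadratic) merely stand in for the vignette's linear model and random forest, and the helpers pdp_values and oscillation are hypothetical names. The oscillation is assumed to be the mean absolute deviation of the profile from its average.

```r
# Synthetic data with a mildly non-linear effect of surface.
set.seed(2)
n <- 300
df <- data.frame(surface = runif(n, 20, 150),
                 floor   = runif(n, 1, 10))
df$price <- 5000 - 10 * df$surface + 0.04 * (df$surface - 85)^2 +
  30 * df$floor + rnorm(n, sd = 50)

# Two stand-in models fitted to the same data.
models <- list(
  linear    = lm(price ~ surface + floor, data = df),
  quadratic = lm(price ~ surface + I(surface^2) + floor, data = df)
)

# Averaged predictions over a grid of values of one variable.
pdp_values <- function(model, data, variable, n_grid = 20) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = n_grid)
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified))
  })
}

# Assumed oscillation measure: mean absolute deviation from the average.
oscillation <- function(values) mean(abs(values - mean(values)))

variables <- c("surface", "floor")

# Matrix of importances: one row per variable, one column per model.
importance <- sapply(models, function(model) {
  sapply(variables, function(v) oscillation(pdp_values(model, df, v)))
})

# Variables ordered from most to least important, one column per model.
ranking <- apply(importance, 2, function(col) names(sort(col, decreasing = TRUE)))
```

Printing importance shows the measure per variable and model, and ranking shows the resulting order for each model side by side.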