Here we will use the HR churn data (https://www.kaggle.com/ludobenistant/hr-analytics/data) to present the breakDown package for ranger models.
The data ships with the breakDown package as HR_data.
library(breakDown)
head(HR_data, 3)
#> satisfaction_level last_evaluation number_project average_montly_hours
#> 1 0.38 0.53 2 157
#> 2 0.80 0.86 5 262
#> 3 0.11 0.88 7 272
#> time_spend_company Work_accident left promotion_last_5years sales salary
#> 1 3 0 1 0 sales low
#> 2 6 0 1 0 sales medium
#> 3 4 0 1 0 sales medium
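Now let’s fit a ranger classification forest for churn, which is recorded in the left variable. The response is converted to a factor first, so that ranger grows a classification forest rather than a regression forest.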
library(ranger)
HR_data$left <- factor(HR_data$left)
model <- ranger(left ~ ., data = HR_data, importance = 'impurity', min.node.size = 10)
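Because the model was fitted with importance = 'impurity', importance() returns the impurity-based (Gini) variable importance, aggregated over all trees in the forest.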
importance(model)
#> satisfaction_level last_evaluation number_project
#> 1798.697201 605.339232 995.409953
#> average_montly_hours time_spend_company Work_accident
#> 741.670524 991.940335 27.786030
#> promotion_last_5years sales salary
#> 4.492048 41.708270 30.015464
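The raw impurity values are easier to compare once they are sorted and expressed as shares of the total. A minimal sketch in base R, with the values copied from the output above into a hypothetical vector `imp` (with the fitted model at hand, `imp <- importance(model)` should give the same named vector):

```r
# Importance values copied from the importance(model) output above.
# `imp` and `imp_share` are illustrative names, not part of ranger or breakDown.
imp <- c(
  satisfaction_level    = 1798.697201,
  last_evaluation       = 605.339232,
  number_project        = 995.409953,
  average_montly_hours  = 741.670524,
  time_spend_company    = 991.940335,
  Work_accident         = 27.786030,
  promotion_last_5years = 4.492048,
  sales                 = 41.708270,
  salary                = 30.015464
)

# Sort from most to least important and convert to percentage shares.
imp_share <- round(100 * sort(imp, decreasing = TRUE) / sum(imp), 1)
imp_share
```

In this ordering satisfaction_level comes first with about a third of the total, followed by number_project and time_spend_company.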
But how do we understand which factors drive the prediction for a single observation? With the breakDown package!
Below are explanations of the trees' votes for two observations.
library(ggplot2)
explain_1 <- broken(model, HR_data[1159,])
explain_1
#> contribution
#> time_spend_company = 2 0.042
#> satisfaction_level = 0.57 0.040
#> number_project = 4 0.036
#> average_montly_hours = 219 0.035
#> last_evaluation = 0.85 0.030
#> Work_accident = 1 0.022
#> sales = sales 0.018
#> salary = medium 0.013
#> promotion_last_5years = 0 0.002
#> final_prognosis 0.238
#> baseline: 0.5
plot(explain_1) + scale_y_continuous( limits = c(0,1), name = "fraction of trees", expand = c(0,0))
#> Scale for 'y' is already present. Adding another scale for 'y', which
#> will replace the existing scale.
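The printed table is additive: the individual contributions sum to final_prognosis, and adding the baseline gives the value on the plot's y axis. A quick base-R check, with the contributions copied from the output above into a hypothetical vector `contrib`. Reading the result as a fraction of trees is an assumption based on the axis label used in the plot call.

```r
# Contributions copied from the printed explanation of observation 1159.
# `contrib` is an illustrative name, not an object created by breakDown.
contrib <- c(
  time_spend_company    = 0.042,
  satisfaction_level    = 0.040,
  number_project        = 0.036,
  average_montly_hours  = 0.035,
  last_evaluation       = 0.030,
  Work_accident         = 0.022,
  sales                 = 0.018,
  salary                = 0.013,
  promotion_last_5years = 0.002
)
baseline <- 0.5

# The contributions add up to the final_prognosis row of the table.
final_prognosis <- sum(contrib)
final_prognosis

# Baseline plus contributions: the level the last bar of the plot reaches.
baseline + final_prognosis
```

The sum of the contributions is 0.238, which matches the final_prognosis row, so the explained value for this observation is 0.738.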
explain_1 <- broken(model, HR_data[10099,])
explain_1
#> contribution
#> time_spend_company = 5 -0.040
#> last_evaluation = 0.83 -0.039
#> satisfaction_level = 0.73 -0.039
#> number_project = 5 -0.039
#> average_montly_hours = 266 -0.039
#> salary = low -0.022
#> sales = sales -0.019
#> Work_accident = 0 -0.017
#> promotion_last_5years = 0 -0.010
#> final_prognosis -0.264
#> baseline: 0.5
plot(explain_1) + scale_y_continuous( limits = c(0,1), name = "fraction of trees", expand = c(0,0))
#> Scale for 'y' is already present. Adding another scale for 'y', which
#> will replace the existing scale.
This is not the right approach.