The explore package offers a simplified way to use machine learning to understand and explain patterns in the data.
explain_tree() creates a decision tree. The target can be binary, categorical or numerical.
explain_forest() creates a random forest. The target can be binary, categorical or numerical.
explain_xgboost() creates an XGBoost model. The target must be binary (0/1, FALSE/TRUE).
explain_logreg() creates a logistic regression. The target must be binary.
balance_target() balances a target.
weight_target() creates weights for the decision tree.
We use synthetic data in this example.
library(dplyr)
library(explore)
data <- create_data_buy(obs = 1000)
glimpse(data)
#> Rows: 1,000
#> Columns: 13
#> $ period <int> 202012, 202012, 202012, 202012, 202012, 202012, 202012~
#> $ buy <int> 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, ~
#> $ age <int> 39, 57, 55, 66, 71, 44, 64, 51, 70, 44, 58, 47, 68, 71~
#> $ city_ind <int> 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, ~
#> $ female_ind <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, ~
#> $ fixedvoice_ind <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ~
#> $ fixeddata_ind <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ fixedtv_ind <int> 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, ~
#> $ mobilevoice_ind <int> 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, ~
#> $ mobiledata_prd <chr> "NO", "NO", "MOBILE STICK", "NO", "BUSINESS", "BUSINES~
#> $ bbi_speed_ind <int> 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, ~
#> $ bbi_usg_gb <int> 77, 49, 53, 44, 55, 93, 50, 64, 63, 87, 45, 45, 70, 79~
#> $ hh_single <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, ~
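With the data in place, a decision tree can be grown directly; a minimal call, following the function overview above, looks like this:
# grow a decision tree that explains buy (plotted by default)
data %>% explain_tree(target = buy)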
To get the model itself as output you can use the parameter out = "model", or out = "all" to get everything (feature importance as plot and table, plus the trained model). To use the model for a prediction, you can use predict_target().
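A small sketch of that idea, assuming explain_forest() accepts the out parameter described above:
# return the trained random forest instead of the default plot
model <- data %>% explain_forest(target = buy, out = "model")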
As XGBoost only accepts numeric variables, we use drop_var_not_numeric() to drop mobiledata_prd, as it is not a numeric variable. An alternative would be to convert the non-numeric variables into numeric ones.
Use parameter out = "all"
to get more details about the
training
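The call that produced the train object used below is not shown in this excerpt; a plausible reconstruction, assuming the steps just described, is:
# drop the non-numeric variable, then train XGBoost with full output
data <- data %>% drop_var_not_numeric()
train <- data %>% explain_xgboost(target = buy, out = "all")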
train$importance
#> variable gain cover frequency importance
#> 1: age 0.438876303 0.269718075 0.22916667 0.438876303
#> 2: bbi_usg_gb 0.273748162 0.309418667 0.31250000 0.273748162
#> 3: female_ind 0.148257513 0.145936389 0.13095238 0.148257513
#> 4: fixedtv_ind 0.087929669 0.126867898 0.12500000 0.087929669
#> 5: bbi_speed_ind 0.022082801 0.057552002 0.07440476 0.022082801
#> 6: city_ind 0.018582322 0.064343469 0.06845238 0.018582322
#> 7: hh_single 0.005310378 0.010887744 0.02083333 0.005310378
#> 8: fixedvoice_ind 0.003014395 0.008814722 0.02083333 0.003014395
#> 9: mobilevoice_ind 0.002198457 0.006461034 0.01785714 0.002198457
train$tune_data
#> model_nr eta max_depth runtime iter train_auc_mean test_auc_mean
#> 1: 1 0.30 3 0 mins 21 0.9599662 0.9259773
#> 2: 2 0.10 3 0 mins 52 0.9572578 0.9271735
#> 3: 3 0.01 3 0 mins 551 0.9592901 0.9268295
#> 4: 4 0.30 5 0 mins 13 0.9762086 0.9212647
#> 5: 5 0.10 5 0 mins 38 0.9773647 0.9258133
#> 6: 6 0.01 5 0 mins 71 0.9601223 0.9256453
To use the model for a prediction, you can use predict_target().
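A sketch of scoring data, assuming the model trained with out = "all" is stored in train$model and that predict_target() takes data and model arguments:
# add a prediction column based on the trained XGBoost model
data_scored <- data %>% predict_target(model = train$model)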
data %>% explain_logreg(target = buy)
#> # A tibble: 6 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 5.87 0.544 10.8 3.88e-27
#> 2 age -0.146 0.0106 -13.8 3.49e-43
#> 3 city_ind 0.711 0.183 3.89 1.02e- 4
#> 4 female_ind 1.75 0.186 9.38 6.91e-21
#> 5 fixedtv_ind 1.51 0.190 7.93 2.14e-15
#> 6 bbi_usg_gb -0.0000724 0.0000904 -0.801 4.23e- 1
If you have a data set with a very unbalanced target (in this case only 5% of all observations have buy == 1), it may be difficult to create a decision tree.
data <- create_data_buy(obs = 2000, target1_prob = 0.05)
data %>% describe(buy)
#> variable = buy
#> type = integer
#> na = 0 of 2 000 (0%)
#> unique = 2
#> 0 = 1 899 (95%)
#> 1 = 101 (5.1%)
It may help to balance the target before growing the decision tree (or to use weights as an alternative). In this example we downsample the data so that buy == 1 makes up 10% of the observations.
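A sketch of this step, assuming balance_target() takes a min_prop argument for the minimum share of the rarer target level:
# downsample so that about 10% of observations have buy == 1
data_balanced <- data %>% balance_target(target = buy, min_prop = 0.10)
data_balanced %>% explain_tree(target = buy)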