⚠️ work-in-progress
We’ll need to load a few packages:
library(filtro)
library(dplyr)
library(modeldata)
Predictor importance can be assessed using three different random forest models. They can be accessed via the following score class objects:
score_imp_rf
score_imp_rf_conditional
score_imp_rf_oblique
These models are powered by the following packages:
#> [1] "ranger"
#> [1] "partykit"
#> [1] "aorsf"
Regarding score types:
The {ranger} random forest computes the importance scores.
The {partykit} conditional random forest computes the conditional importance scores.
The {aorsf} oblique random forest computes the permutation importance scores.
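To make the idea of an importance score concrete, here is a minimal sketch that calls {ranger} directly on the built-in iris data. It is not how {filtro} is implemented; the choice of importance = "permutation" is an assumption made for illustration, as is the use of iris instead of the cells data.

```r
library(ranger)

# Fit a small classification forest on a built-in data set.
# importance = "permutation" is an assumption for illustration only.
rf <- ranger(
  Species ~ .,
  data = iris,
  num.trees = 100,
  importance = "permutation",
  seed = 42
)

# Named numeric vector: one importance score per predictor
imp <- rf$variable.importance
sort(imp, decreasing = TRUE)
```

The score class objects wrap this kind of call and return the scores in a tidy data frame, one row per predictor.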
The {modeldata} package contains a data set used to predict which cells in a high content screen were well segmented. It has 57 predictor columns and a factor variable class (the outcome).
Since case is only used to indicate Train/Test, not for data analysis, it will be set to NULL. Furthermore, for efficiency, we will use a small sample of 50 from the original 2019 observations.
cells_subset <- modeldata::cells |>
  # Use a small example for efficiency
  dplyr::slice(1:50)
cells_subset$case <- NULL

# cells_subset |> str() # Uncomment to see the structure of the data
First, we create a score class object to specify a {ranger} random forest, and then use the fit() method with the standard formula to compute the importance scores.
# Specify random forest and fit score
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    seed = 42
  )
The data frame of results can be accessed via object@results.
cells_imp_rf_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf 0.000967 class angle_ch_1
#> 2 imp_rf -0.0000620 class area_ch_1
#> 3 imp_rf 0.00438 class avg_inten_ch_1
#> 4 imp_rf 0.00916 class avg_inten_ch_2
#> 5 imp_rf -0.000426 class avg_inten_ch_3
#> 6 imp_rf -0.000296 class avg_inten_ch_4
#> 7 imp_rf 0.00836 class convex_hull_area_ratio_ch_1
#> 8 imp_rf 0.00133 class convex_hull_perim_ratio_ch_1
#> 9 imp_rf 0.000739 class diff_inten_density_ch_1
#> 10 imp_rf -0.00128 class diff_inten_density_ch_3
#> # ℹ 46 more rows
A couple of notes here:
The random forest filter, including all three types of random forests, supports both regression and classification tasks.
In cases where NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.
Larger values indicate more important predictors.
For this specific filter, i.e., score_imp_rf_*, case weights are supported.
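Because larger values indicate more important predictors, a results data frame can be ranked by sorting on score. The sketch below rebuilds a few rows from the output shown above as a stand-in for cells_imp_rf_res@results, so it runs without fitting a model; the names res and top are illustrative only.

```r
library(dplyr)

# Stand-in for cells_imp_rf_res@results, using five rows from the output above
res <- tibble::tibble(
  name = "imp_rf",
  score = c(0.000967, -0.0000620, 0.00438, 0.00916, -0.000426),
  outcome = "class",
  predictor = c(
    "angle_ch_1", "area_ch_1", "avg_inten_ch_1",
    "avg_inten_ch_2", "avg_inten_ch_3"
  )
)

# Rank predictors from most to least important and keep the top three
top <- res |>
  arrange(desc(score)) |>
  slice_head(n = 3)

top$predictor
```

With the real results, the same pipeline can be applied directly to cells_imp_rf_res@results.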
Like {parsnip}, the argument names are harmonized. For example, the arguments to set the number of trees: num.trees in {ranger}, ntree in {partykit}, and n_tree in {aorsf} are all standardized to a single name, trees, so users only need to remember a single name.
The same applies to the number of variables to split at each node, mtry, and the minimum node size for splitting, min_n.
# Set hyperparameters
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1
  )
However, there is one argument name specific to {ranger}. For reproducibility, instead of calling set.seed() beforehand, we pass the seed argument.
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    trees = 100,
    mtry = 2,
    min_n = 1,
    seed = 42 # Set seed for reproducibility
  )
If users use {ranger} argument names, intentionally or not, it still works. We have handled the necessary adjustments. The following code chunk can be used to obtain a fitted score:
cells_imp_rf_res <- score_imp_rf |>
  fit(
    class ~ .,
    data = cells_subset,
    num.trees = 100,
    mtry = 2,
    min.node.size = 1,
    seed = 42
  )
The same applies to {partykit}- and {aorsf}-specific arguments.
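One way to picture the harmonization is as a rename step that maps engine-specific argument names onto the standard ones. The helper below, harmonize_args(), is hypothetical and not part of {filtro}; it only covers the names mentioned in this document and is meant as a sketch of the idea, not the package's actual mechanism.

```r
# Hypothetical helper, NOT part of {filtro}: rename engine-specific
# arguments to the harmonized names used in this document.
harmonize_args <- function(args) {
  # engine-specific name = harmonized name
  aliases <- c(
    num.trees = "trees",     # {ranger}
    ntree = "trees",         # {partykit}
    n_tree = "trees",        # {aorsf}
    min.node.size = "min_n"  # {ranger}
  )
  nms <- names(args)
  hit <- nms %in% names(aliases)
  nms[hit] <- unname(aliases[nms[hit]])
  names(args) <- nms
  args
}

harmonized <- harmonize_args(list(num.trees = 100, mtry = 2, min.node.size = 1))
names(harmonized)
```

Arguments that already use a harmonized name, such as mtry, pass through unchanged.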
For the {partykit} conditional random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.
The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify conditional random forest and fit score
cells_imp_rf_conditional_res <- score_imp_rf_conditional |>
  fit(class ~ ., data = cells_subset, trees = 100)
cells_imp_rf_conditional_res@results
#> # A tibble: 40 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf_conditional -0.0306 class angle_ch_1
#> 2 imp_rf_conditional 0.178 class area_ch_1
#> 3 imp_rf_conditional 0.158 class avg_inten_ch_1
#> 4 imp_rf_conditional 0.132 class avg_inten_ch_2
#> 5 imp_rf_conditional 0.0927 class convex_hull_area_ratio_ch_1
#> 6 imp_rf_conditional 0.963 class convex_hull_perim_ratio_ch_1
#> 7 imp_rf_conditional -0.0842 class diff_inten_density_ch_1
#> 8 imp_rf_conditional 0.0688 class diff_inten_density_ch_3
#> 9 imp_rf_conditional 0.147 class entropy_inten_ch_1
#> 10 imp_rf_conditional 0.00105 class entropy_inten_ch_3
#> # ℹ 30 more rows
Note that when a predictor’s importance score is 0, partykit::cforest() may exclude its name from the output. In such cases, a score of 0 is assigned to the missing predictors.
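The zero-filling described above can be sketched in base R as follows. The helper fill_missing_scores() is hypothetical and not part of {filtro} or {partykit}; the two reported scores are taken from the output above, and treating avg_inten_ch_3 as a dropped predictor is an assumption made for illustration.

```r
# Hypothetical helper, NOT part of {filtro}: give every expected predictor
# a score, using `fill` for any predictor the engine left out.
fill_missing_scores <- function(scores, predictors, fill = 0) {
  out <- setNames(rep(fill, length(predictors)), predictors)
  keep <- intersect(names(scores), predictors)
  out[keep] <- scores[keep]
  out
}

# Scores the engine reported (values from the output above)
reported <- c(angle_ch_1 = -0.0306, area_ch_1 = 0.178)

# Full set of predictors; avg_inten_ch_3 is assumed to be missing from the output
all_predictors <- c("angle_ch_1", "area_ch_1", "avg_inten_ch_3")

filled <- fill_missing_scores(reported, all_predictors)
filled
```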
For the {aorsf} oblique random forest, we again create a score class object to specify the model, then use the fit() method to compute the importance scores.
The data frame of results can be accessed via object@results.
# Set seed for reproducibility
set.seed(42)
# Specify oblique random forest and fit score
cells_imp_rf_oblique_res <- score_imp_rf_oblique |>
  fit(class ~ ., data = cells_subset, trees = 100, mtry = 2)
cells_imp_rf_oblique_res@results
#> # A tibble: 56 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 imp_rf_oblique 0.0165 class fiber_width_ch_1
#> 2 imp_rf_oblique 0.0121 class inten_cooc_contrast_ch_3
#> 3 imp_rf_oblique 0.0109 class inten_cooc_max_ch_3
#> 4 imp_rf_oblique 0.00850 class shape_p_2_a_ch_1
#> 5 imp_rf_oblique 0.00777 class entropy_inten_ch_1
#> 6 imp_rf_oblique 0.00725 class eq_ellipse_lwr_ch_1
#> 7 imp_rf_oblique 0.00589 class inten_cooc_asm_ch_3
#> 8 imp_rf_oblique 0.00543 class diff_inten_density_ch_1
#> 9 imp_rf_oblique 0.00513 class shape_lwr_ch_1
#> 10 imp_rf_oblique 0.00506 class fiber_length_ch_1
#> # ℹ 46 more rows
The list of score class objects for random forests, their corresponding engines, and supported tasks:
object | engine | task |
---|---|---|
score_imp_rf | ranger::ranger | regression, classification |
score_imp_rf_conditional | partykit::cforest | regression, classification |
score_imp_rf_oblique | aorsf::orsf | regression, classification |