⚠️ work-in-progress
This document demonstrates some basic uses of filtro. We’ll need to load a few packages:
library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)
The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable Sale_Price (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.
ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))
# ames |> str() # uncomment to see the structure of the data
To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the fit() method with the standard formula to compute the scores.
ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)
The data frame of results can be accessed via object@results.
ames_aov_pval_res@results
#> # A tibble: 73 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval 237. Sale_Price MS_SubClass
#> 2 aov_pval 130. Sale_Price MS_Zoning
#> 3 aov_pval NA Sale_Price Lot_Frontage
#> 4 aov_pval NA Sale_Price Lot_Area
#> 5 aov_pval 5.75 Sale_Price Street
#> 6 aov_pval 19.2 Sale_Price Alley
#> 7 aov_pval 71.3 Sale_Price Lot_Shape
#> 8 aov_pval 21.4 Sale_Price Land_Contour
#> 9 aov_pval 1.38 Sale_Price Utilities
#> 10 aov_pval 12.0 Sale_Price Lot_Config
#> # ℹ 63 more rows
A couple of notes here:
Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both of these cases:
The predictors are numeric and the outcome is categorical, or
The predictors are categorical and the outcome is numeric.
Because the outcome here is numeric, any predictor that is not a factor results in an NA. In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.
By default, this filter computes -log10(p_value), so that larger values indicate more important predictors. If users prefer raw p-values, a helper function dont_log_pvalues() is available.
For this specific filter, i.e., score_aov_*, case weights are supported. For other filters, you can check the property object@case_weights to see if they can use case weights.
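To make the -log10(p_value) scale concrete, here is a minimal base R sketch of a one-way ANOVA F-test for a single factor predictor and a numeric outcome. The data are simulated, and the names group, outcome, and score are illustrative only. This shows the general idea behind the score, not filtro's internal implementation.

```r
# Simulated data: one factor predictor with three levels, numeric outcome
set.seed(1)
n <- 90
group <- factor(rep(c("a", "b", "c"), each = n / 3))
level_means <- c(a = 0, b = 0.5, c = 1)
outcome <- rnorm(n, mean = level_means[as.character(group)])

# One-way ANOVA F-test of outcome on the factor
fit <- lm(outcome ~ group)
p_value <- anova(fit)[["Pr(>F)"]][1]

# Larger score means a smaller p-value, i.e. a more relevant predictor
score <- -log10(p_value)

# The raw p-value can always be recovered from the score
recovered <- 10^(-score)
```

A p-value of 0.001 corresponds to a score of 3, and a p-value that underflows to zero corresponds to a score of Inf.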
There are two main ways to rank and select a top proportion or number of features.
To filter or rank a single score, we can use built-in methods:
show_best_score_*()
rank_best_score_*()
For multi-parameter optimization, we can use API calls adapted from {desirability}:
show_best_desirability_*()
The show_best_score_prop() function returns the best score for a single metric. The prop_terms argument lets us control the proportion of predictors to keep.
# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#> name score outcome predictor
#> <chr> <dbl> <chr> <chr>
#> 1 aov_pval Inf Sale_Price Neighborhood
#> 2 aov_pval 288. Sale_Price Garage_Finish
#> 3 aov_pval 243. Sale_Price Garage_Type
#> 4 aov_pval 242. Sale_Price Foundation
#> 5 aov_pval 237. Sale_Price MS_SubClass
#> 6 aov_pval 183. Sale_Price Heating_QC
#> 7 aov_pval 173. Sale_Price BsmtFin_Type_1
#> 8 aov_pval 132. Sale_Price Mas_Vnr_Type
#> 9 aov_pval 130. Sale_Price Overall_Cond
#> 10 aov_pval 130. Sale_Price MS_Zoning
#> 11 aov_pval 127. Sale_Price Exterior_1st
#> 12 aov_pval 116. Sale_Price Exterior_2nd
#> 13 aov_pval 116. Sale_Price Bsmt_Exposure
#> 14 aov_pval 100. Sale_Price Garage_Cond
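The 14 rows returned above, out of 73 predictors, are consistent with keeping floor(0.2 * 73) = 14 terms. A minimal base R sketch of that selection rule follows, using made-up scores and a hypothetical helper named top_prop_sketch(). The rounding rule is an assumption inferred from the printed output, not taken from filtro's source.

```r
# Made-up scores for five predictors (one score is missing)
scores <- data.frame(
  predictor = c("x1", "x2", "x3", "x4", "x5"),
  score = c(2.5, NA, 10.1, 0.3, 7.7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the top proportion of rows, ranked by score
top_prop_sketch <- function(df, prop_terms) {
  # Sort from largest to smallest score, pushing NA to the end
  ranked <- df[order(df$score, decreasing = TRUE, na.last = TRUE), , drop = FALSE]
  # Assumed rounding rule: round down, but always keep at least one row
  n_keep <- max(1, floor(prop_terms * nrow(ranked)))
  ranked[seq_len(n_keep), , drop = FALSE]
}

top <- top_prop_sketch(scores, prop_terms = 0.5)
```

With five predictors and prop_terms = 0.5, the sketch keeps the two highest-scoring rows.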
To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula to compute the scores.
# ANOVA raw p-value
natrual_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natrual_res <-
  natrual_units |>
  fit(Sale_Price ~ ., data = ames)
# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)
# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)
# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)
Next, we create a list to collect these score class objects, including their associated metadata and scores.
# Create a list
class_score_list <- list(
  ames_aov_pval_natrual_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)
Then, we fill in the safe value specific to each method, and then remove the outcome column.
# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
#> # A tibble: 73 × 5
#> predictor aov_pval cor_pearson imp_rf infogain
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 MS_SubClass 1.68e-237 1 0.000450 0.266
#> 2 MS_Zoning 2.75e-130 1 0.000434 0.113
#> 3 Lot_Frontage 1.11e- 16 0.165 0.000192 0.146
#> 4 Lot_Area 1.11e- 16 0.255 0.000690 0.140
#> 5 Street 1.77e- 6 1 0.00000300 0.00365
#> 6 Alley 6.06e- 20 1 0.0000111 0.0254
#> 7 Lot_Shape 5.17e- 72 1 0.0000806 0.0675
#> 8 Land_Contour 3.79e- 22 1 0.0000582 0.0212
#> 9 Utilities 4.16e- 2 1 0 0.00165
#> 10 Lot_Config 1.04e- 12 1 0.0000112 0.0133
#> # ℹ 63 more rows
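In the output above, the NA scores have been replaced, for example cor_pearson is 1 for the factor predictors. The idea of a per-method safe value can be sketched in base R as below. The data frame and the values in fallback are made up for illustration; filtro stores its own value per score object in object@fallback_value.

```r
# Made-up wide table of scores with missing entries
scores <- data.frame(
  predictor = c("x1", "x2", "x3"),
  aov_pval = c(0.01, NA, 0.2),
  cor_pearson = c(NA, 0.4, NA),
  stringsAsFactors = FALSE
)

# Hypothetical safe values, one per score column, chosen so that a
# predictor with a missing score is retained rather than dropped
fallback <- c(aov_pval = 0, cor_pearson = 1)

# Replace NA in each score column with that column's safe value
for (col in names(fallback)) {
  missing <- is.na(scores[[col]])
  scores[[col]][missing] <- fallback[[col]]
}
```

After the loop, no score is missing, so every predictor can take part in the ranking steps that follow.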
Analogous to show_best_desirability(), the show_best_desirability_prop() function allows joint optimization of multiple metrics using desirability functions.
A desirability function maps values of a metric to a [0, 1] range where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain.
For example:
# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |>
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_max_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Street 1 1
#> 4 Alley 1 1
#> 5 Lot_Shape 1 1
#> 6 Land_Contour 1 1
#> 7 Utilities 1 1
#> 8 Lot_Config 1 1
#> 9 Land_Slope 1 1
#> 10 Neighborhood 1 1
#> # ℹ 63 more rows
# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_overall
#> <chr> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.834
#> 2 Year_Built 0.615 0.869 0.731
#> 3 Total_Bsmt_SF 0.626 0.554 0.588
#> 4 Garage_Cars 0.675 0.453 0.553
#> 5 Garage_Type 1 0.298 0.546
#> 6 First_Flr_SF 0.603 0.479 0.537
#> 7 Year_Remod_Add 0.586 0.452 0.515
#> 8 Garage_Area 0.651 0.395 0.507
#> 9 Foundation 1 0.184 0.428
#> 10 Full_Bath 0.577 0.272 0.396
#> # ℹ 63 more rows
# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.869 0.709 0.724
#> 3 Total_Bsmt_SF 0.626 0.554 0.625 0.600
#> 4 Garage_Cars 0.675 0.453 0.708 0.600
#> 5 Garage_Area 0.651 0.395 0.684 0.560
#> 6 First_Flr_SF 0.603 0.479 0.551 0.542
#> 7 Year_Remod_Add 0.586 0.452 0.514 0.515
#> 8 Garage_Type 1 0.298 0.453 0.513
#> 9 Neighborhood 1 0.119 1 0.491
#> 10 Foundation 1 0.184 0.454 0.437
#> # ℹ 63 more rows
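The .d_overall column in the outputs above is consistent with a geometric mean of the individual desirabilities. A minimal base R sketch follows, assuming a linear ramp between low and high for maximize(). The helpers d_max_sketch() and geo_mean_sketch() are hypothetical names used here for illustration, not functions from {desirability2}.

```r
# Hypothetical linear "larger is better" desirability:
# 0 at or below `low`, 1 at or above `high`, linear in between
d_max_sketch <- function(x, low, high) {
  pmin(pmax((x - low) / (high - low), 0), 1)
}

# Geometric mean of a vector of desirabilities
geo_mean_sketch <- function(d) {
  prod(d)^(1 / length(d))
}

# Desirabilities printed above for Gr_Liv_Area (rounded to 3 digits)
overall_two <- geo_mean_sketch(c(0.696, 1))
overall_three <- geo_mean_sketch(c(0.696, 1, 0.832))
```

Up to rounding of the printed inputs, overall_two and overall_three agree with the .d_overall values shown for Gr_Liv_Area (0.834 and 0.833). Because the geometric mean multiplies the terms, a desirability of 0 on any one metric drives the overall value to 0.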
In show_best_desirability_prop(), there is an argument called prop_terms that lets us control the proportion of predictors to keep.
# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#> predictor .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Gr_Liv_Area 0.696 1 0.832 0.833
#> 2 Year_Built 0.615 0.869 0.709 0.724
#> 3 Total_Bsmt_SF 0.626 0.554 0.625 0.600
#> 4 Garage_Cars 0.675 0.453 0.708 0.600
#> 5 Garage_Area 0.651 0.395 0.684 0.560
#> 6 First_Flr_SF 0.603 0.479 0.551 0.542
#> 7 Year_Remod_Add 0.586 0.452 0.514 0.515
#> 8 Garage_Type 1 0.298 0.453 0.513
#> 9 Neighborhood 1 0.119 1 0.491
#> 10 Foundation 1 0.184 0.454 0.437
#> 11 Full_Bath 0.577 0.272 0.527 0.435
#> 12 MS_SubClass 1 0.105 0.576 0.392
#> 13 Garage_Finish 1 0.0862 0.501 0.351
#> 14 Fireplaces 0.489 0.217 0.331 0.328
Besides maximize(), the additional verbs available are minimize(), target(), and constrain(). They are used in different situations:
maximize() when larger values are better.
minimize() when smaller values are better.
target() when a specific value of the metric is important.
constrain() when a range of values is equally desirable.
For example:
ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_min_aov_pval .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Alley 1 1
#> 4 Lot_Shape 1 1
#> 5 Land_Contour 1 1
#> 6 Neighborhood 1 1
#> 7 Condition_1 1 1
#> 8 Bldg_Type 1 1
#> 9 House_Style 1 1
#> 10 Overall_Cond 1 1
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_target_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 Lot_Area 1.00 1.00
#> 2 Second_Flr_SF 0.969 0.969
#> 3 Bsmt_Full_Bath 0.969 0.969
#> 4 Latitude 0.952 0.952
#> 5 Half_Bath 0.921 0.921
#> 6 Open_Porch_SF 0.899 0.899
#> 7 Wood_Deck_SF 0.879 0.879
#> 8 Mas_Vnr_Area 0.709 0.709
#> 9 Fireplaces 0.637 0.637
#> 10 TotRms_AbvGrd 0.632 0.632
#> # ℹ 63 more rows
ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#> predictor .d_box_cor_pearson .d_overall
#> <chr> <dbl> <dbl>
#> 1 MS_SubClass 1 1
#> 2 MS_Zoning 1 1
#> 3 Lot_Area 1 1
#> 4 Street 1 1
#> 5 Alley 1 1
#> 6 Lot_Shape 1 1
#> 7 Land_Contour 1 1
#> 8 Utilities 1 1
#> 9 Lot_Config 1 1
#> 10 Land_Slope 1 1
#> # ℹ 63 more rows
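The shapes behind minimize(), target(), and constrain() can be sketched in base R as below. The helpers d_min_sketch(), d_target_sketch(), and d_box_sketch() are hypothetical, and the piecewise-linear shapes are an assumption, although they are consistent with the values printed above. The check on Fireplaces further assumes that the 0.489 shown for it under maximize(cor_pearson, low = 0, high = 1) equals its stored correlation score.

```r
# "Smaller is better": 1 at or below `low`, 0 at or above `high`
d_min_sketch <- function(x, low, high) {
  pmin(pmax((high - x) / (high - low), 0), 1)
}

# "Closer to target is better": rises from `low` to `target`,
# then falls from `target` to `high`
d_target_sketch <- function(x, low, target, high) {
  rising <- (x - low) / (target - low)
  falling <- (high - x) / (high - target)
  pmin(pmax(ifelse(x <= target, rising, falling), 0), 1)
}

# "Anything in the range is equally good": 1 inside [low, high], 0 outside
d_box_sketch <- function(x, low, high) {
  as.numeric(x >= low & x <= high)
}

# Lot_Area has cor_pearson 0.255, which is exactly the target
d_lot_area <- d_target_sketch(0.255, low = 0.2, target = 0.255, high = 0.9)

# Fireplaces, assuming a correlation score of 0.489
d_fireplaces <- d_target_sketch(0.489, low = 0.2, target = 0.255, high = 0.9)
```

Under these assumptions, d_lot_area is 1 and d_fireplaces is close to the 0.637 printed for Fireplaces in the target() example.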
The score class objects included in the package are:
#> [1] "score_aov_fstat" "score_aov_pval"
#> [3] "score_cor_pearson" "score_cor_spearman"
#> [5] "score_gain_ratio" "score_imp_rf"
#> [7] "score_imp_rf_conditional" "score_imp_rf_oblique"
#> [9] "score_info_gain" "score_roc_auc"
#> [11] "score_sym_uncert" "score_xtab_pval_chisq"
#> [13] "score_xtab_pval_fisher"
The filter methods for a single score are:
#> [1] "show_best_score_cutoff" "show_best_score_dual" "show_best_score_num"
#> [4] "show_best_score_prop"
The filter methods for multiple scores are:
#> [1] "show_best_desirability_num" "show_best_desirability_prop"