
Introduction to filtro

⚠️ work-in-progress

This document demonstrates some basic uses of filtro. We’ll need to load a few packages:

library(filtro)
library(desirability2)
library(dplyr)
library(modeldata)

A scoring example

The {modeldata} package contains a data set used to predict housing sale price. It has 73 predictor columns and a numeric variable Sale_Price (the outcome). Since the outcome is right-skewed, we apply a log (base 10) transformation.

ames <- modeldata::ames
ames <- ames |>
  dplyr::mutate(Sale_Price = log10(Sale_Price))

# ames |> str() # uncomment to see the structure of the data
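To see why a log transformation helps with a right-skewed outcome, the following base R sketch compares a simple moment-based skewness before and after log10(). The prices vector and the skewness() helper are made up for illustration and are not part of {modeldata} or filtro.

```r
# Hypothetical sale prices (in thousands) with a long right tail
prices <- c(50, 80, 100, 120, 150, 200, 400, 755)

# Moment-based skewness: positive values indicate a right skew
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(prices)
skew_log <- skewness(log10(prices))

# The log-transformed values are noticeably less skewed
c(raw = skew_raw, log10 = skew_log)
```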

To apply the ANOVA F-test filter, we first create a score class object to define the scoring method, and then use the fit() method with the standard formula to compute the scores.

ames_aov_pval_res <-
  score_aov_pval |>
  fit(Sale_Price ~ ., data = ames)

The data frame of results can be accessed via object@results.

ames_aov_pval_res@results
#> # A tibble: 73 × 4
#>    name      score outcome    predictor   
#>    <chr>     <dbl> <chr>      <chr>       
#>  1 aov_pval 237.   Sale_Price MS_SubClass 
#>  2 aov_pval 130.   Sale_Price MS_Zoning   
#>  3 aov_pval  NA    Sale_Price Lot_Frontage
#>  4 aov_pval  NA    Sale_Price Lot_Area    
#>  5 aov_pval   5.75 Sale_Price Street      
#>  6 aov_pval  19.2  Sale_Price Alley       
#>  7 aov_pval  71.3  Sale_Price Lot_Shape   
#>  8 aov_pval  21.4  Sale_Price Land_Contour
#>  9 aov_pval   1.38 Sale_Price Utilities   
#> 10 aov_pval  12.0  Sale_Price Lot_Config  
#> # ℹ 63 more rows

A couple of notes here:

Since our focus is on feature relevance (rather than hypothesis testing), the ANOVA F-test filter handles both cases: a numeric outcome with factor predictors, and a factor outcome with numeric predictors.

Because the outcome is numeric, any predictor that is not a factor will result in an NA. In cases where an NA is produced, a safe value can be used to retain the predictor; it can be accessed via object@fallback_value.

By default, this filter computes -log10(p_value), so that larger values indicate more important predictors. If users prefer raw p-values, a helper function dont_log_pvalues() is available.

For this specific filter, i.e., score_aov_*, case weights are supported. For other filters, you can check the property object@case_weights to see if they can use case weights.
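As a rough illustration of a score on the -log10 scale, this base R sketch fits a one-way ANOVA with lm() and anova() on the built-in iris data and transforms the p-value. It is only a sketch of the idea; how filtro computes the score internally is not shown here and may differ.

```r
# One-way ANOVA: numeric outcome, factor predictor
fit <- lm(Sepal.Length ~ Species, data = iris)
p_value <- anova(fit)[["Pr(>F)"]][1]

# Smaller p-values become larger scores
score <- -log10(p_value)
score

# A p-value that underflows to zero gives an infinite score
-log10(0)
```

The last line may explain why a score of Inf can appear in the results: when a p-value is too small to be represented, it is stored as 0, and -log10(0) is Inf.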

Filtering and ranking

There are two main ways to rank and select a top proportion or number of features.

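To filter or rank a single score, we can use the built-in methods show_best_score_prop(), show_best_score_num(), show_best_score_cutoff(), and show_best_score_dual().

For multi-parameter optimization, we can use API calls adapted from {desirability2}: show_best_desirability_prop() and show_best_desirability_num().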

A filtering example for score singular

The show_best_score_prop() function returns the best-scoring predictors for a single metric. The prop_terms argument controls the proportion of predictors to keep.

# Show best score, based on proportion of predictors
ames_aov_pval_res |> show_best_score_prop(prop_terms = 0.2)
#> # A tibble: 14 × 4
#>    name     score outcome    predictor     
#>    <chr>    <dbl> <chr>      <chr>         
#>  1 aov_pval  Inf  Sale_Price Neighborhood  
#>  2 aov_pval  288. Sale_Price Garage_Finish 
#>  3 aov_pval  243. Sale_Price Garage_Type   
#>  4 aov_pval  242. Sale_Price Foundation    
#>  5 aov_pval  237. Sale_Price MS_SubClass   
#>  6 aov_pval  183. Sale_Price Heating_QC    
#>  7 aov_pval  173. Sale_Price BsmtFin_Type_1
#>  8 aov_pval  132. Sale_Price Mas_Vnr_Type  
#>  9 aov_pval  130. Sale_Price Overall_Cond  
#> 10 aov_pval  130. Sale_Price MS_Zoning     
#> 11 aov_pval  127. Sale_Price Exterior_1st  
#> 12 aov_pval  116. Sale_Price Exterior_2nd  
#> 13 aov_pval  116. Sale_Price Bsmt_Exposure 
#> 14 aov_pval  100. Sale_Price Garage_Cond
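The idea of keeping a top proportion can be sketched in base R. This is not filtro's implementation: the toy scores are invented, and rounding the count down is an assumption, chosen because it is consistent with 14 of 73 predictors being kept at prop_terms = 0.2.

```r
# Toy scores for seven hypothetical predictors (larger is better)
scores <- data.frame(
  predictor = c("a", "b", "c", "d", "e", "f", "g"),
  score = c(3.2, NA, 10.5, 0.4, 7.1, 1.1, 5.0)
)

prop_terms <- 0.5

# Number of predictors to keep, rounded down (assumption)
n_keep <- floor(nrow(scores) * prop_terms)

# Rank from best to worst, with missing scores placed last
ranked <- scores[order(scores$score, decreasing = TRUE, na.last = TRUE), ]
top <- head(ranked, n_keep)
top$predictor
```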

A filtering example for scores plural

To handle multiple scores, we first create multiple score class objects, and then use the fit() method with the standard formula to compute the scores.

# ANOVA raw p-value 
natural_units <- score_aov_pval |> dont_log_pvalues()
ames_aov_pval_natural_res <-
  natural_units |>
  fit(Sale_Price ~ ., data = ames)

# Pearson correlation
ames_cor_pearson_res <-
  score_cor_pearson |>
  fit(Sale_Price ~ ., data = ames)

# Forest importance
ames_imp_rf_reg_res <-
  score_imp_rf |>
  fit(Sale_Price ~ ., data = ames, seed = 42)

# Information gain
ames_info_gain_reg_res <-
  score_info_gain |>
  fit(Sale_Price ~ ., data = ames)

Next, we create a list to collect these score class objects, including their associated metadata and scores.

# Create a list
class_score_list <- list(
  ames_aov_pval_natural_res,
  ames_cor_pearson_res,
  ames_imp_rf_reg_res,
  ames_info_gain_reg_res
)

Then we fill in the safe value specific to each method and remove the outcome column.

# Fill safe values
ames_scores_results <- class_score_list |>
  fill_safe_values() |>
  # Remove outcome
  dplyr::select(-outcome)
ames_scores_results
#> # A tibble: 73 × 5
#>    predictor     aov_pval cor_pearson     imp_rf infogain
#>    <chr>            <dbl>       <dbl>      <dbl>    <dbl>
#>  1 MS_SubClass  1.68e-237       1     0.000450    0.266  
#>  2 MS_Zoning    2.75e-130       1     0.000434    0.113  
#>  3 Lot_Frontage 1.11e- 16       0.165 0.000192    0.146  
#>  4 Lot_Area     1.11e- 16       0.255 0.000690    0.140  
#>  5 Street       1.77e-  6       1     0.00000300  0.00365
#>  6 Alley        6.06e- 20       1     0.0000111   0.0254 
#>  7 Lot_Shape    5.17e- 72       1     0.0000806   0.0675 
#>  8 Land_Contour 3.79e- 22       1     0.0000582   0.0212 
#>  9 Utilities    4.16e-  2       1     0           0.00165
#> 10 Lot_Config   1.04e- 12       1     0.0000112   0.0133 
#> # ℹ 63 more rows
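In the output above, factor predictors such as MS_SubClass show a cor_pearson of 1, which is consistent with a fallback value replacing an NA. The replacement step itself can be sketched in base R. Here fill_fallback() is a hypothetical helper, not the filtro function, and the fallback of 1 is read off the printed output rather than taken from the package documentation.

```r
# Replace missing scores with a method-specific fallback value
fill_fallback <- function(score, fallback) {
  score[is.na(score)] <- fallback
  score
}

# Correlation scores; NA stands for a predictor the method could not score
cor_scores <- c(MS_SubClass = NA, Lot_Frontage = 0.165, Lot_Area = 0.255)
filled <- fill_fallback(cor_scores, fallback = 1)
filled
```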

Analogous to show_best_desirability(), the show_best_desirability_prop() function allows joint optimization of multiple metrics using desirability functions.

A desirability function maps values of a metric to a [0, 1] range where 1 is most desirable and 0 is unacceptable. When the verb maximize() is used, it means larger values are better. This is the case for Pearson correlation, forest importance, and information gain.
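A minimal sketch of such a mapping, assuming a linear ramp between low and high (desirability functions can also take other shapes, which this sketch does not cover). maximize_d() is a hypothetical name used for illustration and is not part of {desirability2}.

```r
# Larger is better: 0 at or below `low`, 1 at or above `high`,
# and a straight line in between
maximize_d <- function(x, low, high) {
  d <- (x - low) / (high - low)
  pmin(pmax(d, 0), 1)
}

d_vals <- maximize_d(c(-0.5, 0, 0.25, 1, 1.5), low = 0, high = 1)
d_vals
```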

For example:

# Optimize correlation alone
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1)
  ) |> 
  # Show predictor and desirability only
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_max_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Street                        1          1
#>  4 Alley                         1          1
#>  5 Lot_Shape                     1          1
#>  6 Land_Contour                  1          1
#>  7 Utilities                     1          1
#>  8 Lot_Config                    1          1
#>  9 Land_Slope                    1          1
#> 10 Neighborhood                  1          1
#> # ℹ 63 more rows

# Optimize correlation and forest importance
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 4
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_overall
#>    <chr>                       <dbl>         <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1          0.834
#>  2 Year_Built                  0.615         0.869      0.731
#>  3 Total_Bsmt_SF               0.626         0.554      0.588
#>  4 Garage_Cars                 0.675         0.453      0.553
#>  5 Garage_Type                 1             0.298      0.546
#>  6 First_Flr_SF                0.603         0.479      0.537
#>  7 Year_Remod_Add              0.586         0.452      0.515
#>  8 Garage_Area                 0.651         0.395      0.507
#>  9 Foundation                  1             0.184      0.428
#> 10 Full_Bath                   0.577         0.272      0.396
#> # ℹ 63 more rows

# Optimize correlation, forest importance and information gain
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696         1               0.832      0.833
#>  2 Year_Built                  0.615         0.869           0.709      0.724
#>  3 Total_Bsmt_SF               0.626         0.554           0.625      0.600
#>  4 Garage_Cars                 0.675         0.453           0.708      0.600
#>  5 Garage_Area                 0.651         0.395           0.684      0.560
#>  6 First_Flr_SF                0.603         0.479           0.551      0.542
#>  7 Year_Remod_Add              0.586         0.452           0.514      0.515
#>  8 Garage_Type                 1             0.298           0.453      0.513
#>  9 Neighborhood                1             0.119           1          0.491
#> 10 Foundation                  1             0.184           0.454      0.437
#> # ℹ 63 more rows
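The .d_overall column is consistent with a geometric mean of the individual desirabilities. The sketch below checks this against the first two rows of the correlation and forest importance output above. overall_d() is a hypothetical helper, and small differences can arise because the printed inputs are rounded.

```r
# Geometric mean across metrics, computed row by row
overall_d <- function(...) {
  d <- cbind(...)
  apply(d, 1, function(row) prod(row)^(1 / length(row)))
}

# Gr_Liv_Area and Year_Built, taken from the printed two-metric output
d_cor <- c(0.696, 0.615)
d_rf <- c(1, 0.869)

overall <- overall_d(d_cor, d_rf)
round(overall, 3)
```

Because the values are multiplied, a predictor with a desirability of 0 on any one metric gets an overall desirability of 0 under this scheme.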

In show_best_desirability_prop(), there is an argument called prop_terms that lets us control the proportion of predictors to keep.

# Same as above, but retain only a proportion of predictors
ames_scores_results |>
  show_best_desirability_prop(
    maximize(cor_pearson, low = 0, high = 1),
    maximize(imp_rf),
    maximize(infogain),
    prop_terms = 0.2
  ) |>
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 14 × 5
#>    predictor      .d_max_cor_pearson .d_max_imp_rf .d_max_infogain .d_overall
#>    <chr>                       <dbl>         <dbl>           <dbl>      <dbl>
#>  1 Gr_Liv_Area                 0.696        1                0.832      0.833
#>  2 Year_Built                  0.615        0.869            0.709      0.724
#>  3 Total_Bsmt_SF               0.626        0.554            0.625      0.600
#>  4 Garage_Cars                 0.675        0.453            0.708      0.600
#>  5 Garage_Area                 0.651        0.395            0.684      0.560
#>  6 First_Flr_SF                0.603        0.479            0.551      0.542
#>  7 Year_Remod_Add              0.586        0.452            0.514      0.515
#>  8 Garage_Type                 1            0.298            0.453      0.513
#>  9 Neighborhood                1            0.119            1          0.491
#> 10 Foundation                  1            0.184            0.454      0.437
#> 11 Full_Bath                   0.577        0.272            0.527      0.435
#> 12 MS_SubClass                 1            0.105            0.576      0.392
#> 13 Garage_Finish               1            0.0862           0.501      0.351
#> 14 Fireplaces                  0.489        0.217            0.331      0.328

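Besides maximize(), the verbs minimize(), target(), and constrain() are also available. They suit different situations: minimize() when smaller values are better (as with raw p-values), target() when a specific value is most desirable, and constrain() when any value within a range is equally acceptable.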

For example:

ames_scores_results |>
  show_best_desirability_prop(
    minimize(aov_pval, low = 0, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_min_aov_pval .d_overall
#>    <chr>                  <dbl>      <dbl>
#>  1 MS_SubClass                1          1
#>  2 MS_Zoning                  1          1
#>  3 Alley                      1          1
#>  4 Lot_Shape                  1          1
#>  5 Land_Contour               1          1
#>  6 Neighborhood               1          1
#>  7 Condition_1                1          1
#>  8 Bldg_Type                  1          1
#>  9 House_Style                1          1
#> 10 Overall_Cond               1          1
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    target(cor_pearson, low = 0.2, target = 0.255, high = 0.9)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor      .d_target_cor_pearson .d_overall
#>    <chr>                          <dbl>      <dbl>
#>  1 Lot_Area                       1.00       1.00 
#>  2 Second_Flr_SF                  0.969      0.969
#>  3 Bsmt_Full_Bath                 0.969      0.969
#>  4 Latitude                       0.952      0.952
#>  5 Half_Bath                      0.921      0.921
#>  6 Open_Porch_SF                  0.899      0.899
#>  7 Wood_Deck_SF                   0.879      0.879
#>  8 Mas_Vnr_Area                   0.709      0.709
#>  9 Fireplaces                     0.637      0.637
#> 10 TotRms_AbvGrd                  0.632      0.632
#> # ℹ 63 more rows

ames_scores_results |>
  show_best_desirability_prop(
    constrain(cor_pearson, low = 0.2, high = 1)
  ) |> 
  dplyr::select(predictor, starts_with(".d_"))
#> # A tibble: 73 × 3
#>    predictor    .d_box_cor_pearson .d_overall
#>    <chr>                     <dbl>      <dbl>
#>  1 MS_SubClass                   1          1
#>  2 MS_Zoning                     1          1
#>  3 Lot_Area                      1          1
#>  4 Street                        1          1
#>  5 Alley                         1          1
#>  6 Lot_Shape                     1          1
#>  7 Land_Contour                  1          1
#>  8 Utilities                     1          1
#>  9 Lot_Config                    1          1
#> 10 Land_Slope                    1          1
#> # ℹ 63 more rows
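The shapes behind these verbs can be sketched in base R, again assuming linear ramps. The *_d() helpers are hypothetical names for illustration, not functions from {desirability2}; the box shape for constrain() is suggested by the .d_box_ prefix in the output above.

```r
# Smaller is better: 1 at or below `low`, 0 at or above `high`
minimize_d <- function(x, low, high) {
  pmin(pmax((high - x) / (high - low), 0), 1)
}

# A specific value is best: rises from `low` to `target`,
# then falls from `target` to `high`
target_d <- function(x, low, target, high) {
  rising <- (x - low) / (target - low)
  falling <- (high - x) / (high - target)
  pmin(pmax(pmin(rising, falling), 0), 1)
}

# Box constraint: 1 inside [low, high], 0 outside
constrain_d <- function(x, low, high) {
  as.numeric(x >= low & x <= high)
}

d_min <- minimize_d(c(0, 0.25, 1), low = 0, high = 1)
d_tgt <- target_d(c(0, 1, 2, 3, 5), low = 0, target = 2, high = 4)
d_box <- constrain_d(c(0.1, 0.2, 0.5, 1), low = 0.2, high = 1)
```

Under these assumptions, a value equal to the target receives a desirability of 1, and a value outside [low, high] receives 0 for both target_d() and constrain_d().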

Available score objects and filter methods

The score class objects included are:

#>  [1] "score_aov_fstat"          "score_aov_pval"          
#>  [3] "score_cor_pearson"        "score_cor_spearman"      
#>  [5] "score_gain_ratio"         "score_imp_rf"            
#>  [7] "score_imp_rf_conditional" "score_imp_rf_oblique"    
#>  [9] "score_info_gain"          "score_roc_auc"           
#> [11] "score_sym_uncert"         "score_xtab_pval_chisq"   
#> [13] "score_xtab_pval_fisher"

The list of filter methods for score singular:

#> [1] "show_best_score_cutoff" "show_best_score_dual"   "show_best_score_num"   
#> [4] "show_best_score_prop"

The list of filter methods for scores plural:

#> [1] "show_best_desirability_num"  "show_best_desirability_prop"
