The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

shadowVIMP-vignette

Introduction to `shadowVIMP` package

Quick start

The purpose of shadowVIMP package is to provide a tool for reducing the number of covariates in an analysis in an informed and statistically rigorous manner. It implements a method that performs statistical tests on the Variable Importance Measures (VIMP) derived from the Random Forest (RF) algorithm to determine whether each covariate is statistically significant and truly informative.

In this section, we demonstrate a basic use case of the shadowVIMP package. First, let’s install the package:

install.packages("shadowVIMP")

Then, we load necessary packages:

library(shadowVIMP)
library(magrittr)
library(dplyr)
library(ggplot2)

In the examples below, we use the well-known toy dataset mtcars. Our variable of interest is the vs covariate, which indicates whether the car’s engine is V-shaped or straight. To speed up vignette building, we set niters parameter lower than recommended. For real-world analyses, please use the default or a higher value - otherwise, the results will not be reliable. The example that follows demonstrates our package’s basic usage. By default, its functions display informative covariates based on p-values computed by using the “pooled” approach and report all p-value types: unadjusted, FWER-adjusted, and FDR-adjusted.

data("mtcars")
set.seed(786)

is_interactive <- interactive()
if (is_interactive) {
  # When interactive - set num_cores to a selected value
  global_num_threads <- min(parallel::detectCores()/2, 6)
} else{
  # Value of num.threads parameter for CRAN 
  global_num_threads <- 1
}

# WARNING 1: The specified values of the niters parameter are too small! To get reliable results, use the default or higher values of the niters parameter.
# WARNING 2: To avoid potential issues with parallel computing on CRAN, we set num.threads to 1, by default it is set to half of the available CPU cores, which speeds up computation.
vimp_def <- shadow_vimp(data = mtcars, outcome_var = "vs", niters = c(20, 30, 50),
                        num.threads = global_num_threads)
#> alpha 0.3
#> 2025-06-18 13:55:38: dataframe = mtcars niters = 20 num.trees = 10000. Running step 1
#> Variables remaining: 10
#> alpha 0.1
#> 2025-06-18 13:55:47: dataframe = mtcars niters = 30 num.trees = 10000. Running step 1
#> Variables remaining: 9
#> alpha 0.05
#> 2025-06-18 13:55:56: dataframe = mtcars niters = 50 num.trees = 10000. Running step 1
#> 2025-06-18 13:56:11: dataframe = mtcars niters = 50 num.trees = 10000. Running step 50
#> Variables remaining: 8
vimp_def
#> shadowVIMP Result
#> 
#> Call:
#>  shadow_vimp(niters = c(20, 30, 50), data = mtcars, outcome_var = "vs", num.threads = global_num_threads) 
#> 
#>    Step Alpha Retained Covariates
#>  Step 1  0.30                  10
#>  Step 2  0.10                   9
#>  Step 3  0.05                   8
#> 
#> Count of significant covariates from step 3 per p-value correction method using the pooled approach
#>  Type-1 Confirmed FDR Confirmed FWER Confirmed
#>                 8             8              7

By default, printing the results displays:

the call formula,
the significance level used at each step of the procedure,
the number of covariates retained at each step (column “Retained Covariates”)
the number of significant covariates identified at the final step according to unadjusted, FDR-adjusted, and FWER-adjusted p-values (columns “Type-1 Confirmed”, “FDR Confirmed” and “FWER Confirmed”, respectively)

To examine the exact p-values and corresponding decisions in more detail, inspect the final_dec_pooled element of the output (or final_dec_per_variable if method = "per_variable").

vimp_def$final_dec_pooled %>%
  head()
#>   varname     p_unadj   p_adj_FDR p_adj_FWER Type1_confirmed FDR_confirmed
#> 1     mpg 0.002217295 0.003695492 0.02217295               1             1
#> 2     cyl 0.002217295 0.003695492 0.02217295               1             1
#> 3    disp 0.002217295 0.003695492 0.02217295               1             1
#> 4      hp 0.002217295 0.003695492 0.02217295               1             1
#> 5    qsec 0.002217295 0.003695492 0.02217295               1             1
#> 6    carb 0.002217295 0.003695492 0.02217295               1             1
#>   FWER_confirmed
#> 1              1
#> 2              1
#> 3              1
#> 4              1
#> 5              1
#> 6              1

You can visualise your results by running the following code:

plot_vimps(shadow_vimp_out = vimp_def)

Theoretical framework

Motivation

When working with high-dimensional data, it is often desirable to reduce the number of covariates under consideration. This objective is especially important when determining which variables contribute most to the accuracy of the model, and therefore best explain the outcome of interest. There are several established methods of variable selection. The shadowVIMP package introduces a new variable selection method based on variable importance (VIMP) from the random forest (RF) algorithm. In short, VIMP in a random forest quantifies the contribution of each predictor variable to the accuracy of the model. It helps to identify which variables have the greatest influence on the modelled outcome. VIMPs only allow us to rank covariates by their relative importance, but the absolute magnitude of a VIMP is context dependent, i.e. there is no universal cut-off for identifying truly important variables. Therefore, a simple approach of selecting covariates with a VIMP above a certain threshold is not appropriate as it could lead to arbitrary choices. Another possible approach is to select the top n predictors (e.g. top 5, 10 or 20), but this approach risks missing important variables or selecting false positives, especially if few or none of the covariates are related to the outcome. Therefore, a more reliable approach is needed to ensure that observed VIMPs are not simply due to chance. This is where the shadowVIMP package comes in. shadowVIMP provides a statistical framework for assessing whether a given VIMP is large enough to be considered statistically significant.

Big picture

We aim to perform a statistical test to determine whether the variable importance measure (VIMP) of a given covariate is statistically significant at a specified significance level. However, since the null distribution of the VIMP is unknown, this poses a challenge. The shadowVIMP package provides the following solution to address this issue:

Approximate null distribution of the VIMPs for each variable: For each predictor, we create an r-shadow variable by permuting (shuffling) the rows of the original data, effectively breaking any real association with the outcome. This permutation process is repeated niters times, with the VIMP recalculated after each permutation. After performing niters permutations, the resulting VIMPs for the r-shadow variable form an approximate distribution of that variable under the null hypothesis, i.e. assuming that the predictor has no true association with the outcome.
Calculate p-values: Determine the proportion of iterations in which the VIMP of the randomly permuted variable exceeds the median VIMP of the original variable.
Compare p-values: Compare the calculated p-values to the pre-specified significance level to make the final decision.

Details

The procedure implemented in the shadowVIMP package can be described as follows:

For each of the p covariates in the dataset, create p r-shadow variables by permuting rows of copied original covariates, thereby removing any association with the outcome (see the figure below). These uninformative r-shadow variables serve as a reference for evaluating the importance of the original predictors. Repeat this permutation process niters times, generating new r-shadow variables in each iteration.

Permutation scheme

For each of the niters permutations, fit the random forest model and calculate the VIMPs for both the original and corresponding r-shadow covariates. After niters iterations, the algorithm produces a data frame with niters rows and 2p columns, containing VIMPs for each of the original and r-shadow variables. Permuting the data removes any existing relationship between predictors and the outcome. The VIMPs computed from the r-shadow covariates across many permutations form an approximate null distribution for each original predictor, representing the scenario in which the predictor has no true association with the outcome. By comparing the observed VIMP to this null distribution, you estimate how likely it is that the observed importance occurred purely by chance.
Calculate and decide whether a covariate j is significant based on its non-parametric p-value. The p-value for variable j is defined as follows:

\[p_{j} = 1 - \widehat{F}_{ \left\{VI_{j}^{(s)}, \infty \right\}}(median(VI_{j}))\]

where:

$\widehat{F}_{\{VI_{j}^{(s)}, \infty\}}(\cdot)$ is an empirical cumulative distribution function (ECDF) based on the r-shadow VIMPs of variable j with an infinity added. Adding infinity to the ECDF introduces a small offset to prevent p-values being exactly 0.
$median(VI_{j})$ is the median of the VIMPs of the original variables over all nsin iterations of the random forest.

In the remainder of this tutorial, we will refer to the p-values obtained using the above-presented per-variable approach simply as per-variable p-value. Such a defined p-value represents a proportion of iterations in which the r-shadow variable achieved a higher VIMP than the median VIMP of the original variable. The covariate $x_{j}$ is considered informative if, for a given significance level $\alpha$, the following inequality holds:

\[p_{j} \le \alpha\]

Pooled p-value

When performing numerous statistical tests, the application of multiple testing adjustment method is necessary. This requires calculating very small p-values. For example, with $5,000$ predictors in the data set and a target significance level of $0.05$, when using the Benjamini-Hochberg (BH) adjustment, the first significance threshold is $0.05/5000 = 0.00001$. This means that to reject the null hypothesis (a covariate is uninformative) for the variable with the smallest p-value, its p-value must be $\leq 0.00001$. Given the present definition of p-value, the smallest achievable p-value is $\frac{1}{n_{sim}}$. Therefore, to achieve a p-value lower than the significance threshold, we would need $\frac{5000}{0.05} = 100000$ iterations of the random forest. Performing such a large number of iterations would be computationally expensive.

To reduce computational cost, we introduce the p-values obtained using the pooled approach, which we call pooled p-value for short, as an alternative decision criterion. Pooling increases the effective sample size of the ECDF by incorporating VIMP values of all r-shadow variables instead of just one. Before pooling, it is essential to standardise the r-shadow VIMPs to ensure comparable distributions. The pooled p-value for covariate j is then defined as follows:

\[ p_j^{(\text{pooled})} = 1 - \widehat{F}_{ \left\{ \widetilde{VI}_{1}^{(s)}, \dots, \widetilde{VI}_{j}^{(s)}, \dots, \widetilde{VI}_{p}^{(s)}, \infty \right\} } \left( \text{median} \left( \widetilde{VI}_{j} \right) \right) \]

where $\widetilde{VI}_{j}^{(s)}$ is a set of standardised r-shadow VIMPs and $\widetilde{VI}_{j}$ is the standardised VIMP of the original covariate $x_{j}$. Similar to per-variable p-values, when using the alternative decision criterion, the covariate j is considered informative if:

\[ p_j^{(\text{pooled})} \leq \alpha \] The main advantage of pooled p-values is the reduced run time. Using the same example with $5000$ predictors as before, the smallest achievable p-value is $\frac{1}{n_{sim} \times 5000}$. The pooled decision criterion decreases the number of required iterations by a factor of 5,000 (or, more generally, by the number of covariates included in the data), making it possible (though not guaranteed) to reject at least one hypothesis. For analyses involving multiple testing adjustment, the authors of the paper behind this package strongly recommend adopting the pooled decision criterion.

Pre-selection

The method described above is works well for low-dimensional data, but high-dimensional settings require modifications. The first improvement we need for a high-dimensional setting is the pre-selection of covariates to eliminate uninformative variables early on. This involves reducing the set of predictors in one or more preliminary steps using a less strict significance level ($\alpha$) than the final target (e.g., $\alpha = 0.3$ for pre-selection when the final $\alpha = 0.05$). The figure below illustrates the pre-selection procedure implemented in the shadowVIMP package.

Illustration of pre-selection procedure

Pre-selection process:

Initial pre-selection: Run the algorithm with $\alpha = 0.3$ (possibly with fewer iterations, here $30$) to identify variables with p-values $ $.
Second step of pre-selection (optional): Further reduce the set of variables by applying a stricter threshold (e.g., $\alpha = 0.1$) using more RF iterations (here $120$).
Final selection: Only variables that pass all the pre-selection steps proceed to the final step of the algorithm, ensuring that they have the potential to meet the target significance level of $0.05$.

The current implementation of the shadowVIMP package uses pooled p-values in the pre-selection steps. In the final step of the procedure, the user can decide whether to report per-variable or pooled p-values along with the decisions based on them. More on this topic in the following practical guide to the shadowVIMP package.

Addressing multiple testing issue

The defined decision criterion, namely the per-variable p-value, directly compares the VIMPs of the original variable with with the VIMPs of its corresponding r-shadow variable. In practice, especially when analysing high-dimensional data, multiple testing should be addressed by appropriate p-value adjustment method. The shadowVIMP package provides two adjustment procedures: Benjamini-Hochberg (BH), which controls for the expected false discovery rate (FDR), and the Holm procedure, which controls for the family-wise error rate (FWER)

Hands-on insights: mastering `shadowVIMP`

Select output components

The shadow_vimp() function has a save_vimp_history argument, allowing users to specify what is stored in the output object. By default, it is set to "all", which saves variable importance measures from every step of the procedure. By using the previously created object vimp_def, let’s inspect the VIMPs from the first and the second steps of the pre-selection phase:

# Reminder - definition of vimp_def object:
# set.seed(786)
# vimp_def <- shadow_vimp(data = mtcars, outcome_var = "vs", niters = c(20, 30, 50),
#                         num.threads = global_num_threads)

# VIMP history from the 1st step for 5 covariates:
vimp_def$pre_selection$step_1$vimp_history %>%
  select(1:5) %>%
  head()
#>        mpg      cyl     disp       hp      drat
#> 1 32.74017 37.96075 35.00597 43.10855 10.283379
#> 2 33.24061 38.51238 34.36089 43.90340 12.509940
#> 3 34.12765 38.22569 35.61749 44.00873  8.611728
#> 4 33.20399 38.21009 37.57173 44.69625 12.133609
#> 5 32.45849 38.23485 36.95873 43.53657 12.883142
#> 6 32.10669 37.08169 34.46416 45.28319 11.879613

# VIMP history from the 2nd step for 5 covariates:
vimp_def$pre_selection$step_2$vimp_history %>%
  select(1:5) %>%
  head()
#>        mpg      cyl     disp       hp       wt
#> 1 31.49757 38.04965 34.70466 42.11314 18.67261
#> 2 34.43116 38.58055 35.93857 45.05841 19.06039
#> 3 32.65648 38.00637 35.59749 43.65969 16.76149
#> 4 32.06178 39.56347 37.36726 45.54855 22.57456
#> 5 32.19292 38.51876 34.49922 43.26804 18.05470
#> 6 33.97004 38.24063 35.85364 44.73257 17.62037

To check the VIMP values from the last, final step run:

vimp_def$vimp_history %>%
  select(1:5) %>%
  head()
#>        mpg      cyl     disp       hp     qsec
#> 1 33.79486 37.43572 34.88228 44.22286 66.37090
#> 2 31.90700 38.12489 35.78170 43.90521 65.83643
#> 3 31.99137 37.58339 34.64763 44.79829 68.52528
#> 4 32.50323 38.34340 35.21631 44.35888 67.76009
#> 5 33.18473 37.64256 36.43404 46.44419 69.00096
#> 6 31.78270 38.14715 35.10925 44.66282 67.31616

Alternatively, by setting save_vimp_history to "last", only the variable importance measures from the final step of the procedure are saved. Unlike the default setting, this option does not retain VIMPs obtained during the pre-selection phase. Finally, you can completely disable saving of variable importance measures by setting save_vimp_history to "none".

You can also specify which p-value adjustment should be applied to your results. To do that, select the appropriate value for the to_show parameter. There are three available options:

"FWER": All three types of p-values (FWER, FDR, and unadjusted) will be displayed along with the corresponding decisions.
"FDR": Only FDR and unadjusted p-values, along with their decisions, will be displayed.
"unadjusted": The raw, unadjusted p-values will be shown together with the decisions if the covariate is informative.

By default, the shadow_vimp() function chooses the “FWER” option, displaying all three decisions. You can change it to “FDR” or “unadjusted” as follows:

# Show FDR and unadjusted p-values
vimp_fdr <- shadow_vimp(data = mtcars, outcome_var = "vs",
                        to_show = "FDR", niters = c(20, 30, 50),
                        num.threads = global_num_threads)
#> alpha 0.3
#> 2025-06-18 13:56:15: dataframe = mtcars niters = 20 num.trees = 10000. Running step 1
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `FDR` you will not be
#> able to use the `plot_vimps()` function.
#> Variables remaining: 10
#> alpha 0.1
#> 2025-06-18 13:56:21: dataframe = mtcars niters = 30 num.trees = 10000. Running step 1
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `FDR` you will not be
#> able to use the `plot_vimps()` function.
#> Variables remaining: 9
#> alpha 0.05
#> 2025-06-18 13:56:30: dataframe = mtcars niters = 50 num.trees = 10000. Running step 1
#> 2025-06-18 13:56:44: dataframe = mtcars niters = 50 num.trees = 10000. Running step 50
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `FDR` you will not be
#> able to use the `plot_vimps()` function.
#> Variables remaining: 7
vimp_fdr
#> shadowVIMP Result
#> 
#> Call:
#>  shadow_vimp(niters = c(20, 30, 50), data = mtcars, outcome_var = "vs", num.threads = global_num_threads, to_show = "FDR") 
#> 
#>    Step Alpha Retained Covariates
#>  Step 1  0.30                  10
#>  Step 2  0.10                   9
#>  Step 3  0.05                   7
#> 
#> Count of significant covariates from step 3 per p-value correction method using the pooled approach
#>  Type-1 Confirmed FDR Confirmed
#>                 7             7

# Show only unadjusted p-values
vimp_unadjusted <- shadow_vimp(data = mtcars, outcome_var = "vs",
                               to_show = "unadjusted", niters = c(20, 30, 50),
                               num.threads = global_num_threads)
#> alpha 0.3
#> 2025-06-18 13:56:44: dataframe = mtcars niters = 20 num.trees = 10000. Running step 1
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `unadjusted` you will
#> not be able to use the `plot_vimps()` function.
#> Variables remaining: 10
#> alpha 0.1
#> 2025-06-18 13:56:51: dataframe = mtcars niters = 30 num.trees = 10000. Running step 1
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `unadjusted` you will
#> not be able to use the `plot_vimps()` function.
#> Variables remaining: 10
#> alpha 0.05
#> 2025-06-18 13:57:00: dataframe = mtcars niters = 50 num.trees = 10000. Running step 1
#> 2025-06-18 13:57:15: dataframe = mtcars niters = 50 num.trees = 10000. Running step 50
#> Warning in add_test_results(vimpermsim, alpha = alphas[j], init_num_vars =
#> init_num_vars, : By setting the `to_show` parameter to `unadjusted` you will
#> not be able to use the `plot_vimps()` function.
#> Variables remaining: 8
vimp_unadjusted
#> shadowVIMP Result
#> 
#> Call:
#>  shadow_vimp(niters = c(20, 30, 50), data = mtcars, outcome_var = "vs", num.threads = global_num_threads, to_show = "unadjusted") 
#> 
#>    Step Alpha Retained Covariates
#>  Step 1  0.30                  10
#>  Step 2  0.10                  10
#>  Step 3  0.05                   8
#> 
#> Count of significant covariates from step 3 per p-value correction method using the pooled approach
#>  Type-1 Confirmed
#>                 8

Choose decision criteria for the final decision

By default, the decision on whether a covariate is informative is based on the pooled p-value. However, per-variable p-values can also be used to select informative covariates in the final step of the procedure. In the output below, the informativeness of covariates is determined based on per-variable p-values:

vimp_per_variable <- shadow_vimp(data = mtcars, outcome_var = "vs",
                                 method = "per_variable", niters = c(20, 30, 50),
                                 num.threads = global_num_threads)
#> alpha 0.3
#> 2025-06-18 13:57:16: dataframe = mtcars niters = 20 num.trees = 10000. Running step 1
#> Variables remaining: 10
#> alpha 0.1
#> 2025-06-18 13:57:22: dataframe = mtcars niters = 30 num.trees = 10000. Running step 1
#> Variables remaining: 9
#> alpha 0.05
#> Warning in shadow_vimp(data = mtcars, outcome_var = "vs", method = "per_variable", : Not enough iterations for any positives after FDR/FWER adjustment.
#>  Increase the number of iterations in the final step to get reliable results.
#> 2025-06-18 13:57:32: dataframe = mtcars niters = 50 num.trees = 10000. Running step 1
#> 2025-06-18 13:57:48: dataframe = mtcars niters = 50 num.trees = 10000. Running step 50
#> Variables remaining: 7

vimp_per_variable$final_dec_per_variable %>%
  head()
#>   varname    p_unadj  p_adj_FDR p_adj_FWER Type1_confirmed FDR_confirmed
#> 1     mpg 0.01960784 0.03267974  0.1960784               1             1
#> 2     cyl 0.01960784 0.03267974  0.1960784               1             1
#> 3    disp 0.01960784 0.03267974  0.1960784               1             1
#> 4      hp 0.01960784 0.03267974  0.1960784               1             1
#> 5    qsec 0.01960784 0.03267974  0.1960784               1             1
#> 6    carb 0.01960784 0.03267974  0.1960784               1             1
#>   FWER_confirmed
#> 1              0
#> 2              0
#> 3              0
#> 4              0
#> 5              0
#> 6              0

As demonstrated in the output above, the use of per-variable p-values may lead to different conclusions than when pooled p-values are used. Nevertheless, we strongly recommend using pooled p-values to obtain reliable and robust results.

Output structure

So far, we have explored the vimp_history and final_dec_pooled (or final_dec_per_variable) components from the output of the shadow_vimp() function. The function also returns several additional useful components. To illustrate these clearly, let’s revisit the results obtained earlier in this tutorial (in the “Quick Start” section), specifically the following object:

# Definition of vimp_def:
# vimp_def <- shadow_vimp(data = mtcars, outcome_var = "vs", niters = c(20, 30, 50),
#                         num.threads = global_num_threads)
vimp_def
#> shadowVIMP Result
#> 
#> Call:
#>  shadow_vimp(niters = c(20, 30, 50), data = mtcars, outcome_var = "vs", num.threads = global_num_threads) 
#> 
#>    Step Alpha Retained Covariates
#>  Step 1  0.30                  10
#>  Step 2  0.10                   9
#>  Step 3  0.05                   8
#> 
#> Count of significant covariates from step 3 per p-value correction method using the pooled approach
#>  Type-1 Confirmed FDR Confirmed FWER Confirmed
#>                 8             8              7

The structure of the printed results was already discussed in Section “Quick start”, but the output of the shadow_vimp() function contains additional components to explore. For example, the time_elapsed entry provides the runtime of each individual step as well as the total execution time of the algorithm:

vimp_def$time_elapsed
#> $step_1
#> [1] 0.1403273
#> 
#> $step_2
#> [1] 0.1585632
#> 
#> $step_3
#> [1] 0.2632628
#> 
#> $total_time_mins
#> [1] 0.5621532

Under the alpha, you can inspect the significance levels used at each step. The step_all_covariates_removedentry indicates whether all covariates were eliminated before the final step (a positive integer giving the step at which removal occurred) or if at least one covariate survived through to the last step (value 0):

vimp_def$alpha
#> [1] 0.30 0.10 0.05
vimp_def$step_all_covariates_removed
#> [1] 0

The final component of the shadow_vimp() function’s output is the pre_selection list, which includes two sublists (step_1 and step_2 in the case of vimp_def object). Each sublist contains:

vimp_history - data frame storing niters VIMPs of original covariates and their r-shadows,
decision_pooled` - data frame containing p-values and decisions on whether each covariate is informative at a given step,
alpha - the significance level applied during the respective pre-selection step.

The structure of the pre_selection list changes only if save_vimp_history is set to last or none. If one of these two options is chosen, vimp_history data frame will not be included for any pre-selection steps. The other two elements remain unchanged.

The code below demonstrates the described entries using vimp_def object.

# VIMP from 1st step
vimp_def$pre_selection$step_1$vimp_history %>%
  select(1:5) %>%
  head()
#>        mpg      cyl     disp       hp      drat
#> 1 32.74017 37.96075 35.00597 43.10855 10.283379
#> 2 33.24061 38.51238 34.36089 43.90340 12.509940
#> 3 34.12765 38.22569 35.61749 44.00873  8.611728
#> 4 33.20399 38.21009 37.57173 44.69625 12.133609
#> 5 32.45849 38.23485 36.95873 43.53657 12.883142
#> 6 32.10669 37.08169 34.46416 45.28319 11.879613

# Which covariates were considered as informative in the 1st step?
vimp_def$pre_selection$step_1$decision_pooled %>%
  head()
#>   varname quantile_pooled     p_unadj   p_adj_FDR p_adj_FWER Type1_confirmed
#> 1     mpg       0.9950249 0.004975124 0.007107321 0.04975124               1
#> 2     cyl       0.9950249 0.004975124 0.007107321 0.04975124               1
#> 3    disp       0.9950249 0.004975124 0.007107321 0.04975124               1
#> 4      hp       0.9950249 0.004975124 0.007107321 0.04975124               1
#> 5      wt       0.9950249 0.004975124 0.007107321 0.04975124               1
#> 6    qsec       0.9950249 0.004975124 0.007107321 0.04975124               1
#>   FDR_confirmed FWER_confirmed
#> 1             1              1
#> 2             1              1
#> 3             1              1
#> 4             1              1
#> 5             1              1
#> 6             1              1

# The significance level used in the 1st step
vimp_def$pre_selection$step_1$alpha
#> [1] 0.3

Plot your results

In this part of the tutorial, we will continue working with the results obtained previously. Specifically, we will use the vimp_def object created earlier. Instead of just looking at the table that stores the decisions from the last step of our procedure, we would like to have an appealing visual representation of our results. The shadowVIMP package provides an easy way to achieve this with the plot_vimps() function. To plot results based on pooled p-values, simply pass the output of the shadow_vimp() function as demonstrated below:

plot_vimps(shadow_vimp_out = vimp_def)

The boxes in the figure above display the VIMP measures for each covariate based on the niters iterations from the final step of the procedure. The numbers on the left of the plot are unadjusted, FDR-adjusted and FWER-adjusted p-values respectively. To adjust the size of displayed p-values, increase the text_size parameter in the plot_vimps() function (default is 4) in the following way:

plot_vimps(shadow_vimp_out = vimp_def, text_size = 5)

To display a specific number of covariates, such as $4$, use the filter_vars parameter as shown below:

plot_vimps(shadow_vimp_out = vimp_def, filter_vars = 4)
#> Selecting by median

All covariates are displayed by default. If you are working with a large dataset and many covariates are retained until the final step of the procedure, it is recommended to set the filter_vars parameter to ensure the plot remains readable.

If you want to remove the legend or change its position, set the appropriate value for the legend.position parameter:

# Legend on the bottom
plot_vimps(shadow_vimp_out = vimp_def, legend.position = "bottom", text_size = 3)


# Plot without the legend
plot_vimps(shadow_vimp_out = vimp_def, legend.position = "none", text_size = 3)

The p_val_labels parameter of the plot_vimps() function controls whether p-value labels are displayed on the plot. By default, it is set to TRUE, which makes them visible. If you do not want to display p-values on the plot, set the p_val_labels parameter to FALSE:

# No p-values labels
plot_vimps(shadow_vimp_out = vimp_def, p_val_labels = FALSE)

You can also remove the subplot that displays the relationship between FWER, FDR, and unadjusted p-values by setting the helper.legend parameter to FALSE:

plot_vimps(shadow_vimp_out = vimp_def, helper_legend = FALSE, , text_size = 3)

By default, the colors used in the output of the plot_vimps() function are color-blind friendly. However, if you wish to assign different colors to the four types of boxes on the plot, you can do so by passing a named list of colors to the category_colors parameter as follows:

plot_vimps(
  shadow_vimp_out = vimp_def,
  category_colors = c(
    "FWER conf." = "#F56455FF",
    "FDR conf." = "#15134BFF",
    "Unadjusted conf." = "#87C785FF",
    "Not significant" = "#572F30FF"
  ),
  text_size = 3
)

Besides playing around with the parameters of the plot_vimps() function, there are plenty of possibilities to further customize the appearance of your plot. For example, you can set the title and subtitle of your plot and modify their format as follows:

plot_vimps(shadow_vimp_out = vimp_def, text_size = 3) +
  patchwork::plot_annotation(
    title = "My Cool Plot",
    subtitle = "Even better subtitle"
  ) &
  theme(plot.title = element_text(size = 16, face = "bold"))

Below, we will again use the results based on per-variable p-values, keeping all other parameters at their default values:

vimp_per_variable <- shadow_vimp(data = mtcars, outcome_var = "vs",
                                 method = "per_variable", niters = c(20, 30, 50),
                                 num.threads = global_num_threads)

To use the plot_vimps() function for results based on per-variable p-values, the pooled parameter must be set to FALSE:

plot_vimps(shadow_vimp_out = vimp_per_variable, pooled = FALSE, text_size = 3)

Parallel computing

By default, the shadow_vimp() function from the shadowVIMP package utilizes the parallel computing capabilities of ranger::ranger(). Specifically, the default value of the num.threads parameter in shadow_vimp() is set to NULL, which means that half of the available threads are used by ranger::ranger() for parallel tree building. However, as shown below, the user can specify the number of threads to be used by setting the num.threads parameter to a desired numeric value.

# Detect if running on non-interactive environments
is_interactive <- interactive()
if (is_interactive) {
  # When interactive - set num_cores to a selected value
  # Here we want to use 6 cores or if 6 cores are not available, then the number of available cores
  num_cores <- min(parallel::detectCores(), 6)
  vimp_parallel1 <- shadow_vimp(data = mtcars, outcome_var = "vs", niters = c(20, 30, 50),
                        num.threads = num_cores)
} else {
  # When non-interactive - run with 1 thread
  vimp_parallel1 <- shadow_vimp(data = mtcars, outcome_var = "vs", niters = c(20, 30, 50),
                        num.threads = 1)
}
#> alpha 0.3
#> 2025-06-18 13:58:02: dataframe = mtcars niters = 20 num.trees = 10000. Running step 1
#> Variables remaining: 10
#> alpha 0.1
#> 2025-06-18 13:58:08: dataframe = mtcars niters = 30 num.trees = 10000. Running step 1
#> Variables remaining: 8
#> alpha 0.05
#> 2025-06-18 13:58:18: dataframe = mtcars niters = 50 num.trees = 10000. Running step 1
#> 2025-06-18 13:58:33: dataframe = mtcars niters = 50 num.trees = 10000. Running step 50
#> Variables remaining: 7

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.