The \(\textit{binomialRF}\) is a \(\textit{randomForest}\) feature-selection wrapper (Zaim 2019) that treats the random forest as a binomial process: each tree represents an i.i.d. Bernoulli trial for the event of selecting \(X_j\) as the main splitting variable in that tree. The steps below describe the technical details of the algorithm.
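Concretely, under the null hypothesis that \(X_j\) is uninformative, its selection count across trees can be modeled as follows (a sketch consistent with the description above, where \(p_0\) denotes the null per-tree selection probability; in practice the trees are not independent, which the package accounts for via a correlated-binomial null distribution supplied through the \(\texttt{user\_cbinom\_dist}\) argument):

\[
T_j \;=\; \sum_{i=1}^{\text{ntrees}} \mathbb{1}\{X_j \text{ is the main splitting variable in tree } i\} \;\sim\; \mathrm{Binom}(\text{ntrees},\, p_0),
\]

so that a one-sided exact test of \(H_0\) uses the p-value \(P(T_j \ge t_j \mid H_0)\) for the observed count \(t_j\), followed by a multiple-testing (FDR) adjustment across features.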
binomialRF Algorithm.
\(\textit{binomialRF}\) is a wrapper algorithm that internally grows a randomForest object from the supplied parameters. First, we generate simple simulated logistic data as follows:
\(X \sim MVN(0, I_{10})\),
\(p(x) = \frac{1}{1+e^{-(1 + X\beta)}}\), and
\(y \sim Bernoulli(p(x)),\)
where \(\beta\) is a vector of coefficients whose first two entries are set to 3 and the rest to 0:
\[\beta = \begin{bmatrix} 3 & 3 & 0 & \cdots & 0 \end{bmatrix}^T\]
set.seed(324)
### Generate 100 draws of multivariate normal data in R^10
X = matrix(rnorm(1000), ncol=10)
### first two coefficients are 3, the remaining eight are 0
trueBeta = c(rep(3,2), rep(0,8))
### logistic transform (with intercept 1) and generate the labels
z = 1 + X %*% trueBeta
pr = 1/(1+exp(-z))
y = rbinom(100,1,pr)
This produces data that looks like the following:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | y
---|---|---|---|---|---|---|---|---|---|---
0.61 | 1.13 | -0.30 | 0.11 | 0.20 | 1.11 | 1.51 | -0.44 | -0.39 | -1.87 | 1
0.19 | 0.13 | -0.99 | -0.41 | -0.49 | 1.07 | 2.33 | 0.72 | 0.34 | 0.97 | 1
0.54 | -1.00 | -0.47 | -0.48 | 1.74 | 0.23 | 0.13 | 0.95 | -0.99 | 0.12 | 1
0.56 | -2.52 | 0.82 | 0.44 | 1.24 | -0.01 | 0.11 | -0.51 | 0.39 | 1.24 | 0
-0.64 | -1.63 | 1.93 | -0.71 | -0.68 | 0.13 | -0.01 | 0.66 | -0.23 | 0.38 | 0
1.22 | -1.06 | -0.06 | 0.09 | 1.59 | 1.39 | -1.78 | -0.92 | -0.16 | 0.00 | 1
1.27 | -0.81 | 1.18 | 0.23 | 0.90 | 0.35 | 0.58 | -0.83 | 0.25 | 1.79 | 1
-0.57 | 1.51 | 0.39 | -1.74 | -0.57 | -0.40 | 1.12 | 0.76 | 0.44 | 1.11 | 1
-0.62 | -0.92 | -1.19 | 0.23 | -0.05 | -1.18 | -0.25 | -1.73 | -1.27 | -0.04 | 0
-0.97 | 0.43 | -1.13 | -0.18 | -0.59 | -1.76 | -0.62 | 0.72 | 0.12 | 0.73 | 0
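The function call in the next code block passes a user-supplied correlated-binomial null distribution via `user_cbinom_dist = cbinom`, which is not constructed in the snippets shown here. One way to build such a `cbinom` object — a sketch that assumes the `correlbinom` package and binomialRF's `calculateBinomialP` helper — is:

```r
require(correlbinom)

rho    <- 0.33   # assumed between-tree correlation
ntrees <- 1000   # must match the ntrees passed to binomialRF

## null per-tree selection probability for 10 features,
## sampling 60% of them per split
p0 <- binomialRF::calculateBinomialP(10, 0.6)

## correlated-binomial null distribution over ntrees trials
cbinom <- correlbinom::correlbinom(rho, successprob = p0,
                                   trials = ntrees, model = 'kuk')
```

Because `correlbinom` enumerates the full distribution over `trials` outcomes, this step dominates the runtime for large forests.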
Then we can run the binomialRF function call as below:
binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                   ntrees = 1000, percent_features = .6,
                                   fdr.method = 'BY', user_cbinom_dist = cbinom,
                                   sampsize = round(nrow(X)*.33))
print(binom.rf)
#> variable freq significance adjSignificance
#> X2 X2 447 0 0
#> X1 X1 350 0 0
#> X3 X3 123 1 1
#> X6 X6 20 1 1
#> X7 X7 20 1 1
#> X8 X8 15 1 1
#> X4 X4 12 1 1
#> X5 X5 5 1 1
#> X10 X10 5 1 1
#> X9 X9 3 1 1
Note that because the binomial exact test depends on a test statistic measuring the probability of selecting a feature, a single dominant feature can mask all remaining important features, since it will almost always be chosen as the splitting variable. It is therefore important to set the `percent_features` parameter to a value < 1. The results below show how setting the parameter to a fraction between 0.6 and 1 allows other features to stand out as important.
#>
#>
#> binomialRF 100%
#> variable freq significance adjSignificance
#> X2 X2 581 0 0
#> X1 X1 367 0 0
#> X3 X3 42 1 1
#> X6 X6 3 1 1
#> X7 X7 3 1 1
#> X4 X4 1 1 1
#> X5 X5 1 1 1
#> X8 X8 1 1 1
#> X9 X9 1 1 1
#>
#>
#> binomialRF 80%
#> variable freq significance adjSignificance
#> X2 X2 527 0 0
#> X1 X1 367 0 0
#> X3 X3 83 1 1
#> X7 X7 10 1 1
#> X6 X6 8 1 1
#> X5 X5 2 1 1
#> X4 X4 1 1 1
#> X8 X8 1 1 1
#> X10 X10 1 1 1
#>
#>
#> binomialRF 60%
#> variable freq significance adjSignificance
#> X2 X2 448 0 0
#> X1 X1 351 0 0
#> X3 X3 128 1 1
#> X7 X7 23 1 1
#> X6 X6 21 1 1
#> X10 X10 9 1 1
#> X8 X8 8 1 1
#> X5 X5 7 1 1
#> X4 X4 4 1 1
#> X9 X9 1 1 1
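The three outputs above can be produced by re-running the wrapper while varying only `percent_features` — a sketch, assuming `X`, `y`, and `cbinom` are defined as earlier in this vignette:

```r
for (pct in c(1, 0.8, 0.6)) {
  cat('\n\nbinomialRF ', pct * 100, '%\n', sep = '')
  ## note: strictly, the null success probability (and hence cbinom)
  ## depends on percent_features; it is held fixed here for simplicity
  binom.rf <- binomialRF::binomialRF(X, factor(y),
                                     fdr.threshold = .05,
                                     ntrees = 1000,
                                     percent_features = pct,
                                     fdr.method = 'BY',
                                     user_cbinom_dist = cbinom,
                                     sampsize = round(nrow(X) * .33))
  print(binom.rf)
}
```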
We recommend growing at least 500 to 1,000 trees so that the algorithm has a chance to stabilize, and choosing ntrees as a function of the number of features in your dataset. The ntrees tuning parameter must be set in conjunction with percent_features, as these two are inter-connected, as well as with the number of true features in the model. Since the correlbinom function call is slow to execute for ntrees > 1000, we recommend growing random forests with only 500-1,000 trees.
#>
#>
#> binomialRF 500 trees
#> variable freq significance adjSignificance
#> X2 X2 185 1.110223e-15 3.251808e-14
#> X1 X1 167 3.544187e-10 5.190406e-09
#> X3 X3 75 9.998814e-01 1.000000e+00
#> X7 X7 25 1.000000e+00 1.000000e+00
#> X10 X10 14 1.000000e+00 1.000000e+00
#> X6 X6 13 1.000000e+00 1.000000e+00
#> X8 X8 7 1.000000e+00 1.000000e+00
#> X5 X5 5 1.000000e+00 1.000000e+00
#> X9 X9 5 1.000000e+00 1.000000e+00
#> X4 X4 4 1.000000e+00 1.000000e+00
#>
#>
#> binomialRF 1000 trees
#> variable freq significance adjSignificance
#> X2 X2 385 0 0
#> X1 X1 332 0 0
#> X3 X3 131 1 1
#> X7 X7 51 1 1
#> X6 X6 33 1 1
#> X8 X8 20 1 1
#> X10 X10 13 1 1
#> X4 X4 12 1 1
#> X9 X9 12 1 1
#> X5 X5 11 1 1
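The 500- versus 1,000-tree comparison above varies only ntrees; note that the correlated-binomial null distribution must be regenerated with `trials` equal to the new ntrees. A sketch, under the same assumptions as before:

```r
for (nt in c(500, 1000)) {
  ## regenerate the null distribution to match the forest size
  cbinom.nt <- correlbinom::correlbinom(0.33,
                    successprob = binomialRF::calculateBinomialP(10, .6),
                    trials = nt, model = 'kuk')
  cat('\n\nbinomialRF ', nt, ' trees\n', sep = '')
  print(binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                               ntrees = nt, percent_features = .6,
                               fdr.method = 'BY',
                               user_cbinom_dist = cbinom.nt,
                               sampsize = round(nrow(X) * .33)))
}
```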