The \(\textit{binomialRF}\) algorithm is a \(\textit{randomForest}\) feature selection wrapper (Zaim 2019) that treats the random forest as a binomial process: each tree is an i.i.d. Bernoulli trial for the event of selecting \(X_j\) as the main splitting variable in that tree. The remainder of this vignette walks through the technical aspects of the algorithm.
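Concretely, let \(T_i^{(j)}\) indicate the event that tree \(i\) selects \(X_j\) as its main splitting variable, and let \(p_0\) denote the probability of that event when \(X_j\) is uninformative (this notation is ours, for illustration). Under the null hypothesis, the selection count across the forest is binomial:
\[S_j = \sum_{i=1}^{ntrees} T_i^{(j)}, \qquad T_i^{(j)} \overset{iid}{\sim} Bernoulli(p_0) \implies S_j \sim Binomial(ntrees,\, p_0).\]
A one-sided binomial exact test on the observed count \(s_j\) then flags features selected more often than chance allows. In practice the trees are not fully independent, since they are grown on overlapping samples of the same data, which is why the function call below supplies a correlated binomial null distribution through the user_cbinom_dist argument.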
Since \(\textit{binomialRF}\) is a wrapper algorithm, it internally calls and grows a randomForest object from the supplied parameters. First, we generate simple simulated logistic data as follows:
\(X \sim MVN(0, I_{10})\),
\(p(x) = \frac{1}{1+e^{-(1 + X\beta)}}\), and
\(y \sim Bernoulli(p)\),
where \(\beta\) is the coefficient vector whose first two entries are set to 3 and whose remaining entries are 0:
\[\beta = \begin{bmatrix} 3 & 3 & 0 & \cdots & 0 \end{bmatrix}^T\]
set.seed(324)
### generate 100 samples of multivariate normal data in R^10
X = matrix(rnorm(1000), ncol=10)
### set the first two coefficients to 3 and the remaining eight to 0
trueBeta = c(rep(3,2), rep(0,8))
### apply the logistic transform and generate the binary labels
z = 1 + X %*% trueBeta
pr = 1/(1+exp(-z))
y = rbinom(100, 1, pr)
This generates data that look like the following (first ten rows shown, with the ten predictors \(X_1, \dots, X_{10}\) followed by the label \(y\)):
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | y
---|---|---|---|---|---|---|---|---|---|---
0.61 | 1.13 | -0.30 | 0.11 | 0.20 | 1.11 | 1.51 | -0.44 | -0.39 | -1.87 | 1 |
0.19 | 0.13 | -0.99 | -0.41 | -0.49 | 1.07 | 2.33 | 0.72 | 0.34 | 0.97 | 1 |
0.54 | -1.00 | -0.47 | -0.48 | 1.74 | 0.23 | 0.13 | 0.95 | -0.99 | 0.12 | 1 |
0.56 | -2.52 | 0.82 | 0.44 | 1.24 | -0.01 | 0.11 | -0.51 | 0.39 | 1.24 | 0 |
-0.64 | -1.63 | 1.93 | -0.71 | -0.68 | 0.13 | -0.01 | 0.66 | -0.23 | 0.38 | 0 |
1.22 | -1.06 | -0.06 | 0.09 | 1.59 | 1.39 | -1.78 | -0.92 | -0.16 | 0.00 | 1 |
1.27 | -0.81 | 1.18 | 0.23 | 0.90 | 0.35 | 0.58 | -0.83 | 0.25 | 1.79 | 1 |
-0.57 | 1.51 | 0.39 | -1.74 | -0.57 | -0.40 | 1.12 | 0.76 | 0.44 | 1.11 | 1 |
-0.62 | -0.92 | -1.19 | 0.23 | -0.05 | -1.18 | -0.25 | -1.73 | -1.27 | -0.04 | 0 |
-0.97 | 0.43 | -1.13 | -0.18 | -0.59 | -1.76 | -0.62 | 0.72 | 0.12 | 0.73 | 0 |
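A preview like the one above can be printed directly from the simulated objects; a one-line sketch, where the data.frame wrapper is used only for display:
head(data.frame(X, y = y), 10)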
Then we can run the binomialRF function call as shown below.
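The call references ntrees and cbinom (a correlated binomial null distribution), which must be constructed first. A minimal sketch of that setup, assuming the CRAN correlbinom package and the calculateBinomialP helper exported by binomialRF; the rho value and model choice here are illustrative, not prescriptive:
### number of trees to grow; must match the ntrees argument passed to binomialRF
ntrees = 500
### correlation between trees induced by shared training data (illustrative value)
rho = 0.33
### correlated binomial null distribution: probability of selecting one of ncol(X)
### features as the main split, sampling 60% of features (matching percent_features = .6)
cbinom = correlbinom::correlbinom(rho,
                                  successprob = binomialRF::calculateBinomialP(ncol(X), .6),
                                  trials = ntrees, model = 'kuk')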
binom.rf <- binomialRF::binomialRF(X,factor(y), fdr.threshold = .05,
ntrees = ntrees,percent_features = .6,
fdr.method = 'BY', user_cbinom_dist = cbinom,
sampsize = round(nrow(X)*.33))
print(binom.rf)
#> variable freq significance adjSignificance
#> X2 X2 114 0.000000e+00 0.00000e+00
#> X1 X1 94 3.765189e-09 4.79322e-08
#> X3 X3 23 9.999999e-01 1.00000e+00
#> X6 X6 6 1.000000e+00 1.00000e+00
#> X7 X7 5 1.000000e+00 1.00000e+00
#> X8 X8 5 1.000000e+00 1.00000e+00
#> X4 X4 1 1.000000e+00 1.00000e+00
#> X5 X5 1 1.000000e+00 1.00000e+00
#> X10 X10 1 1.000000e+00 1.00000e+00
Note that the binomial exact test is contingent on a test statistic measuring how often each feature is selected. If one feature dominates, it will always be chosen as the splitting variable, masking the remaining 'important' features. It is therefore important to set the percent_features parameter to a value less than 1, so that the dominant feature is excluded from some candidate sets. The results below show how setting the parameter to a fraction between .6 and 1 allows other features to stand out as important.
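The comparison below can be reproduced by rerunning binomialRF at several percent_features settings; a sketch of that loop, assuming rho and ntrees as defined above (the loop structure and the per-setting recomputation of the null distribution are our illustration):
for (pf in c(1, .8, .6)) {
  ### recompute the null distribution, since the selection probability depends on percent_features
  cbinom.pf <- correlbinom::correlbinom(rho,
                                        successprob = binomialRF::calculateBinomialP(ncol(X), pf),
                                        trials = ntrees, model = 'kuk')
  binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                     ntrees = ntrees, percent_features = pf,
                                     fdr.method = 'BY', user_cbinom_dist = cbinom.pf,
                                     sampsize = round(nrow(X)*.33))
  cat(paste0('\n\nbinomialRF ', pf*100, '%\n'))
  print(binom.rf)
}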
#>
#>
#> binomialRF 100%
#> variable freq significance adjSignificance
#> X2 X2 151 0.000000e+00 0.000000e+00
#> X1 X1 88 3.616441e-07 2.658084e-06
#> X3 X3 8 1.000000e+00 1.000000e+00
#> X5 X5 1 1.000000e+00 1.000000e+00
#> X6 X6 1 1.000000e+00 1.000000e+00
#> X8 X8 1 1.000000e+00 1.000000e+00
#>
#>
#> binomialRF 80%
#> variable freq significance adjSignificance
#> X2 X2 128 0.000000e+00 0.000000000
#> X1 X1 83 1.021044e-05 0.000111002
#> X3 X3 25 9.999994e-01 1.000000000
#> X6 X6 5 1.000000e+00 1.000000000
#> X7 X7 4 1.000000e+00 1.000000000
#> X8 X8 2 1.000000e+00 1.000000000
#> X10 X10 2 1.000000e+00 1.000000000
#> X4 X4 1 1.000000e+00 1.000000000
#>
#>
#> binomialRF 60%
#> variable freq significance adjSignificance
#> X2 X2 114 0.000000e+00 0.000000e+00
#> X1 X1 84 5.416284e-06 7.932062e-05
#> X3 X3 20 1.000000e+00 1.000000e+00
#> X6 X6 8 1.000000e+00 1.000000e+00
#> X4 X4 6 1.000000e+00 1.000000e+00
#> X7 X7 4 1.000000e+00 1.000000e+00
#> X8 X8 4 1.000000e+00 1.000000e+00
#> X9 X9 4 1.000000e+00 1.000000e+00
#> X5 X5 3 1.000000e+00 1.000000e+00
#> X10 X10 3 1.000000e+00 1.000000e+00
We recommend growing at least 500 to 1,000 trees so that the algorithm has a chance to stabilize, and choosing ntrees as a function of the number of features in your dataset. The ntrees tuning parameter must be set in conjunction with percent_features, as the two are interconnected, and with the number of true features in the model. Since the correlbinom function call is slow to execute for ntrees > 1000, we recommend keeping random forests in the 500 to 1,000 tree range.
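A sketch of the tree-count comparison below; note that the correlated binomial null distribution depends on the number of trials, so it must be recomputed for each ntrees value (illustrative loop, assuming rho as defined above):
for (nt in c(250, 500)) {
  ### rebuild the null distribution for this forest size
  cbinom.nt <- correlbinom::correlbinom(rho,
                                        successprob = binomialRF::calculateBinomialP(ncol(X), .6),
                                        trials = nt, model = 'kuk')
  binom.rf <- binomialRF::binomialRF(X, factor(y), fdr.threshold = .05,
                                     ntrees = nt, percent_features = .6,
                                     fdr.method = 'BY', user_cbinom_dist = cbinom.nt,
                                     sampsize = round(nrow(X)*.33))
  cat(paste0('\n\nbinomialRF ', nt, ' trees\n'))
  print(binom.rf)
}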
#>
#>
#> binomialRF 250 trees
#> variable freq significance adjSignificance
#> X2 X2 191 0.000000e+00 0.00000e+00
#> X1 X1 172 1.406475e-11 2.05976e-10
#> X3 X3 64 9.999997e-01 1.00000e+00
#> X7 X7 26 1.000000e+00 1.00000e+00
#> X6 X6 9 1.000000e+00 1.00000e+00
#> X10 X10 9 1.000000e+00 1.00000e+00
#> X4 X4 8 1.000000e+00 1.00000e+00
#> X5 X5 8 1.000000e+00 1.00000e+00
#> X9 X9 8 1.000000e+00 1.00000e+00
#> X8 X8 5 1.000000e+00 1.00000e+00
#>
#>
#> binomialRF 500 trees
#> variable freq significance adjSignificance
#> X2 X2 91 3.978159e-08 1.165190e-06
#> X1 X1 87 7.297797e-07 1.068751e-05
#> X3 X3 35 9.988915e-01 1.000000e+00
#> X6 X6 12 1.000000e+00 1.000000e+00
#> X7 X7 11 1.000000e+00 1.000000e+00
#> X8 X8 5 1.000000e+00 1.000000e+00
#> X9 X9 3 1.000000e+00 1.000000e+00
#> X10 X10 3 1.000000e+00 1.000000e+00
#> X5 X5 2 1.000000e+00 1.000000e+00
#> X4 X4 1 1.000000e+00 1.000000e+00