The MXM R package, short for the Latin 'Mens ex Machina' (Mind from the Machine), is a collection of utility functions for feature selection, cross-validation and Bayesian networks. MXM offers many feature selection algorithms focused on providing one or more minimal feature subsets, also referred to as variable signatures, that can be used to improve the performance of downstream analysis tasks such as regression and classification by excluding irrelevant and redundant variables.
In this tutorial we will learn how to use the Forward Backward Early Dropping (FBED) algorithm. The algorithm is a variation of the usual forward selection: at every step, the most significant variable enters the selected variables set, while only the remaining significant variables stay and are examined further; the non-significant ones are dropped. This continues until no variable can enter the set. The user has the option to repeat this forward phase one or more times (the argument K). At the end, a backward selection is performed to remove falsely selected variables.
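To make the description above more concrete, here is a deliberately simplified sketch of the FBED logic in plain R for a continuous target, using ordinary lm-based p-values. It only illustrates the idea and is not the MXM implementation, which supports many different tests and is far more efficient; the function name fbed_sketch and its arguments are purely illustrative.

### ~ ~ ~ Simplified Sketch Of The FBED Logic (Illustration Only) ~ ~ ~ ###
# y: numeric target vector; x: data.frame of numeric predictors.
fbed_sketch <- function(y, x, alpha = 0.05, K = 0) {
  selected <- integer(0)
  for (run in 0:K) {                                   # K extra forward runs
    remaining <- setdiff(seq_len(ncol(x)), selected)
    repeat {
      if (length(remaining) == 0) break
      # p-value of each candidate, conditional on the already selected variables
      pvals <- sapply(remaining, function(j) {
        fit <- lm(y ~ ., data = data.frame(y = y, x[, c(selected, j), drop = FALSE]))
        cf  <- summary(fit)$coefficients
        cf[nrow(cf), 4]                                # candidate is the last term
      })
      significant <- pvals <= alpha
      if (!any(significant)) break                     # nothing can enter any more
      best <- remaining[significant][which.min(pvals[significant])]
      selected  <- c(selected, best)                   # most significant variable enters
      remaining <- setdiff(remaining[significant], best)  # early dropping of the rest
    }
  }
  # Backward phase: remove selected variables that are no longer significant
  for (j in rev(selected)) {
    others  <- setdiff(selected, j)
    full    <- lm(y ~ ., data = data.frame(y = y, x[, selected, drop = FALSE]))
    reduced <- if (length(others) > 0) {
      lm(y ~ ., data = data.frame(y = y, x[, others, drop = FALSE]))
    } else {
      lm(y ~ 1)
    }
    if (anova(reduced, full)[["Pr(>F)"]][2] > alpha) selected <- others
  }
  selected                                             # indices of the selected columns
}

For a numeric target y and a data.frame of predictors x, fbed_sketch(y, x, K = 1) would perform two forward runs followed by the backward cleanup.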
For simplicity, in this tutorial we will use a dataset referred to as "The Wine Dataset".
The Wine Dataset contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. Note that the “Type” variable was transformed into a categorical variable.
So, first of all, for this tutorial analysis, we load the 'MXM' library, along with the 'dplyr' library for easier handling of the dataset. The 'hash' library should also be imported if an analysis is going to be applied more than once (we will explain this in the TIPS section).
### ~ ~ ~ Load Packages ~ ~ ~ ###
library(MXM)
library(dplyr)
In the next step we download and open the dataset, also defining the column names.
### ~ ~ ~ Load The Dataset ~ ~ ~ ###
wine.url <- "ftp://ftp.ics.uci.edu/pub/machine-learning-databases/wine/wine.data"
wine <- read.csv(wine.url,
check.names = FALSE,
header = FALSE)
head(wine)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39 0.34 1.97 6.75 1.05 2.85 1450
str(wine)
## 'data.frame': 178 obs. of 14 variables:
## $ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2 : num 14.2 13.2 13.2 14.4 13.2 ...
## $ V3 : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ V4 : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ V5 : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ V6 : int 127 100 101 113 118 112 96 121 97 98 ...
## $ V7 : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ V8 : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ V9 : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ V10: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ V11: num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ V12: num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ V13: num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ V14: int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash',
'Alcalinity', 'Magnesium', 'Phenols',
'Flavanoids', 'Nonflavanoids',
'Proanthocyanins', 'Color', 'Hue',
'Dilution', 'Proline')
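Before going any further, it is worth confirming that the dataset contains no missing values; as mentioned in the argument descriptions further below, fbed.reg() would otherwise replace missing predictor values by the corresponding column mean.

### ~ ~ ~ Quick Check For Missing Values ~ ~ ~ ###
sum(is.na(wine))   # total number of missing values; 0 means the dataset is complete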
For this tutorial example, we are going to apply the FBED algorithm on the above dataset, using only continuous variables both as predictors and as the target.
The selection of the appropriate conditional independence test is a crucial decision for the validity and success of downstream statistical analysis and machine learning tasks. Currently the MXM R package supports numerous tests for different combinations of target (dependent) and predictor (independent) variables. A detailed summary table to guide you through the selection of the most suitable test can be found in MXM's reference manual (p. 21, "CondInditional independence tests") here: https://CRAN.R-project.org/package=MXM. In our example we will use MXM::fbed.reg(), which is the implementation of the FBED algorithm, and since we are going to examine only continuous variables, we will use Fisher's conditional independence test ('testIndFisher').
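A quick way to confirm that the variables involved are indeed continuous, and hence that 'testIndFisher' is applicable, is to inspect their classes:

### ~ ~ ~ Inspect The Variable Types ~ ~ ~ ###
sapply(wine, class)   # all columns are numeric or integer; 'Type' is the one we treat as categorical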
dataset
- A numeric matrix (or a data.frame in case of categorical predictors) containing the variables for performing the test. The rows should refer to the different samples and the columns to the features. For the purposes of this example analysis, we are going to use only the continuous variables, therefore we remove the "Type" variable from the dataset. Furthermore, we remove the "Nonflavanoids" variable, because we will use it as the target.
### ~ ~ ~ Removing The Categorical ('Type') and The Target ('Nonflavanoids') Variables ~ ~ ~ ###
wine_dataset <- dplyr::select(wine,
-contains("Type"),
-contains("Nonflavanoids"))
head(wine_dataset)
## Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Proanthocyanins
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06 2.29
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76 1.28
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24 2.81
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49 2.18
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69 1.82
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39 1.97
## Color Hue Dilution Proline
## 1 5.64 1.04 3.92 1065
## 2 4.38 1.05 3.40 1050
## 3 5.68 1.03 3.17 1185
## 4 7.80 0.86 3.45 1480
## 5 4.32 1.04 2.93 735
## 6 6.75 1.05 2.85 1450
target
- The class variable, containing the values of the target variable. We should provide either a string, an integer, a numeric value, a vector, a factor, an ordered factor or a Surv object. For the purposes of this example analysis, we are going to use Nonflavanoids as the dependent variable.
wine_target <- wine$Nonflavanoids
head(wine_target)
## [1] 0.28 0.26 0.30 0.24 0.39 0.34
This is the first time we are running the algorithm, so we are going to explain what each argument refers to:
target
: The class variable. Provide either a string, an integer, a numeric value, a vector, a factor, an ordered factor or a Surv object. As explained above, this will be the dependent variable. If the target is a single integer value or a string, it has to correspond to the column number or to the name of the target feature in the dataset. Here we choose wine$Nonflavanoids.
dataset
: The dataset. Provide either a data frame or a matrix. If the dataset (predictor variables) contains missing (NA) values, they will automatically be replaced by the current variable (column) mean value, with an appropriate warning to the user after the execution. Here we choose the whole wine dataset, except for the Type (categorical) and Nonflavanoids (target) variables.
test
: The conditional independence test to use. The default value is NULL. Here, since our dataset includes only continuous features (remember: the categorical variable 'Type' was removed) and our dependent variable is also continuous, we choose 'testIndFisher'. For more information about which test to use, please visit: https://www.rdocumentation.org/packages/MXM/versions/0.9.7/topics/CondInditional%20independence%20tests.
threshold
: Threshold (suitable values in [0, 1]) for the significance of the p-values. The default value is 0.05. Here we choose the default value 0.05.
wei
: A vector of weights to be used for weighted regression. The default value is NULL. It is not suggested when robust is set to TRUE. If you want to use the "testIndBinom" test, then supply the successes in the target and the trials here. Here we choose the default value NULL.
K
: How many times should the forward phase be repeated? The default value is 0. Here we choose 10.
method
: Do you want the likelihood ratio test to be performed ("LR" is the default value), or should the selection be done using the "eBIC" criterion (BIC is a special case)? Here we choose "eBIC".
gam
: In case the method is chosen to be "eBIC", one can also specify the gamma parameter. The default value is NULL, so that the value is automatically calculated. Here, although we choose "eBIC" as the selection criterion, we do not specify any gamma parameter.
backward
: After the Forward Early Dropping phase, the algorithm proceeds with the usual backward selection phase. The default value is TRUE. It is advised to perform this step, as some of the selected variables may be false positives (wrongly selected). The backward phase using the likelihood ratio test and the one using eBIC are two different functions that can also be called directly by the user; so, if you want, for example, to perform a backward regression with a different threshold value, just use these two functions separately. Here we set the backward argument to TRUE.
### ~ ~ ~ Running FBED For First Time ~ ~ ~ ###
fbed_default_1st <- MXM::fbed.reg(target = wine_target,
dataset = wine_dataset,
test = "testIndFisher",
threshold = 0.05,
wei = NULL,
K = 10,
method = "eBIC",
gam = NULL,
backward = TRUE)
So, the algorithm ran… Let's see what information we can take out of it.
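Before looking at the individual components, you can get a compact overview of everything the returned object contains; this is a small optional step, not part of the original analysis.

### ~ ~ ~ Overview Of The Returned Object ~ ~ ~ ###
str(fbed_default_1st, max.level = 1)   # lists components such as res, info, back.rem, back.n.tests, runtime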
The main purpose of running the FBED algorithm is to see which variables should be selected as important. The indices of those variables are stored in res. Furthermore, in this matrix we see the corresponding selection score: because we used method = "eBIC", this is the eBIC difference (with method = "LR", the test statistic and the associated p-value would be reported instead).
fbed_default_1st$res
## Vars eBIC difference
## [1,] 7 -241.6824
## [2,] 3 -281.8461
## [3,] 5 -292.1648
SelectedVars_names<-colnames(wine_dataset[fbed_default_1st$res[,1]])
SelectedVars_names
## [1] "Flavanoids" "Ash" "Magnesium"
Here we see which features were selected as important. And yes, as you can see in the second column of the res matrix, they are sorted… don't worry!
As you may imagine, you can also retrieve the information about their scores. They are all (sorted) in the second column.
fbed_default_1st$res[,2]
## [1] -241.6824 -281.8461 -292.1648
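Having obtained the signature, a common next step is to use it in a downstream model. As a minimal illustration (an ordinary least-squares fit, which is our own choice of downstream task, not something fbed.reg() does itself; the name signature_fit is just a placeholder):

### ~ ~ ~ Example Downstream Use Of The Signature ~ ~ ~ ###
signature_fit <- lm(wine_target ~ ., data = wine_dataset[, SelectedVars_names])
summary(signature_fit)$coefficients   # coefficients of the selected variables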
Perfect! But we see that the function returned an object called info. What is this?
fbed_default_1st$info
## Number of vars Number of tests
## K=0 1 20
## K=1 2 11
## K=2 3 10
## K=3 3 9
The info matrix describes the number of variables selected and the number of tests performed (or models fitted) at each round (remember the value of K, which in this example we set to 10; the algorithm did not reach K = 10, because there was no change after the 3rd round, so it stopped earlier). This refers to the forward phase only: for each K, the number of selected variables is returned together with the number of tests performed.
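If you want to see how K affects the forward phase on this dataset, you can simply rerun the selection with K = 0 (a single forward run) and compare the resulting info matrices; this is a quick optional experiment, not part of the original analysis, and the name fbed_K0 is just for this illustration.

### ~ ~ ~ Comparing Different Values Of K (optional) ~ ~ ~ ###
fbed_K0 <- MXM::fbed.reg(target = wine_target,
                         dataset = wine_dataset,
                         test = "testIndFisher",
                         K = 0)   # no extra forward runs
fbed_K0$info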
So, if the information about the forward phase is stored in the info matrix, where can we find information about the backward phase?
fbed_default_1st$back.rem
## numeric(0)
By calling back.rem, we get the variables that were removed in the backward phase. In case we are interested in the number of models that were fitted in the backward phase, all we have to do is look at the back.n.tests variable.
fbed_default_1st$back.n.tests
## [1] 3
And how quickly did all this happen?
fbed_default_1st$runtime
## user system elapsed
## 0.04 0.00 0.05
Let us now run FBED a second time, this time passing the whole wine data frame, which also contains the categorical Type variable. Since the predictors are then no longer exclusively continuous, we switch from 'testIndFisher' to the regression-based 'testIndReg', which fits linear regression models and can therefore also handle a data.frame with categorical predictors. Note that if we instead wanted to predict the Type variable itself (a factor with more than two unordered levels) from continuous features, then, according to MXM's reference manual (p. 21, "CondInditional independence tests", https://CRAN.R-project.org/package=MXM), we should use multinomial logistic regression ('testIndMultinom').
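For completeness, a sketch of what such a call with the categorical Type variable as the target could look like is shown below; it is not executed in this tutorial, so treat the exact arguments as an assumption to be checked against the reference manual, and the names type_target and fbed_type as placeholders.

### ~ ~ ~ Sketch: FBED With The Categorical 'Type' As Target (not run here) ~ ~ ~ ###
type_target <- as.factor(wine$Type)
fbed_type <- MXM::fbed.reg(target = type_target,
                           dataset = dplyr::select(wine, -contains("Type")),
                           test = "testIndMultinom")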
In this step, we keep the whole dataset, in order to show how the algorithm can also be used without removing any columns from the initial data frame.
### ~ ~ ~ Taking The Whole Dataset ~ ~ ~ ###
wine_dataset <- wine
head(wine_dataset)
## Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 1 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 1 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoids Proanthocyanins Color Hue Dilution Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
We will not create a different vector for the target this time; we keep using Nonflavanoids. The categorical Type variable now stays in the dataset as a predictor, so we first convert it into a factor. As for using Type itself as the target, please… be patient…
### ~ ~ ~ Running FBED For Categorical Variable ~ ~ ~ ###
wine[, 1] <- as.factor(wine[, 1])
fbed_default_2nd <- MXM::fbed.reg(target = wine_target,
dataset = wine_dataset,
test = "testIndReg",
threshold = 0.05,
wei = NULL,
K = 10,
method = "eBIC",
gam = NULL,
backward = TRUE)
So, the algorithm ran once again… Let's see what information we can take out of it.
The main purpose of running the FBED algorithm is to see which variables should be selected as important. The indices of those variables are again stored in res.
fbed_default_2nd$res
## Vars eBIC difference
## Vars 9 -227.3386
SelectedVars_names<-colnames(wine_dataset[fbed_default_2nd$res[,1]])
SelectedVars_names
## [1] "Nonflavanoids"
Again we see the scores in the second column. Note that the only selected variable is Nonflavanoids, i.e. the target itself: since the target column was left inside the dataset this time, it is trivially the variable most strongly associated with the target, which is why, in practice, the target should be removed from the dataset before running the algorithm.
fbed_default_2nd$res[,2]
## [1] -227.3386
What was stored this time in the info matrix?
fbed_default_2nd$info
## Number of vars Number of tests
## K=0 1 24
## K=1 1 13
The info matrix again describes the number of variables selected and the number of tests performed (or models fitted) at each round (remember the value of K, which we again set to 10; this time the algorithm stopped even earlier, because there was no change after the 2nd round). Again, this refers to the forward phase only: for each K, the number of selected variables is returned together with the number of tests performed.
And now let us inspect the backward phase:
fbed_default_2nd$back.rem
## numeric(0)
fbed_default_2nd$back.n.tests
## [1] 1
And how quickly did all this happen?
fbed_default_2nd$runtime
## user system elapsed
## 0.03 0.00 0.03
Now you are ready to run your own analysis using the MXM::fbed.reg() implementation of the FBED algorithm!
Thank you for your attention.
Hope that you found this tutorial helpful.
All analyses have been run on:
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=Greek_Greece.1253
## [3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
## [5] LC_TIME=Greek_Greece.1253
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.7.4 MXM_1.3.3 BiocStyle_2.6.1
##
## loaded via a namespace (and not attached):
## [1] slam_0.1-40 sets_1.0-17 splines_3.4.2
## [4] lattice_0.20-35 htmltools_0.3.6 hash_2.2.6
## [7] yaml_2.1.14 rlang_0.1.4 survival_2.41-3
## [10] R.oo_1.21.0 nloptr_1.0.4 glue_1.2.0
## [13] R.utils_2.6.0 RcppZiggurat_0.1.4 bindrcpp_0.2
## [16] bindr_0.1 foreach_1.4.4 R.cache_0.12.0
## [19] stringr_1.2.0 MatrixModels_0.4-1 R.methodsS3_1.7.1
## [22] visNetwork_2.0.1 htmlwidgets_0.9 codetools_0.2-15
## [25] evaluate_0.10.1 geepack_1.2-1 knitr_1.17
## [28] SparseM_1.77 doParallel_1.0.11 quantreg_5.34
## [31] parallel_3.4.2 Rfast_1.8.8 Rcpp_0.12.13
## [34] relations_0.6-7 backports_1.1.1 jsonlite_1.5
## [37] R.rsp_0.41.0 lme4_1.1-14 digest_0.6.14
## [40] stringi_1.1.6 bookdown_0.5 ordinal_2015.6-28
## [43] grid_3.4.2 rprojroot_1.2 tools_3.4.2
## [46] magrittr_1.5 tibble_1.3.4 cluster_2.0.6
## [49] ucminf_1.1-4 pkgconfig_2.0.1 MASS_7.3-47
## [52] Matrix_1.2-11 energy_1.7-2 assertthat_0.2.0
## [55] minqa_1.2.4 rmarkdown_1.8 iterators_1.0.8
## [58] R6_2.2.2 boot_1.3-20 nnet_7.3-12
## [61] nlme_3.1-131 compiler_3.4.2
Borboudakis G. and Tsamardinos I. (2017). Forward-Backward Selection with Early Dropping. https://arxiv.org/pdf/1705.10770.pdf