In this tutorial, we present the exprso package for R, a custom library built to tackle a wide variety of supervised machine learning tasks, including the construction of ensemble classifiers. We designed this package with the biologist user in mind, carefully assembling a toolkit that allows investigators to quickly implement dichotomous (binary) and multi-class classification on high-dimensional genomic data. We designed exprso using a modular framework, whereby each function acts as a self-contained, yet interchangeable, part of the whole. With these modules, the investigator has access to multiple tools that they can use in almost any sequence to build their very own personalized machine learning pipeline. By rearranging these modules, one can easily experiment with new pipeline ideas on the fly. In this way, we balance the simplicity of automation with endless customization, all while maintaining software extensibility.
We have bundled many kinds of functions into the exprso package, including functions for data import, data modification, feature selection, individual classifier construction, and ensemble classifier construction. We begin this tutorial by first presenting a brief overview of the objects and functions found in exprso. Then, we move on to show how to retrieve this package from CRAN and import data into exprso.
This package contains five principal object types, listed below. Objects in R serve as a convenient way to organize data; one can think of an object simply as a custom-built data container. We emphasize these now based on the notion that a better understanding of the back-end promotes a better use of the front-end.
- ExprsArray: Stores the feature data along with the subject annotations. The ExprsBinary and ExprsMulti sub-classes handle dichotomous and multi-class data, respectively.
- ExprsModel: Stores a trained classifier along with its feature selection history. The ExprsMachine and ExprsModule sub-classes handle dichotomous and multi-class classifiers, respectively.
- ExprsPipeline: Stores the results of high-throughput analyses (e.g., parameter grid-searches and cross-validation).
- ExprsEnsemble: Stores multiple classifiers assembled into a single ensemble.
- ExprsPredict: Stores the results of classifier deployment (i.e., prediction).
The functions included in this package rely on the objects listed above. Some of these functions return an updated version of the same object type provided, while others return a different object type altogether. We have adopted a function nomenclature to help organize the functions available in this package. In this scheme, most functions have a few letters in the beginning of their name which designate their general utility. Below, we include a brief description of these function prefixes.
- arrayExprs: Imports data. Returns an ExprsArray object.
- mod: Modifies ExprsArray objects. Returns an updated ExprsArray object.
- split: Splits ExprsArray objects into training and test sets. Returns a list of two ExprsArray objects.
- fs: Performs feature selection. Returns an updated ExprsArray object.
- build: Builds a classifier. Returns an ExprsModel or ExprsEnsemble object.
- pl: Executes high-throughput pipelines. Returns an ExprsPipeline object.
- pipe: Processes ExprsPipeline objects. Returns an updated ExprsPipeline object.
We can load the most recent version of exprso directly from CRAN using install.packages.
install.packages("exprso")
library(exprso)
With exprso loaded, we can import data in one of three ways. First, we can import data stored locally as a tab-delimited text file or a data.frame. The construction of the import function, arrayExprs, assumes these data contain subjects as rows and variables as columns, with the annotation variables (e.g., sex or age) appearing before the feature variables (e.g., mRNA expression values). The include argument lists any number of annotations in the colBy column to assign to each class label. In the example below, we assign the annotation “Control” to the first class; then, we assign the annotations “ASD” and “PDDNOS” to the second class. Any subjects not assigned to a class will get dropped.
array <-
  arrayExprs("some_example_data_file.txt",   # tab-delimited file
             colBy = "DX",                   # column name with class labels
             include = list(c("Control"), c("ASD", "PDDNOS")),
             begin = 11,                     # the i-th column where features begin
             colID = "Subject.ID")           # column name with subject ID
Second, we can also use arrayExprs to directly import an eSet object, a popular container for storing gene expression data. In the example below, we convert the classic ALL/AML Golub gene expression dataset from an eSet object (provided by the golubEsets package) into an ExprsArray object.
library(golubEsets)
data(Golub_Merge)
array <-
  arrayExprs(Golub_Merge,        # an ExpressionSet (abrv. eSet) object
             colBy = "ALL.AML",  # column name with class labels
             include = list("AML", "ALL"))
Third, we can make an ExprsArray object manually using the new function. To accomplish this, we need to define the four slots which comprise an ExprsArray object. The @exprs slot must contain a feature value matrix with subject names as columns. The @annot slot must contain an annotation data.frame with subject names as rows. It is critical to make sure that the subject names appear in the same order across both data structures. The two additional slots, @preFilter and @reductionModel, pass along the feature selection history while deploying built machines. When constructing an ExprsArray object manually, we should set both of these to NULL.
array <-
  new("ExprsArray",
      exprs = some.expression.matrix,       # features as rows, subjects as columns
      annot = some.annotation.data.frame,   # subjects as rows, annotations as columns
      preFilter = NULL,                     # no feature selection history yet
      reductionModel = NULL)                # no dimension reduction history yet
We will routinely make use of the Golub data for the remainder of this tutorial. However, we wish to note here that both GDS and GSE data files (from the NCBI GEO repository) easily convert to eSet objects using the GDS2eSet and GSE2eSet functions from the GEOquery and exprso packages, respectively. For more information on the import function described in this section, we refer the user to the package documentation, ?arrayExprs.
For quickly subsetting ExprsArray objects by subject annotations, we provide methods for the [ and $ operators that access the @annot slot directly. Below, we use these operators to retrieve the “ALL.AML” annotation for the subjects assigned the “Case” class label.
array[array$defineCase == "Case", "ALL.AML"]
## [1] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
## [18] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
## [35] ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL
## Levels: ALL AML
The modSubset function and the subset method offer two alternatives for subsetting data.
modSubset(array, colBy = "defineCase", include = "Case")
## ##Number of classes: 1
## @exprs summary: 7129 features by 47 subjects
## @annot summary: 47 subjects by 12 annotations
## @preFilter summary:
## @reductionModel summary:
subset(array, subset = array$defineCase == "Case")
## ##Number of classes: 1
## @exprs summary: 7129 features by 47 subjects
## @annot summary: 47 subjects by 12 annotations
## @preFilter summary:
## @reductionModel summary:
When performing classification, an investigator will typically withhold some percentage of the data to use later when assessing classifier performance, effectively splitting the data into two. The first dataset, called the training set, gets used to train the model, while the other, called the external validation or test set, gets used to evaluate the model. This package offers two convenience functions for splitting the data, splitSample and splitStratify. The former builds the training set based on simple random sampling (with or without replacement), assigning the remaining subjects to the test set. The latter builds the training set using stratified random sampling. These functions both return a list of two ExprsArray objects, corresponding to the training set and test set, respectively. Below, we use the splitStratify function to build the training and test sets through a stratified random sample across the dichotomous (binary) classification annotation.
arrays <-
  splitStratify(array,
                percent.include = 67,
                colBy = NULL)
##
## Pre-stratification table:
## df
## Case Control
## 47 25
##
## Weighted stratification results:
##
## Case Control
## 17 17
array.train <- arrays[[1]]
Above, all subjects not included in the training set (based on percent.include) will automatically get assigned to the test set. When using splitStratify on a dataset with an unequal number of annotated subjects (e.g., more cases than controls), the resultant test set may contain relative annotation frequencies that differ from the training set. We can fix this so-called “imbalance”, at the cost of reducing sample size, by performing splitStratify a second time. Now, we will use the test set as the input and let percent.include = 100 (keeping the other parameters the same). This will split the test set such that the new “training set” (i.e., slot 1) now contains the balanced test set and the new “test set” (i.e., slot 2) now contains the “spillover” due to balancing.
balance <-
  splitStratify(arrays[[2]],
                percent.include = 100,
                colBy = NULL)
##
## Pre-stratification table:
## df
## Case Control
## 30 8
##
## Weighted stratification results:
##
## Case Control
## 8 8
array.test <- balance[[1]]
For more information, we refer the user to the package documentation, ?split.
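For reference, the simple random sampling alternative mentioned above follows the same call pattern. The snippet below is only a sketch, with argument names mirroring the ctrlSplitSet call used later in this vignette.
# Sketch only: simple random sampling of 67% of subjects into the training
# set (without replacement); the remainder become the test set.
arrays.srs <- splitSample(array, percent.include = 67, replace = FALSE)
array.train.srs <- arrays.srs[[1]]
array.test.srs <- arrays.srs[[2]]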
Considering the high-dimensionality of most genomic datasets, it is prudent and often necessary to prioritize which features to include during classifier construction. There are myriad ways to perform feature selection, and this package provides functions for some of the most frequently used methods. Each function works as a self-contained wrapper that (1) pre-processes the ExprsArray input, (2) performs the feature selection, and (3) returns an ExprsArray output with an updated feature selection history.
These feature selection histories get passed along at every step of the way until they eventually get used to pre-process an unlabeled dataset during classifier deployment (i.e., prediction). In the spirit of open source programming, we encourage users to submit their own feature selection functions, modeled after those provided in this library.
The first of these feature selection functions, fsStats, performs some of the most basic feature selection methods: those rooted in simple statistical tests. Specifically, this function will rank features based on either the Student’s t-test or the Kolmogorov-Smirnov test. Below, we rank features according to the results of a Student’s t-test.
array.train <-
  fsStats(array.train, top = 0, how = "t.test")
On first exposure, we acknowledge that the argument top might seem confusing. Therefore, we wish to emphasize the role of this argument: it specifies either the names or the number of features to supply to the feature selection method. It does not refer to what the user intends to retrieve from the feature selection method.
When calling the first feature selection method (or the first build method, if skipping feature selection), a numeric top argument will select a “top ranked” feature set according to their default order in the ExprsArray input. Then, because each feature selection method returns an ExprsArray object with the features implicitly (re-)ranked, all subsequent numeric top arguments will select a “top ranked” feature set according to the results of the previous feature selection method. For example, the third feature selection call draws its top features from the ranking produced by the second. The user may deploy, in tandem, any number of these functions in whatever order they choose, limited only by computational power and imagination.
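To illustrate, consider the hypothetical chain below. The second call draws its input from the ranking produced by the first; we assume here that fsStats accepts how = "ks.test" for its Kolmogorov-Smirnov option.
# Sketch only: chained feature selection, where each numeric 'top' refers to
# the ranking produced by the previous step.
fs1 <- fsStats(array.train, top = 0, how = "t.test")   # rank all features by t-test
fs2 <- fsStats(fs1, top = 100, how = "ks.test")        # re-rank only the top 100 from fs1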
Another included feature selection method, fsPrcomp, performs dimension reduction by way of principal components analysis (PCA). The dimension reduction model generated during this function call gets saved in the feature selection history along with the components themselves. Below, we subject the top 50 features, as prioritized by fsStats, to PCA.
array.train <-
  fsPrcomp(array.train, top = 50)
The other feature selection methods included in this package all follow the same use pattern. Below, we plot the first three components of the training set in 3-dimensional space.
plot(array.train)
## Setting col to #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #FF0000FF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF #00FFFFFF (default behavior, override explicitly)...
## Setting pch to 19 (default behavior, override explicitly)...
For more information, we refer the user to the package documentation, ?fs.
This package exposes several methods for supervised machine learning, including wrapper functions that implement support vector machines, artificial neural networks, random forests, and more. These functions require an ExprsArray object as input and return an ExprsModel object as output. This ExprsModel object contains the feature selection history that led up to classifier construction as well as the classifier itself. Below, we build an artificial neural network with five nodes in the hidden layer using the top 10 components from the training set above.
mach <-
  buildANN(array.train, top = 10, size = 5)
## Setting range to 3.74692631724907e-05 (default behavior, override explicitly)...
## Setting decay to 0.5 (default behavior, override explicitly)...
## Setting maxit to 1000 (default behavior, override explicitly)...
## # weights: 61
## initial value 28.890603
## iter 10 value 10.990562
## iter 20 value 6.590649
## iter 30 value 5.849618
## iter 40 value 5.832830
## iter 50 value 5.689080
## iter 60 value 5.679698
## iter 70 value 5.679343
## iter 80 value 5.679234
## iter 90 value 5.679203
## iter 100 value 5.679192
## iter 110 value 5.679186
## iter 120 value 5.679182
## iter 130 value 5.679168
## iter 140 value 5.679159
## final value 5.679159
## converged
For more information, we refer the user to the package documentation, ?build.
We deploy an ExprsModel object using predict. This function returns an ExprsPredict object containing the prediction results in three forms: class predictions, probabilities, and decision boundary values. The probability and decision boundary values relate to one another by a logistic transformation. The prediction (@pred) slot converts these metrics into a single “all-or-nothing” class label assignment.
Another function, calcStats, allows us to compare the prediction results against the actual class labels. The aucSkip argument specifies whether to calculate the area under the receiver operating characteristic (ROC) curve. Note, however, that performance metrics calculated using the ROC curve may differ from those calculated using a confusion matrix because the former may adjust the discrimination threshold to optimize sensitivity and specificity. The discrimination threshold is automatically chosen as the point along the ROC curve which minimizes the Euclidean distance from (0, 1). Below, we deploy a classifier on the test set, then use the result to calculate classifier performance.
pred <-
  predict(mach, array.test)
## Individual classifier performance:
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
## actual
## predicted Control Case
## Control 6 1
## Case 2 7
## acc sens spec
## 1 0.8125 0.875 0.75
calcStats(pred)
## Calculating accuracy using ROCR based on prediction probabilities...
## acc sens spec auc
## 1 0.875 0.875 0.875 0.859375
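As a quick sketch, we can also tabulate the “all-or-nothing” class assignments stored in the @pred slot of the ExprsPredict object.
# Sketch: inspect the final class label assignments from the @pred slot
table(pred@pred)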
For more information, we refer the user to the package documentation, ?'exprso-predict'.
This package includes several functions named with the prefix pl. These “pipeline” functions exist to help consolidate high-throughput analyses. In other words, they wrap repetitive tasks into a single operation. Such tasks include high-throughput parameter searches as well as some elaborate cross-validation designs. As we will see, some of these pl functions even have other pl functions embedded within them. For example, the function plGrid contains the function plCV for managing simple v-fold and leave-one-out cross-validation.
When constructing a classifier using a build method, we can only specify one set of parameters at a time. However, we often want to test models across a vast range of parameters. For this task, we provide the plGrid function. This function builds and deploys a model for each combination of all provided arguments. For example, calling plGrid with the arguments how = "buildSVM", top = c(3, 5, 10), cost = 10^(-3:3), and kernel = c("linear", "radial") will yield 42 classifiers.
We note here that this function only accepts one how per run. To analyze the results of multiple build parameter searches jointly, combine the results of multiple plGrid function calls using ?conjoin. We will also note that plGrid does not execute any data splitting or feature selection, both of which the user may perform beforehand. However, plGrid does allow the user to specify multiple classifier sizes by providing a numeric vector as the top argument.
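For instance, a hypothetical sketch of joining two separate grid searches with conjoin might look like the following; the parameter values here are illustrative only.
# Sketch only: two separate plGrid runs, one per build method, joined into
# a single ExprsPipeline object for joint analysis.
gs.svm <- plGrid(array.train = array.train, array.valid = array.test,
                 how = "buildSVM", top = 5, fold = 0, kernel = "linear", cost = 1)
gs.ann <- plGrid(array.train = array.train, array.valid = array.test,
                 how = "buildANN", top = 5, fold = 0, size = 5)
gs.all <- conjoin(gs.svm, gs.ann)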
The plGrid function can also calculate v-fold cross-validation accuracy at each step of the parameter search (toggled by supplying a non-NULL argument to fold). We emphasize, however, that the cross-validation method embedded within plGrid (i.e., plCV) does not re-select features with each fold, which may lead to overly optimistic measures of classifier performance in the setting of prior feature selection, as is the case in the example here.
Below, we will run through a few different support vector machine (SVM) builds, calculating the leave-one-out cross-validation accuracy (by setting fold = 0) at each step.
gs <-
  plGrid(array.train = array.train,
         array.valid = array.test,
         top = c(5, 10),
         how = "buildSVM",
         fold = 0,
         kernel = "linear",
         cost = 10^(-3:3))
The returned object contains two slots, @summary and @machs, which store the performance summary and corresponding ExprsModel objects, respectively. The performance summary contains columns detailing the parameters used to build each machine along with performance metrics for the training set (and test set, if provided). Columns named with “train” describe training set performances. Columns named with “valid” describe test set performances. The column “train.plCV” contains the cross-validation accuracy, if performed. The returned ExprsPipeline object also contains an ExprsModel object for each entry in the performance summary.
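As a quick sketch, we can inspect these two slots directly with the standard S4 accessor.
# Sketch: view the performance summary and the first stored model
head(gs@summary)
gs@machs[[1]]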
We can subset ExprsPipeline objects based on the performance summary (i.e., @summary) by using the [ and $ operators, similar to how ExprsArray object subsetting works. Below, we extract the column that contains the cross-validation accuracies using [ and $.
gs[, "train.plCV"]
## [1] 1.0000000 1.0000000 0.3823529 0.5294118 1.0000000 0.9705882 1.0000000
## [8] 1.0000000 0.9705882 1.0000000 0.9705882 1.0000000 0.9705882 1.0000000
gs$train.plCV
## [1] 1.0000000 1.0000000 0.3823529 0.5294118 1.0000000 0.9705882 1.0000000
## [8] 1.0000000 0.9705882 1.0000000 0.9705882 1.0000000 0.9705882 1.0000000
As above, the pipeSubset function and the subset method offer two alternatives for subsetting data.
pipeSubset(gs, colBy = "cost", include = 1)
## Accuracy summary (complete summary stored in @summary slot):
##
## build top kernel cost fold train.plCV train.acc train.sens train.spec
## 7 buildSVM 5 linear 1 0 1 1 1 1
## 8 buildSVM 10 linear 1 0 1 1 1 1
## train.auc valid.acc valid.sens valid.spec valid.auc
## 7 1 0.875 0.875 0.875 0.859375
## 8 1 0.875 0.875 0.875 0.921875
##
## Machine summary (all machines stored in @machs slot):
##
## ##Number of classes: 2
## @preFilter summary: 7129 50 5
## @reductionModel summary: logical prcomp logical
## @mach class: svm.formula svm
## ...
## ##Number of classes: 2
## @preFilter summary: 7129 50 10
## @reductionModel summary: logical prcomp logical
## @mach class: svm.formula svm
subset(gs, subset = gs$cost == 1)
## Accuracy summary (complete summary stored in @summary slot):
##
## build top kernel cost fold train.plCV train.acc train.sens train.spec
## 7 buildSVM 5 linear 1 0 1 1 1 1
## 8 buildSVM 10 linear 1 0 1 1 1 1
## train.auc valid.acc valid.sens valid.spec valid.auc
## 7 1 0.875 0.875 0.875 0.859375
## 8 1 0.875 0.875 0.875 0.921875
##
## Machine summary (all machines stored in @machs slot):
##
## ##Number of classes: 2
## @preFilter summary: 7129 50 5
## @reductionModel summary: logical prcomp logical
## @mach class: svm.formula svm
## ...
## ##Number of classes: 2
## @preFilter summary: 7129 50 10
## @reductionModel summary: logical prcomp logical
## @mach class: svm.formula svm
The exprso package also provides a means by which to perform elaborate cross-validation, including Monte Carlo style and 2-layer “nested” cross-validation. Analogous to how plGrid manages multiple build and predict tasks, these pipelines (i.e., plMonteCarlo and plNested) effectively manage multiple plGrid tasks. In order to organize the sheer number of arguments necessary to execute these functions, we have implemented argument handler functions. These argument handlers (i.e., ctrlSplitSet, ctrlFeatureSelect, and ctrlGridSearch) handle the data splitting, feature selection, and grid searching arguments, respectively.
In simplest terms, plMonteCarlo and plNested use a single training set to calculate classifier performances on a withheld internal validation set. This internal validation set serves as a kind of proxy for a statistically independent test set. The main difference between plMonteCarlo and plNested stems from how the internal validation set gets constructed. On one hand, the plMonteCarlo method uses the ctrlSplitSet argument handler to split the training set into a training subset and an internal validation set with each bootstrap. On the other hand, the plNested method splits the training set into v folds, treating each fold as an internal validation set while treating the subjects outside that fold as the training subset.
For the sake of simplicity, we will call any performance measured on an internal validation set the outer-loop cross-validation performance. Then, we will call any cross-validation accuracy measured using the training subset (i.e., via plGrid) the inner-loop cross-validation performance. In the performance summaries of the ExprsPipeline objects returned by plMonteCarlo and plNested, columns named with “train” describe training subset performances while columns named with “valid” describe internal validation set performances.
Although the inner-loop cross-validation performances (i.e., via plCV) can still over-estimate classifier performance owing to prior feature selection, the outer-loop cross-validation performances derive from classifiers that have undergone feature selection anew with each bootstrap or fold. However, we emphasize here that performing feature selection on a training set prior to the use of plMonteCarlo or plNested can still result in overly optimistic outer-loop cross-validation performances.
In the example below, we perform five iterations of plMonteCarlo using the original training set as it existed before it underwent any feature selection (i.e., the first slot of the object arrays). With each iteration, we (1) sample the subjects randomly through bagging (i.e., random sampling with replacement), (2) perform feature selection using the Student’s t-test, and then (3) execute a grid-search across multiple support vector machine (SVM) parameters and classifier sizes. In this framework, the user could instead perform any number of feature selection tasks simply by supplying a list of multiple ctrlFeatureSelect argument handlers to the ctrlFS argument below.
ss <-
  ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <-
  ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs <-
  ctrlGridSearch(func = "plGrid",
                 how = "buildSVM",
                 top = c(10, 25),
                 kernel = "linear",
                 cost = 10^(-3:3),
                 fold = 10)
boot <-
  plMonteCarlo(arrays[[1]],
               B = 5,
               ctrlSS = ss,
               ctrlFS = fs,
               ctrlGS = gs)
Next, we reduce the results of plMonteCarlo to a single performance metric by feeding the returned ExprsPipeline object through calcMonteCarlo. Note that this helper function will fail unless plGrid has called plCV during the parameter grid-search.
calcMonteCarlo(boot, colBy = "valid.auc")
## Retrieving best accuracy for boot 1 ...
## Retrieving best accuracy for boot 2 ...
## Retrieving best accuracy for boot 3 ...
## Retrieving best accuracy for boot 4 ...
## Retrieving best accuracy for boot 5 ...
## Averaging best accuracies across all boots...
## [1] 0.982716
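For completeness, the nested variant follows a similar pattern. The snippet below is a hypothetical sketch only: we assume plNested accepts a fold count alongside the same ctrlFS and ctrlGS handlers defined above (consult ?plNested for the exact arguments).
# Hypothetical sketch only: 10-fold nested cross-validation reusing the
# feature selection and grid-search handlers from the plMonteCarlo example.
nest <- plNested(arrays[[1]],
                 fold = 10,
                 ctrlFS = fs,
                 ctrlGS = gs)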
This package provides two ways to build ensemble classifiers. The first involves manually combining multiple ExprsModel objects together through the function buildEnsemble (or ?conjoin). The second involves an orchestrated manipulation of an ExprsPipeline object through the pipeFilter function.
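A minimal sketch of the first (manual) approach, combining the artificial neural network built earlier with a hypothetical second model; the parameter values are illustrative only.
# Sketch only: build a second model, then pool it with 'mach' into one ensemble
mach.svm <- buildSVM(array.train, top = 10, kernel = "linear", cost = 1)
ens.manual <- buildEnsemble(mach, mach.svm)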
This latter approach filters an ExprsPipeline object in (up to) three steps. First, a threshold filter gets imposed, whereby any model with a performance less than the threshold filter, how, gets excluded. Second, a ceiling filter gets imposed, whereby any model with a performance greater than the ceiling filter, gate, gets excluded. Third, an arbitrary subset occurs, whereby the top N models in the ExprsPipeline object get selected based on the argument top. In the case that the @summary slot contains the column “boot” (e.g., in the results of plMonteCarlo), pipeFilter selects the top N models for each unique bootstrap. The user may skip any one of these three filter steps by setting the respective argument to 0.
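A hedged sketch of the filter itself appears below; the cutoff values are arbitrary, and we assume pipeFilter accepts the same colBy argument used by buildEnsemble in the next example.
# Sketch only: keep models with a validation AUC of at least 0.8 (how),
# drop any above 0.99 (gate), then keep the single best model per bootstrap (top)
filtered <- pipeFilter(boot, colBy = "valid.auc", how = 0.8, gate = 0.99, top = 1)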
When calling the buildEnsemble method for an ExprsPipeline object, any classifiers remaining after the pipeFilter step will get assembled into a single ensemble classifier. Ensemble classifiers get stored as an ExprsEnsemble object, which is nothing more than an object-oriented container for a list of multiple ExprsModel objects.
In the example below, we will build an ensemble using the single best classifier from each plMonteCarlo bootstrap, and then deploy that ensemble on the withheld test set from above.
ens <- buildEnsemble(boot, top = 1, colBy = "valid.auc")
pred <- predict(ens, array.test, how = "majority")
## Arguments not provided in an ROCR AUC format. Calculating accuracy outside of ROCR...
## Classification confusion table:
## actual
## predicted Control Case
## Control 0 0
## Case 8 8
## acc sens spec
## 1 0.5 1 0
Owing to how the pipeFilter function handles ExprsPipeline objects that contain a “boot” column in the performance summary (i.e., @summary), we include the pipeUnboot function to rename this “boot” column to “unboot”. To learn more about how ExprsEnsemble predicts class labels, we refer the user to the package documentation, ?'exprso-predict'. In addition, we encourage the user to visit the package documentation, ?'ExprsPipeline-class' and ?'ExprsEnsemble-class', for more information on pipelines and ensembles, respectively.
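As a brief sketch, renaming the column lets the top filter apply across the pooled summary rather than within each bootstrap; the top value here is illustrative only.
# Sketch only: pool all bootstraps, then take the top 3 models overall
boot.pooled <- pipeUnboot(boot)
ens.pooled <- buildEnsemble(boot.pooled, top = 3, colBy = "valid.auc")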
We conclude this vignette by alerting the user that the exprso package also includes a framework for performing multi-class classification in an automated manner. These methods use the “1-vs-all” approach to multi-class classification, whereby each individual class label takes a turn getting treated as the positive class label in a dichotomous (binary) scheme. Then, the results of each iteration get integrated into a single construct. To learn more about multi-class classification, we refer the user to the package documentation, ?arrayExprs and ?doMulti, as well as the companion vignette, “Advanced Topics for the exprso package”.
Thank you for your interest in exprso. Although we have made tremendous progress in formalizing this library in a reliable package framework, some of the tools included here may change. To the best of our knowledge, we have followed machine learning “best practices” when developing this software, but if you know better than us, please let us know! File any and all issues at GitHub. In addition, we always welcome suggestions for new tools that we could include in future releases. Happy learning!