Advanced Topics for the exprso Package

Thomas Quinn

2016-12-23

Foreword

Although this vignette contains a lot of exciting stuff, to get the most out of it, we recommend first reading the companion vignette, “An Introduction to the exprso Package”, which introduces many of the core features upon which this package depends. Then, once you become familiar with the fundamental modules of this package, read this vignette to learn more about the true power and flexibility of exprso.

Tidy learning

In this first section, we will show how you can combine tidy principles with the exprso package to execute machine learning even faster than before. We do this using the magrittr package, which provides two key functions: %>% and %T>%. Both of these functions pass the result of a prior function call to the first argument of the next function call, an operation known as piping. The tee operator %T>% differs from %>% in that the function it calls "branches out" of the pipeline: that function's result is discarded, and %T>% instead passes along the left-hand value (i.e., the result from the previous step). This makes %T>% useful for "side-chain" tasks like plotting.
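To make the distinction concrete, here is a minimal, self-contained illustration using only base R and magrittr (not exprso-specific):

library(magrittr)

# %>% passes each result forward: sqrt(c(1, 4, 9)) then sum(...) returns 6
c(1, 4, 9) %>% sqrt %>% sum

# %T>% calls hist() for its side effect (the plot), then passes the original
# left-hand vector -- not the return value of hist() -- on to summary()
rnorm(100) %T>% hist %>% summary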

To begin, we load the necessary packages, including the golubEsets data package used extensively in the introductory vignette.

library(exprso)
library(magrittr)
library(golubEsets)

Now, we will see how piping can help save you time, all while making your code more readable. Here, we use the %>% function to pre-process the Golub dataset and split it into a training and test set. Since the data object forks at the level of the split method (yielding two ExprsArray objects from one), it makes sense to break the pipe cascade there.

splitSets <- data(Golub_Merge) %>% get %>%
  arrayExprs(colBy = "ALL.AML", include = list("ALL", "AML")) %>%
  modFilter(20, 16000, 500, 5) %>% modTransform %>% modNormalize %>%
  splitSample(percent.include = 67)

Next, we use the %>% function to pull the training set from the split method result (via the trainingSet function) and pipe it through a chain of feature selection and classifier construction methods. Similar to trainingSet, the testSet function (or, equivalently, the validationSet function) will extract the test set from the split method result.

pred <- trainingSet(splitSets) %>%
  fsStats(how = "t.test") %>%
  fsPrcomp(top = 10) %T>%
  plot(c = 0) %>%
  buildSVM %>%
  predict(testSet(splitSets)) %T>%
  calcStats

Piping can expedite ensemble classifier construction as well. Here, we use the %>% function in conjunction with plMonteCarlo to (a) split the training set across 10 bootstraps, (b) perform recursive feature elimination on each training subset, (c) construct an LDA classifier, and (d) deploy the classifier on an internal validation set. Then, we select the three best-performing classifiers, regardless of bootstrap origin, by passing the results through pipeUnboot and pipeFilter (see ?pipeUnboot and ?pipeFilter to learn more about how the "boot" column changes pipeFilter behavior). Last, we build a classifier ensemble and deploy it on the test set. For code clarity, we call the argument handlers ctrlSplitSet, ctrlFeatureSelect, and ctrlGridSearch outside of the pipe cascade.

ss <- ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <- ctrlFeatureSelect(func = "fsPathClassRFE", top = 0)
gs <- ctrlGridSearch(func = "plGrid", top = 0, how = "buildLDA")

pred <- trainingSet(splitSets) %>%
  plMonteCarlo(B = 10, ctrlSS = ss, ctrlFS = fs, ctrlGS = gs) %>%
  pipeUnboot %>%
  pipeFilter(colBy = "valid.auc", top = 3) %>%
  buildEnsemble %>%
  predict(testSet(splitSets)) %T>%
  calcStats

Clustering before classifying

The exprso package also includes modCluster, a wrapper function for quickly clustering subjects based on feature data. In terms of the expected arguments, this function works analogously to the feature selection and classification methods, using the familiar how argument to select one of seven clustering algorithms. The modCluster function returns an ExprsArray object with an updated @annot slot that contains the clustering results; it would typically run after splitting but before feature selection.
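As a quick check of what modCluster adds, the following sketch reuses the splitSets object from the previous section and assumes the cluster labels land in an @annot column named "cluster", as the modSubset calls below imply:

clustered <- trainingSet(splitSets) %>%
  modCluster(top = 0, how = "hclust", k = 2)
table(clustered@annot$cluster) # number of subjects assigned to each cluster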

Since there are so many ways you could cluster within a classification pipeline, we will show here the simplest example and leave it up to you to experiment on your own. Who knows, you might find that perfect way of subdividing your data that maximizes classification accuracy! In this example, we use the %>% function to carve the training set into two clusters, and then use just one of the clustered groups to build an artificial neural network. Although this example is a bit absurd since it sacrifices power so blatantly, it should help illustrate how modCluster works in the context of other modules.

pred <- trainingSet(splitSets) %>%
  modCluster(top = 0, how = "hclust", k = 2) %>%
  modSubset(colBy = "cluster", include = 1) %>%
  fsMrmre(top = 0) %>%
  buildANN(top = 20, size = 5, decay = 1, maxit = 100) %>%
  predict(testSet(splitSets)) %T>%
  calcStats

Next, we show how you can cluster by cases only, using a small trick that works just as well for any data annotation. In this example, we use the conjoin function to undo the subset. Note, however, that using conjoin after feature selection will throw an error.

clusteredCases <- trainingSet(splitSets) %>%
  modSubset(colBy = "defineCase", include = "Case") %>%
  modCluster %>%
  modSubset(colBy = "cluster", include = 1) %>%
  conjoin(trainingSet(splitSets) %>%
            modSubset(colBy = "defineCase", include = "Control"))

Importing GSE files

The NCBI GEO hosts files in GSE or GDS format, the latter of which exists as a curated version of the former. GDS data files easily convert to an ExpressionSet (abbreviated eSet) object using the GDS2eSet function available from the GEOquery package. However, not all GSE data files have a corresponding GDS data file available. To convert GSE data files into eSet objects, exprso provides the GSE2eSet function. Below, we download a GSE file, convert it to an eSet object, and inspect its phenotype data.

data.gse <- GEOquery::getGEO("GSE5847", GSEMatrix = FALSE)
data.eset <- GSE2eSet(data.gse)
data.eset@phenoData@data
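From there, the eSet can pass to arrayExprs just as the Golub data did above, yielding an ExprsArray object. The colBy column and include labels below are hypothetical placeholders; substitute an annotation column and values actually present in data.eset@phenoData@data.

# Placeholder annotation column and labels -- replace with ones found in data.eset
array.gse <- data.eset %>%
  arrayExprs(colBy = "some_annotation", include = list("GroupA", "GroupB"))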

Deep learning

Deep learning in exprso does not differ much from the other approaches to classification. However, supplying arguments to the deep learning build wrapper (via h2o) does get a little cumbersome.

pred <- trainingSet(splitSets) %>%
  buildDNN(top = 0,
           activation = "TanhWithDropout", # or 'Tanh'
           input_dropout_ratio = 0.2, # % of inputs dropout
           hidden_dropout_ratios = c(0.5,0.5,0.5), # % for nodes dropout
           balance_classes = TRUE,
           hidden = c(50,50,50), # three layers of 50 nodes
           epochs = 100) %>%
  predict(testSet(splitSets)) %T>%
  calcStats

One important difference with buildDNN is that you must manually clear old classification models out of RAM. Unlike other models, the ExprsModel object does not actually store the deep neural net; rather, it just holds a "link" to the classifier. The actual classifier is stored outside of R to allow for parallelization and improved performance.
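If you want to confirm what the h2o cluster is holding before you clear it, h2o.ls() lists the keys of the frames and models currently stored there (an optional check, not required by exprso):

h2o::h2o.ls() # keys of the objects currently held by the h2o cluster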

# Frees RAM for more learning
h2o::h2o.shutdown()

When embedding buildDNN within a grid-search, we run into the difficulty that most buildDNN arguments require a numeric vector as input. These vector inputs typically correspond to a unique value for each layer of the deep neural net. We can provide plGrid a vector argument by wrapping it in a list. Take note that this approach of providing argument vectors in a list would also work for other arguments (e.g., top).

pl <- trainingSet(splitSets) %>%
  plGrid(array.valid = testSet(splitSets), top = 0, how = "buildDNN", fold = NULL,
         activation = "TanhWithDropout", # or 'Tanh'
         input_dropout_ratio = 0.2, # % of inputs dropout
         hidden_dropout_ratios = list(c(0.5,0.5,0.5)), # % for nodes dropout
         balance_classes = TRUE,
         hidden = list(c(50,50,50)), # three layers of 50 nodes
         epochs = 100)

Below, we show a more elaborate deep learning grid-search. For details on these arguments, see ?h2o::h2o.deeplearning.

pl <- trainingSet(splitSets) %>%
  plGrid(array.valid = testSet(splitSets), top = 0, how = "buildDNN", fold = NULL,
         activation = c("Rectifier",
                        "TanhWithDropout"), # or 'Tanh'
         input_dropout_ratio = c(0.2,
                                 0.5,
                                 0.8), # % of inputs dropout
         hidden_dropout_ratios = list(c(0.5,0.5,0.5),
                                      c(0.2,0.2,0.2)), # % for nodes dropout
         balance_classes = TRUE,
         hidden = list(c(50,50,50),
                       c(100,100,100),
                       c(200,200,200)), # candidate hidden-layer architectures
         epochs = c(100))

Keep in mind that deep learning is a very RAM-hungry task. If you're not careful, you'll run out of RAM and trigger an error. Remember to call h2o::h2o.shutdown() whenever you finish deep learning!

Perfect cross-validation

The “perfect” cross-validation pipeline would have two layers of cross-validation such that the outer-layer divides the data without feature selection while the inner-layer divides the data with feature selection. We can achieve this in exprso by embedding a plNested pipeline within another plNested pipeline.

If you use this approach, you should not need a test set as long as you calculate classification accuracy in a way that respects the independence of each fold. In other words, if you opt out of a test set, you must never let a validation set accuracy guide the selection of which training sets to use when calculating the final classification accuracy.

For illustrative purposes, we perform “perfect” cross-validation using support vector machines built across a single set of parameters. Extending this pipeline to a larger parameter grid-search will require a cautious analysis of the results.

fs.inner <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs.inner <- ctrlGridSearch(func = "plGrid", top = 3, how = "buildSVM", fold = NULL)

fs.outer <- ctrlFeatureSelect(func = "fsNULL", top = 0)
gs.outer <- ctrlGridSearch(func = "plNested", fold = 2, ctrlFS = fs.inner, ctrlGS = gs.inner)

pl <- data(Golub_Merge) %>% get %>%
  arrayExprs(colBy = "ALL.AML", include = list("ALL", "AML")) %>%
  modFilter(20, 16000, 500, 5) %>% modTransform %>% modNormalize %>%
  plNested(fold = 2, ctrlFS = fs.outer, ctrlGS = gs.outer)
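One way to summarize the outer folds of this pipeline is with calcNested, used the same way as in the multi-class example later in this vignette (a sketch):

calcNested(pl, colBy = "valid.acc") # summarize validation accuracy across the outer folds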

Cross-validation variants

Typically, we summarize rounds of cross-validation using accuracy. However, we could conceive of situations where we might want to emphasize sensitivity over specificity (or vice versa). Below, we show how we can use plNested in lieu of plCV to select plMonteCarlo bootstraps based on sensitivity.

In this example, each iteration of plMonteCarlo will split the dataset, then call plNested on the training subset. Next, plNested will manage v-fold cross-validation, splitting the data into v equal folds. Finally, each fold will undergo a grid-search according to plGrid. Since we have chosen to use plNested in lieu of plCV, we disable plCV by setting the plGrid argument fold = NULL. Note that, as above, we only perform feature selection within the inner-layer. This ensures that the outer-layer serves as a truly independent validation set.

fs.inner <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs.inner <- ctrlGridSearch(func = "plGrid", top = c(2, 3, 4), how = "buildSVM", fold = NULL)

ss.outer <- ctrlSplitSet(func = "splitStratify", percent.include = 67)
fs.outer <- ctrlFeatureSelect(func = "fsNULL", top = 0)
gs.outer <- ctrlGridSearch(func = "plNested", fold = 10, ctrlFS = fs.inner, ctrlGS = gs.inner)

pl <- data(Golub_Merge) %>% get %>%
  arrayExprs(colBy = "ALL.AML", include = list("ALL", "AML")) %>%
  modFilter(20, 16000, 500, 5) %>% modTransform %>% modNormalize %>%
  plMonteCarlo(B = 5, ctrlSS = ss.outer, ctrlFS = fs.outer, ctrlGS = gs.outer)

The resultant object now contains the necessary information to rank plMonteCarlo bootstraps based on v-fold cross-validation sensitivity or specificity. Alternatively, we could aggregate the results by selecting the best fold from each bootstrap using pipeFilter, emphasizing sensitivity over specificity with colBy. Although neither of these options mirrors plCV exactly, we hope future updates will offer more support for these kinds of analyses.

top <- pipeFilter(pl, colBy = c("valid.sens", "valid.sens", "valid.spec"), top = 1)
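For the first option, a rough sketch with base R: assuming the pipeline summary lives in a data.frame at pl@summary with "boot" and "valid.sens" columns (as the pipeUnboot and pipeFilter calls above suggest), you could rank whole bootstraps by their average validation sensitivity.

# Mean validation sensitivity per bootstrap (assumes pl@summary is a data.frame
# containing "boot" and "valid.sens" columns)
aggregate(valid.sens ~ boot, data = pl@summary, FUN = mean)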

Multi-class classification

The exprso package also contains a growing number of methods made specifically for dealing with multi-class data. For example, all of the build methods available for binary classification also work for multi-class classification. In addition, exprso contains some feature selection methods that work for binary and multi-class data alike.

Below, we use mock multi-class data to illustrate a simple multi-class classification pipeline.

splitSets <- data(arrayMulti) %>% get %>%
  splitStratify(percent.include = 67, colBy = "sex")

trainingSet(splitSets) %>%
  fsANOVA %>%
  buildNB %>% # any build method can become multi with 1-vs-all
  predict(testSet(splitSets)) %T>%
  calcStats

All the pipelines developed for binary classification work equally well for multi-class classification. However, not all feature selection methods work for multi-class data. As long as you choose a valid multi-class feature selection method, plGrid, plMonteCarlo, and plNested will work without fail.

fs <- ctrlFeatureSelect(func = "fsANOVA", top = 0)
gs <- ctrlGridSearch(func = "plGrid", top = 0, how = "buildRF", fold = 10)

pl <- trainingSet(splitSets) %>%
  plNested(fold = 2, ctrlFS = fs, ctrlGS = gs) %T>%
  calcNested(colBy = "valid.acc")
## Warning in (function (array, top, how, fold, ...) : Insufficient subjects
## for plCV v-fold cross-validation. Performing LOOCV instead.
## Warning in (function (array, top, how, fold, ...) : Insufficient subjects
## for plCV v-fold cross-validation. Performing LOOCV instead.

Note that exprso does not yet support multi-class classifier ensembles. Sorry!

testthat::expect_error(
  pl %>% buildEnsemble %>%
    predict(testSet(splitSets)) %>%
    calcStats
)

A special plGrid variant, called plGridMulti, is also available for multi-class data. This variant uses 1-vs-all feature selection instead of multi-class feature selection. In this implementation, 1-vs-all feature selection occurs just prior to 1-vs-all classifier construction. As such, each individual ExprsMachine within the ExprsModule will have its own unique feature selection history to pass on to the test set during classifier deployment. For plGridMulti, 1-vs-all feature selection is specified just as in the other pl functions, using the ctrlFeatureSelect argument handler.

fs <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")

pl <- plGridMulti(trainingSet(splitSets), testSet(splitSets), ctrlFS = fs, top = c(2, 3),
                  how = "buildSVM", kernel = c("linear", "radial"), gamma = c(.1, .2))

However, plGridMulti does not have built-in plCV support. To run plGridMulti with cross-validation, use plNested.

fs.inner <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
fs.outer <- ctrlFeatureSelect(func = "fsNULL", top = 0)
gs.outer <- ctrlGridSearch(func = "plGridMulti", ctrlFS = fs.inner, top = c(2, 3),
                     how = "buildSVM", kernel = c("linear", "radial"), gamma = c(.1, .2))

pl <- plNested(trainingSet(splitSets), fold = 2,
               ctrlFS = fs.outer, ctrlGS = gs.outer)