The TDMR Tutorial:
Tuned Data Mining in R


Wolfgang Konen, Patrick Koch

Cologne University of Applied Sciences

Last update: Nov 2012

Contents

The TDMR Tutorial: Tuned Data Mining in R

Tutorial TDMR

Overview

Installing TDMR

Lessons

Lesson 1: DM on task SONAR

Lesson 2: SPOT tuning on task SONAR

Lesson 3: “The Big Loop” on task SONAR

Lesson 4: Regression Big Loop

Lesson 5: Interactive Visualization

Lesson 6: Performance Measure Plots

Frequently Asked Questions (FAQ)

How can I rerun a certain best tuning solution and test it on the same / other data?

How can I make new predictions with a trained model?

How can I add a new tuning parameter to TDMR?

How can I add a new machine learning algorithm to TDMR?

How can I add a new tuning algorithm to TDMR?

References

 

Tutorial TDMR

Overview

The TDMR framework is written in R with the aim of facilitating the training, tuning and evaluation of data mining (DM) models. It puts special emphasis on tuning these data mining models and on simultaneously tuning certain preprocessing options.

This document (TDMR-tutorial.html)

·         describes the TDMR installation

·         shows example usages: how to use TDMR on new data mining tasks. 

·         provides a FAQ-section (frequently asked questions)

This document should be read in conjunction with TDMR-docu.html, which describes more details and software concepts of TDMR.

Both documents concentrate more on the software usage aspects of the TDMR package. For a more scientific discussion of the underlying ideas and the results obtained, the reader is referred to [Kone10a, Kone11b].

Installing TDMR

Once you have R (http://cran.r-project.org/), version > 2.14, up and running, simply install TDMR with

 

  > install.packages("TDMR");

  > library(TDMR);

 

Lessons

Lesson 1: DM on task SONAR

If you want to build a DM classification model for the SONAR data (see UCI repository or package mlbench for further info on SONAR), you write a file main_sonar.r [1]:

 

main_sonar <- function(opts=NULL, dset=NULL) {

  if (is.null(opts)) {
    opts = tdmOptsDefaultsSet();      # set initial defaults for many elements of opts. See tdmOptsDefaults.r
                                      # for the list of those elements and many explanatory comments
    opts$filename = "sonar.txt"
    opts$READ.CMD = "read.csv2(file=paste(opts$dir.data, filename, sep=\"\"), header=FALSE)"
    opts$data.title <- "Sonar Data"
  }
  opts <- tdmOptsDefaultsSet(opts);   # fill in all opts params which are not yet set (see tdmOptsDefaults.r)

  gdObj <- tdmGraAndLogInitialize(opts);   # init graphics and log file

  #===============================================
  # PART 1: READ DATA
  #===============================================
  if (is.null(dset)) {
    cat1(opts, opts$filename, ": Read data ...\n")
    dset <- tdmReadData(opts);
  }
  names(dset)[61] <- "Class"          # 60 columns V1,...,V60 with input data,
                                      # one response column "Class" with levels ["M" (metal) | "R" (rock)]

  response.variable <- "Class"        # which variable is the response variable

  # which variables are input variables (in this case all others):
  input.variables <- setdiff(names(dset), c(response.variable))

  #===============================================
  # PART 2: MODEL BUILDING AND EVALUATION
  #===============================================
  result <- tdmClassifyLoop(dset, response.variable, input.variables, opts);

  # print summary output and attach certain columns (here: y, sd.y, dset) to list result:
  result <- tdmClassifySummary(result, opts, dset);

  tdmGraAndLogFinalize(opts, gdObj);  # close graphics and log file

  result;
}

 

 

This function is invoked with

 

  result <- main_sonar();

 

or alternatively with

 

  demo(demo00sonar,ask=F);

 

In the first case, the control flow passes through the branch if (is.null(opts)); in the second case demo00sonar.r defines opts and passes it to main_sonar(). In either case, tdmOptsDefaultsSet() constructs a default list opts. You only need to specify those things which differ from the defaults of tdmOptsDefaultsSet(): in this case the filename of the SONAR dataset. Since you do not specify anything else, an RF (Random Forest) model will be built and the dataset will be divided into a training part (90%) and a test part (10%), based on opts$TST.kind="rand", opts$TST.testFrac=0.1. Since you do not specify anything from the opts$SRF block, the default SRF variable ranking is used (opts$SRF.kind="xperc", opts$SRF.Xperc=0.95). This means that the most important columns containing about 95% of the overall importance will be selected.
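
For illustration, a demo00sonar.r-style caller might build opts itself and hand it to main_sonar(). The following sketch simply restates the defaults discussed above as explicit settings; the actual content of demo00sonar.r may differ in detail:

  opts <- tdmOptsDefaultsSet();            # start from the TDMR defaults
  opts$filename   <- "sonar.txt"
  opts$data.title <- "Sonar Data"
  opts$READ.CMD   <- "read.csv2(file=paste(opts$dir.data, filename, sep=\"\"), header=FALSE)"
  opts$TST.kind     <- "rand";             # random division into training and test part ...
  opts$TST.testFrac <- 0.1;                # ... with 10% test data (these are the defaults)
  opts$SRF.kind  <- "xperc";               # SRF variable ranking: keep the most important
  opts$SRF.Xperc <- 0.95;                  # columns carrying about 95% of the importance
  result <- main_sonar(opts);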

You need to specify in main_sonar.r which column of dset is the response variable (classification target) and which columns are used as inputs (in this case all others, because the SONAR dataset has no ID columns or otherwise irrelevant columns).

Function tdmClassifyLoop() is then started; it builds an RF model using the training data and evaluates it on training and validation data.

The object  result of class TDMclassifier is explained in more detail in Table 3.

 

Some of the graphics output generated by tdmClassify:

Figure 1: RF-based variable importance for task sonar.

 

Figure 2: True/false classification for the two classes M and R of task sonar.

 

 

The two plots in Figure 1 show the RF-based importance; MeanDecreaseAccuracy, which has V9, V11 and V12 as the most important variables, is the more reliable of the two measures. The right plot in Figure 2 shows the true/false classifications on the test set (which is rather small here, so the results are not very reliable; a more reliable test-set classification would be obtained with cross validation (CV)).

 

Lesson 2: SPOT tuning on task SONAR

If you want to do a SPOT tuning on task SONAR, you should follow the steps described in TDMR Workflow, Level 2, and create the three small files sonar_01.conf, sonar_01.apd and sonar_01.roi. Their content may look, for example, like this:

sonar_01.conf:

alg.func = "tdmStartSpot"

alg.resultColumn = "Y"

alg.seed = 1235

 

io.apdFileName = "sonar_01.apd"

io.roiFileName = "sonar_01.roi"

io.verbosity = 3;      

auto.loop.steps = 50;       # number of SPOT’s sequential generations

auto.loop.nevals = 100;   # concurrently, max number of algo evaluations may be specified

 

init.design.func = "spotCreateDesignLhd";

init.design.size = 10;       # number of initial design points

init.design.repeats = 1;   # number of initial repeats

 

seq.merge.func <- mean;

seq.design.size = 100;

seq.design.retries = 15;

seq.design.maxRepeats = 2;

seq.design.oldBest.size <- 1;

seq.design.new.size <- 3;

seq.predictionModel.func = "spotPredictRandomForest";

 

report.func = "spotReportSens"

 

sonar_01.apd:

opts = tdmOptsDefaultsSet();     # set initial defaults for many elements of opts.

opts$filename = "sonar.txt"

opts$data.title <- "Sonar Data"

opts$READ.CMD = "read.csv2(file=paste(opts$dir.txt, filename, sep=\"\"), dec=\".\", sep=\",\", nrow=opts$READ.NROW,header=F)"

opts$RF.mtry = 4

opts$NRUN =  1          # how many runs with different train & test samples  - or -

                                    # how many CV-runs, if TST.kind="cv"

opts$GRAPHDEV="non";                              

opts$GD.RESTART=F;      

opts$VERBOSE= opts$SRF.verbose = 0;     

 

sonar_01.roi:

name     low  high type
CUTOFF1  0.1  0.80 FLOAT
CLASSWT2 5    15   FLOAT
XPERC    0.90 1.00 FLOAT

 

The three parameters CUTOFF1, CLASSWT2 and XPERC are tuned within the borders specified by sonar_01.roi. Usually you should set opts$GRAPHDEV="non" and opts$GD.RESTART=F to avoid any graphic output and any graphics-device closing from main_sonar.r, so that you get only the graphics made by SPOT.

After this preparation, the SPOT tuning is started with  demo/demo02sonar.r:

 

path <- paste(.find.package("TDMR"), "demo02sonar",sep="/");

oldwd <- getwd();

setwd(path);

tdm=tdmDefaultsFill(mainFile="main_sonar.r");
tdmMapDesSpot$load(tdm);

 

source("sonar_01.apd");      # read in opts-settings

source(tdm$mainFile);

spotUserConfig = list(tdm=tdm,spot.fileMode=F, opts=opts);

spotConfig = spot("sonar_01.conf","auto",spotConfig=spotUserConfig);

 

It will generate the usual SPOT result files (see SPOT manual [Bart10e])

-          sonar_01.res

-          sonar_01.bst

The tuning will stop after 16 sequential steps with the configuration CONFIG=58, because the budget of auto.loop.nevals=100 evaluations is exhausted. The best solution can be seen from the last line of sonar_01.bst (or alternatively from the printout of spotConfig$alg.currentBest).
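
If you prefer to inspect the best solution from within R, a minimal sketch (assuming that sonar_01.bst is a whitespace-separated table with a header line, as written by SPOT) is:

  bst <- read.table("sonar_01.bst", header=TRUE);   # read the .bst result file
  print(tail(bst, 1));                              # last line = best configuration found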

With

 

  spot("sonar_01.conf","rep");

 

the results from a prior tuning run producing sonar_01.res are read in again and a report including a sensitivity plot (see Figure 3) is made.

Figure 3: Sensitivity plot. Each ROI [low,high] is mapped to the normalized ROI [-1,1] on the abscissa.

The sensitivity plot shows the response of the metamodel in the vicinity of the best solution. Consider for example the black line (CUTOFF1) in Figure 3: This line is produced by fixing the other parameters (CLASSWT2, XPERC) at their best-solution values (green and red dot) and varying CUTOFF1 within its ROI-interval [0.10, 0.80], which is mapped to the normalized abscissa [-1.0, 1.0] in the plot. The black line is the response of the metamodel. The best-solution value of CUTOFF1 is the black dot. The steep valley of the black line around the black dot shows that the metamodel is quite sensitive to CUTOFF1.
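
As an illustration of this normalization (assuming the usual linear mapping of [low, high] onto [-1, 1]):

  ## map a parameter value x from its ROI [low, high] to the normalized abscissa [-1, 1]
  normROI <- function(x, low, high) { 2 * (x - low) / (high - low) - 1 }
  normROI(0.45, 0.10, 0.80)    # a CUTOFF1 value of 0.45 maps to 0.0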

 

See section "Details for Lesson 2" in TDMR-docu.html for more details.

 

Lesson 3: “The Big Loop” on task SONAR

 

To start “The Big Loop”, you configure a file similar to demo/demo03sonar.r:

 

path <- paste(.find.package("TDMR"), "demo02sonar",sep="/");

oldwd <- getwd();

setwd(path);

source("main_sonar.r");    # in working dir

 

tdm <- list( mainFunction="main_sonar"

            , umode=c("RSUB")           # ["CV" | "RSUB" | "TST"]

            , tuneMethod = c("spot","lhd")

            , filenameEnvT="demo03.RData"   # file to save environment envT (in working dir)

            , finalFile="sonar.fin"

            , nrun=5, nfold=2          # repeats and CV-folds for the unbiased runs

            , withParams=TRUE     # list the columns with tuned parameters in final results

            , nExperim=2

            , optsVerbosity = 0       # the verbosity for the unbiased runs

            );

## tdm$runList can also contain more than one file (e.g. c("sonar_01.conf","sonar_02.conf")).

tdm$runList = c("sonar_04.conf");    

## spotStep can be either "auto" (do automatic tuning) or "rep" (make a visual report of best results)

spotStep = "auto";

 

## construct an initial environment envT from the given TDMR settings in tdm

envT <- tdmMakeNewEnvT(tdm);

 

envT <- tdmBigLoop(envT,spotStep);

 

setwd(oldwd);    # restore old working directory

 

demo03sonar.r specifies a vector of TDMR tasks in tdm$runList (a vector of .conf files), a vector of tuners in tdm$tuneMethod, e.g. c("spot","lhd"), and other values of tdm. Then it calls tdmBigLoop().

Here, this script will trigger the following sequence of experiments:

-          sonar_04.conf is started with tuner spot

-          sonar_04.conf is started with tuner lhd

This sequence of 2 tuning experiments is repeated nExperim=2 times. The corresponding 4 result lines are written to data frame envT$theFinals.[2] The switch tdm$withParams=TRUE is only sensible if all .conf files have the same set of parameters in their .roi file.

 

The resulting data frame envT$theFinals contains:

 

    CONF TUNER NEXP  CLASSWT2    XPERC NRUN NEVAL RGain.bst RGain.avg RGain.OOB sdR.OOB RGain.RSUB sdR.RSUB
sonar_02   lhd    1 12.026543 0.930197    3    36  86.70213   84.3676   84.4311 1.03715   83.73984  5.63268
sonar_02  spot    1 14.713475 0.981312    3    36  86.96809   84.6926   85.6287 1.03715   86.99187  7.04085
sonar_03   lhd    2  8.037636 0.954494    3    36  81.91489   78.6643   80.4391 1.82937   79.67480  7.45134
sonar_03  spot    2  7.375221 0.914740    3    35  81.91489   78.7082   78.8423 0.34571   74.79675  2.81634

 

 

Here CLASSWT2 and XPERC are the tuning parameters; the other columns are defined in Table 2.

In the example above, the tuning process had a budget of NEVAL=36 model trainings, resulting in a best solution with class accuracy RGain.bst (in %). The average class accuracy (mean over all design points) during tuning is RGain.avg. When the tuning is finished, the best solution is taken and NRUN=3 unbiased evaluation runs are done with the parameters of the best solution. Since the classification model in this example is RF (Random Forest), an OOB error from the 3 trainings is returned, with average RGain.OOB and standard deviation sdR.OOB. Additionally, NRUN=3 trainings are done with random subsampling (RSUB) of the data set into training and test sets, resulting in an average class accuracy on the test set, RGain.RSUB, and the corresponding standard deviation sdR.RSUB.
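
Since envT$theFinals is an ordinary data frame, the results can be summarized with base R. A small sketch, using the column names from the example output above:

  print(envT$theFinals);
  ## mean unbiased test-set accuracy per tuner, averaged over all experiments:
  aggregate(RGain.RSUB ~ TUNER, data=envT$theFinals, FUN=mean);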

Note that the standard deviation sdR.RSUB is quite large in this case (because the test set is very small). A more reliable result might be obtained with tdm$umode="CV" instead of "RSUB".
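
A sketch of the corresponding change in the lesson-3 script (umode and nfold are elements of the tdm list shown above):

  tdm$umode <- c("CV");      # cross validation instead of random subsampling ("RSUB")
  tdm$nfold <- 5;            # number of CV folds for the unbiased runs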

 

See section "Details for Lesson 3" in TDMR-docu.html for more details.

 

Lesson 4: Regression Big Loop

demo/demo04cpu.r contains the regression counterpart of Lesson 3.

 

Lesson 5: Interactive Visualization

demo/demo05visMeta.r

Once a lesson-3 experiment is completed, the return value envT  from tdmBigLoop() contains the result of such an experiment and may be visually inspected. Alternatively, envT may be loaded from an appropriate .RData file. The call

  tdmPlotResMeta(envT);

 

allows you to visually inspect all RES data frames contained in envT.
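
A minimal sketch, assuming the environment was saved under the name given in tdm$filenameEnvT during Lesson 3:

  envT <- tdmEnvTLoad("demo03.RData");   # load a previously saved environment envT
  tdmPlotResMeta(envT);                  # start the interactive inspection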

Figure 4: The user interface in tdmPlotResMeta. The user may select the tuner, the design variables to show on x- and y-axis, the display function (spotReport3d or spotReportContour) and the metamodel function (modelFit). Two optional sliders are nExper and nSkip (see text).

The user interface is shown and explained in Figure 4. An additional combo box "confFile" appears only if envT$runList has more than one element. An additional slider "nExper" appears only if envT$tdm$nExperim>1.

 

Figure 5: Two example outputs from tdmPlotResMeta with reportFunc=spotReport3d. Left: modelFit = spotPredictGausspr; right: modelFit = spotPredictRandomForest.

 

 

With "tuner", "confFile" and "nExper" the user selects a certain RES data frame from envT. This data frame contains a collection of function evaluations for the design points selected by the tuner. With one of the metamodel construction functions (see package SPOT for further details)

·         spotPredictGausspr

·         spotPredictRandomForest

·         spotPredictMlegp

a metamodel is fitted to the RES data frame and the result is shown as a shaded surface in the plot. The RES data points are shown as black points in Figure 5.

Since certain "bad" RES points may dominate the plot as outliers and hinder the user from inspecting the region near the optimum, there are two options to suppress "bad" points:

1.       If the slider nSkip has a value >0, then the nSkip RES data points with the worst y-value are discarded.

2.       If the checkbox “Skip incomplete CONFIGs” is activated, then design points belonging to a configuration which was not evaluated maxRepeats times are discarded (relevant for SPOT only).

Note that both options reduce the number of RES data points. This also affects the metamodel fit, so use both options with care if the number of RES data points is small.

 

The plots created with spotReport3d make use of the rgl-package. They can be interactively manipulated with the mouse. They can be selected and saved as PNG images with commands like

 

  rgl.set(7);

  rgl.snapshot("myFile.png");

 

 

A complete demo example is invoked with:

  demo(demo05visMeta);

 

Lesson 6: Performance Measure Plots

With the help of package ROCR [Sing05], several area-based performance measures can be used for binary classification. The file demo/demo06ROCR.r shows an example:

  setwd(paste(.find.package("TDMR"), "demo02sonar",sep="/"));

  source("main_sonar.r");    

  opts = tdmOptsDefaultsSet();   

  # ...

  opts$rgain.type <- "arROC";

  result = main_sonar(opts);

  tdmGraphicNewWin(opts);

  cat("Area under ROC-curve for validation data set: ");

  print(tdmROCRbase(result));        # side effect: plot ROC-curve

  tdmGraphicNewWin(opts);

  cat("Area under lift curve for training data set: ");

  print(tdmROCRbase(result,dataset="training",typ="lift"));  

  # side effect: plot lift chart

 

Figure 6: Left: ROC chart, right: lift chart, as produced by demo06ROCR.r with the help of package ROCR [Sing05]. The bar on the right side of each plot shows a color coding of the cutoff parameter.

 

Once the variable result contains an object of class TDMclassifier, it is also possible to inspect such an object interactively and to plot the ROC, lift, or precision-recall chart for the training set or the validation set:

 

  tdmROCR(result);

 

A twiddler interface for the object result shows up, as shown in Figure 7, and allows the user to select between different performance measure plots:

 

Figure 7: The user interface of tdmROCR. The user may select the dataset (training or validation), the type of plot (ROC, lift, or precision-recall) and the run number (only if Opts(result)$NRUN>1).

Frequently Asked Questions (FAQ)

How can I rerun a certain best tuning solution and test it on the same / other data?

Rerun your lesson-3 script with spotStep="rep".

Or use the following code snippet:

  envT = tdmEnvTLoad("demoSonar.RData");    # load envT
  source("main_sonar.r");
  envT$tdm$nrun=0;       # =0: no unbiasedRun
  finals = tdmEnvTSensi(envT,1);
  if (!is.null(finals)) print(finals);

 

This produces only the sensitivity plot (without unbiased runs).

If envT$tdm$nrun>0, then a certain number of unbiased runs are done with the best tuning solution.
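
For example, the snippet above changes as follows if you also want a few unbiased runs with the best solution (the value 3 is just an illustrative choice):

  envT <- tdmEnvTLoad("demoSonar.RData");   # load envT
  source("main_sonar.r");
  envT$tdm$nrun <- 3;                       # >0: perform 3 unbiased runs with the best solution
  finals <- tdmEnvTSensi(envT, 1);
  if (!is.null(finals)) print(finals);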

How can I make new predictions with a trained model?

How can I add a new tuning parameter to TDMR?

·         [user] Add a new line to userMapDesign.csv in directory tdm$path.[3] Suppose you want to tune the variable opts$SRF.samp: add to file userMapDesign.csv a line

  SRF.SAMP;    opts$SRF.samp;   0

This specifies that whenever SRF.SAMP appears in a .roi file in directory tdm$path, the tuner will tune this variable, and TDMR maps SRF.SAMP to opts$SRF.samp. A matching .roi entry is sketched after this list.

·         [developer] Similarly, add a new line to tdmMapDesign.csv. This makes the mapping available for all tasks, not only for those in the current tdm$path.

·         [optional, for developer] For a new variable opts$Z, add a default-value line to tdmOptsDefaultsSet(). Then all existing and future tasks have this default setting for opts$Z.
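
As a sketch, the matching .roi entry for the new parameter could look like this (the format follows sonar_01.roi from Lesson 2; the bounds 0.1 and 1.0 are only illustrative):

  name      low  high type
  SRF.SAMP  0.1  1.0  FLOAT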

How can I add a new machine learning algorithm  to TDMR?

See section "How to integrate new machine learning algorithms" in TDMR-docu.html.

How can I add a new tuning algorithm  to TDMR?

See section "How to integrate new tuners" in TDMR-docu.html.

 

 

References

[Bart10e] T. Bartz-Beielstein. SPOT: An R package for automatic and interactive tuning of optimization algorithms by sequential parameter optimization. Technical Report, FH Köln, 2010

[Kone10a] W. Konen, P. Koch, O. Flasch, T. Bartz-Beielstein. Parameter-tuned data mining: A general framework. In F. Hoffmann and E. Hüllermeier, editors, Proceedings 20. Workshop Computational Intelligence. Universitätsverlag Karlsruhe, 2010.

[Kone11b] W. Konen, P. Koch, O. Flasch, T. Bartz-Beielstein. Tuned Data Mining: A Benchmark Study on Different Tuners, Proc. GECCO 2011, Dublin, July 2011.  

[Sing05] T. Sing, O. Sander, N. Beerenwinkel, T. Lengauer. ROCR: Visualizing classifier performance in R. Bioinformatics 21(20):3940-3941, 2005. See also http://rocr.bioinf.mpi-sb.mpg.de/



[1] Template main_sonar.r is available from inst/demo02sonar.

[2] If tdm$fileMode==TRUE (now deprecated), the 4 lines are also written to tdm$finalFile. If tdm$experFile is not NULL, these lines are also appended to the file tdm$experFile.

[3] If such a file does not exist yet, the user has to create it with a first line
      roiValue;   optsValue;          isInt