Creating FFTrees

Nathaniel Phillips

2016-09-10

The FFTrees() function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several fast and frugal trees, each of which attempts to classify cases into one of two classes based on cues (predictors).

heartdisease example

Let’s start with an example: we’ll create FFTrees fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

The critical dependent variable is diagnosis, which indicates whether a patient has heart disease or not. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.

Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training dataset, then test their performance on the testing dataset:

set.seed(100) # For replication
# Randomly assign each case to training (TRUE) or testing (FALSE)
samples <- sample(c(TRUE, FALSE), size = nrow(heartdisease), replace = TRUE)
heartdisease.train <- heartdisease[samples, ]
heartdisease.test <- heartdisease[!samples, ]

We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable, and include all independent variables with formula = diagnosis ~ .:

heart.fft <- FFTrees(formula = diagnosis ~.,
                    data = heartdisease.train,
                    data.test = heartdisease.test
                    )
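
To make the formula shorthand concrete, here is a minimal base R sketch of how the dot in diagnosis ~ . expands to every column other than the dependent variable. The data frame toy and its values are made up for illustration and are not part of the package:

```r
# Hypothetical toy data; only the column names matter here
toy <- data.frame(diagnosis = c(0, 1, 1),
                  age = c(63, 67, 37),
                  sex = c(1, 1, 0),
                  thal = c("fd", "normal", "rd"))

# terms() expands the dot using the columns of `data`,
# leaving out the dependent variable on the left-hand side
predictors <- attr(terms(diagnosis ~ ., data = toy), "term.labels")
predictors
```

Every column except diagnosis ends up as a candidate cue, which is why no predictors need to be listed by name in the call above.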

Elements of an FFTrees object

FFTrees() returns an object with the FFTrees class. An FFTrees object contains many elements; here are their names:

names(heart.fft)
##  [1] "formula"          "data"             "cue.accuracies"  
##  [4] "tree.definitions" "tree.stats"       "level.stats"     
##  [7] "decision"         "levelout"         "auc"             
## [10] "lr"               "cart"

Printing an FFTrees object

You can view basic information about the FFTrees object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the tree(s) use, and how well they performed.

heart.fft
## [1] "An FFTrees object containing 8 trees using 4 predictors {thal,cp,exang,slope}"
## [1] "FFTrees AUC: (Train = 0.88, Test = 0.85)"
## [1] "My favorite training tree is #5, here is how it performed:"
##                         train   test
## n                      149.00 154.00
## p(Correct)               0.83   0.77
## Hit Rate (HR)            0.91   0.79
## False Alarm Rate (FAR)   0.24   0.25
## d-prime                  2.04   1.46

Cue accuracy statistics: cue.accuracies

You can obtain marginal cue accuracy statistics from the cue.accuracies list, which contains one dataframe for the training data and, if available, one for the test data. "Marginal" means that each cue is evaluated on its own: for each cue, the threshold that maximizes the v-statistic (HR - FAR) in the training dataset is chosen. If the object has test data, you can also see the marginal cue accuracies in the test dataset (using the thresholds calculated from the training data):

heart.fft$cue.accuracies
## $train
##         cue     class            threshold direction   n hi mi fa cr
## 10      age   numeric                53.89        >= 149 43 21 42 43
## 2       sex   numeric                    1        >= 149 51 13 47 38
## 7        cp character             np,aa,ta        != 149 50 14 22 63
## 9  trestbps   numeric               138.74         > 149 27 37 16 69
## 6      chol   numeric               252.32         > 149 36 28 26 59
## 1       fbs   numeric                    0         > 149 12 52 11 74
## 21  restecg character hypertrophy,abnormal         = 149 39 25 34 51
## 11  thalach   numeric               144.32        <= 149 36 28 21 64
## 22    exang   numeric                    1        >= 149 40 24 11 74
## 4   oldpeak   numeric                 0.98         > 149 42 22 24 61
## 41    slope character                   up        != 149 52 12 31 54
## 12       ca   numeric                    0         > 149 41 23 21 64
## 42     thal character               normal        != 149 47 17 17 68
##          hr       far          v    dprime
## 10 0.671875 0.4941176 0.17775735 0.4598419
## 2  0.796875 0.5529412 0.24393382 0.6974151
## 7  0.781250 0.2588235 0.52242647 1.4233984
## 9  0.421875 0.1882353 0.23363971 0.6873190
## 6  0.562500 0.3058824 0.25661765 0.6648668
## 1  0.187500 0.1294118 0.05808824 0.2420296
## 21 0.609375 0.4000000 0.20937500 0.5310375
## 11 0.562500 0.2470588 0.31544118 0.8410851
## 22 0.625000 0.1294118 0.49558824 1.4478155
## 4  0.656250 0.2823529 0.37389706 0.9781159
## 41 0.812500 0.3647059 0.44779412 1.2330547
## 12 0.640625 0.2470588 0.39356618 1.0439043
## 42 0.734375 0.2000000 0.53437500 1.4677202
## 
## $test
##         cue     class            threshold direction   n hi mi fa cr
## 1       age   numeric                53.89        >= 154 58 17 33 46
## 2       sex   numeric                    1        >= 154 63 12 45 34
## 3        cp character             np,aa,ta        != 154 75  0 79  0
## 4  trestbps   numeric               138.74         > 154 27 48 28 51
## 5      chol   numeric               252.32         > 154 33 42 31 48
## 6       fbs   numeric                    0         > 154 10 65 12 67
## 7   restecg character hypertrophy,abnormal         = 154  0 75  0 79
## 8   thalach   numeric               144.32        <= 154 44 31 13 66
## 9     exang   numeric                    1        >= 154 36 39 12 67
## 10  oldpeak   numeric                 0.98         > 154 50 25 21 58
## 11    slope character                   up        != 154 51 24 27 52
## 12       ca   numeric                    0         > 154 52 23 13 66
## 13     thal character               normal        != 154 54 21 17 62
##           hr       far           v      dprime
## 1  0.7733333 0.4177215  0.35561181  0.95759530
## 2  0.8400000 0.5696203  0.27037975  0.81905044
## 3  1.0000000 1.0000000  0.00000000 -0.01705104
## 4  0.3600000 0.3544304  0.00556962  0.01492777
## 5  0.4400000 0.3924051  0.04759494  0.12208684
## 6  0.1333333 0.1518987 -0.01856540 -0.08244767
## 7  0.0000000 0.0000000  0.00000000  0.01705104
## 8  0.5866667 0.1645570  0.42210970  1.19487886
## 9  0.4800000 0.1518987  0.32810127  0.97817037
## 10 0.6666667 0.2658228  0.40084388  1.05622330
## 11 0.6800000 0.3417722  0.33822785  0.87533020
## 12 0.6933333 0.1645570  0.52877637  1.48122129
## 13 0.7200000 0.2151899  0.50481013  1.37138350
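
The hr, far, v, and dprime columns can be reproduced from the four counts (hi, mi, fa, cr). The sketch below does this for the thal row of the training table. It assumes d-prime is computed as qnorm(hr) - qnorm(far) with no correction. The package appears to apply some adjustment when a rate is exactly 0 or 1 (see the cp and restecg rows of the test table), which this sketch does not attempt to reproduce:

```r
# Counts for the thal cue in the training data (copied from the table above)
hi <- 47  # hits
mi <- 17  # misses
fa <- 17  # false alarms
cr <- 68  # correct rejections

hr  <- hi / (hi + mi)   # hit rate
far <- fa / (fa + cr)   # false alarm rate
v   <- hr - far         # v-statistic
# Assumption: plain d-prime, without any correction for rates of 0 or 1
dprime <- qnorm(hr) - qnorm(far)

round(c(hr = hr, far = far, v = v, dprime = dprime), 4)
```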

You can also view the cue accuracies in an ROC-type plot with showcues():

showcues(heart.fft, 
         main = "Heartdisease Cue Accuracy")
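
The idea of choosing, for each cue, the threshold that maximizes v can be sketched in a few lines of base R. This is only an illustration on made-up data (cue, criterion, and v_stat are hypothetical names), and it searches a single direction (>). The package itself also considers other directions and non-numeric cues, so treat this as a simplified picture rather than the package's actual routine:

```r
# Hypothetical numeric cue and binary criterion (1 = signal, 0 = noise)
cue       <- c(1, 2, 3, 4, 5, 6, 7, 8)
criterion <- c(0, 0, 0, 1, 1, 0, 1, 1)

# v = HR - FAR when predicting "signal" for cases with cue > threshold
v_stat <- function(threshold) {
  decision <- cue > threshold
  hr  <- mean(decision[criterion == 1])  # proportion of signals detected
  far <- mean(decision[criterion == 0])  # proportion of noise cases flagged
  hr - far
}

# Try every observed cue value as a candidate threshold
candidates <- sort(unique(cue))
v_values   <- sapply(candidates, v_stat)

best_threshold <- candidates[which.max(v_values)]
best_threshold
max(v_values)
```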

Tree definitions and accuracy statistics

The tree.definitions dataframe contains definitions (cues, classes, exits, thresholds, and directions) of all trees in the object:

heart.fft$tree.definitions
##   tree                cues nodes                               classes
## 1    1 thal;cp;exang;slope     4 character;character;numeric;character
## 5    2       thal;cp;exang     3           character;character;numeric
## 3    3       thal;cp;exang     3           character;character;numeric
## 7    4 thal;cp;exang;slope     4 character;character;numeric;character
## 2    5       thal;cp;exang     3           character;character;numeric
## 6    6 thal;cp;exang;slope     4 character;character;numeric;character
## 4    7       thal;cp;exang     3           character;character;numeric
## 8    8 thal;cp;exang;slope     4 character;character;numeric;character
##       exits                  thresholds directions
## 1 0;0;0;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 5   0;0;0.5           normal;np,aa,ta;1   !=;!=;>=
## 3   0;1;0.5           normal;np,aa,ta;1   !=;!=;>=
## 7 0;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 2   1;0;0.5           normal;np,aa,ta;1   !=;!=;>=
## 6 1;0;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 4   1;1;0.5           normal;np,aa,ta;1   !=;!=;>=
## 8 1;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
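
To show how such a definition can be read, here is a hand-written sketch of tree 5 (cues thal;cp;exang, exits 1;0;0.5). It assumes that an exit of 1 means "classify as positive if the cue condition is met, otherwise continue", an exit of 0 means "classify as negative if the condition is not met, otherwise continue", and an exit of 0.5 marks the final node, which classifies either way. This reading is consistent with the output shown in this vignette, but it is an assumption rather than a documented rule, and classify_tree5 is a hypothetical helper, not a package function:

```r
# Hand-coded version of tree 5; returns the decision and the level
# (node number) at which the case was classified
classify_tree5 <- function(thal, cp, exang) {
  # Node 1 (exit 1): thal != normal -> positive exit
  if (thal != "normal") {
    return(list(decision = TRUE, level = 1))
  }
  # Node 2 (exit 0): direction is cp != np,aa,ta, so cases whose cp IS
  # one of these values fail the condition and take the negative exit
  if (cp %in% c("np", "aa", "ta")) {
    return(list(decision = FALSE, level = 2))
  }
  # Node 3 (exit 0.5): final node, exang >= 1 -> positive, else negative
  list(decision = exang >= 1, level = 3)
}

case_a <- classify_tree5(thal = "fd",     cp = "ta", exang = 0)
case_b <- classify_tree5(thal = "normal", cp = "np", exang = 0)
case_c <- classify_tree5(thal = "normal", cp = "a",  exang = 1)
```

Under these assumptions, the first case is classified as positive at level 1, the second as negative at level 2, and the third as positive at level 3.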

The tree.stats list contains classification statistics for all trees, applied to both the training data (tree.stats$train) and the test data (tree.stats$test):

heart.fft$tree.stats$train
##   tree                cues nodes                               classes
## 1    1 thal;cp;exang;slope     4 character;character;numeric;character
## 2    2       thal;cp;exang     3           character;character;numeric
## 3    3       thal;cp;exang     3           character;character;numeric
## 4    4 thal;cp;exang;slope     4 character;character;numeric;character
## 5    5       thal;cp;exang     3           character;character;numeric
## 6    6 thal;cp;exang;slope     4 character;character;numeric;character
## 7    7       thal;cp;exang     3           character;character;numeric
## 8    8 thal;cp;exang;slope     4 character;character;numeric;character
##       exits                  thresholds directions   n hi mi fa cr
## 1 0;0;0;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 22 42  2 83
## 2   0;0;0.5           normal;np,aa,ta;1   !=;!=;>= 149 26 38  4 81
## 3   0;1;0.5           normal;np,aa,ta;1   !=;!=;>= 149 39 25  7 78
## 4 0;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 46 18 14 71
## 5   1;0;0.5           normal;np,aa,ta;1   !=;!=;>= 149 58  6 20 65
## 6 1;0;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 58  6 24 61
## 7   1;1;0.5           normal;np,aa,ta;1   !=;!=;>= 149 61  3 36 49
## 8 1;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 63  1 50 35
##         hr        far         v   dprime
## 1 0.343750 0.02352941 0.3202206 1.583520
## 2 0.406250 0.04705882 0.3591912 1.436864
## 3 0.609375 0.08235294 0.5270221 1.667108
## 4 0.718750 0.16470588 0.5540441 1.554432
## 5 0.906250 0.23529412 0.6709559 2.039533
## 6 0.906250 0.28235294 0.6238971 1.893877
## 7 0.953125 0.42352941 0.5295956 1.868812
## 8 0.984375 0.58823529 0.3961397 1.930867
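
As a quick check on these numbers, the sketch below rebuilds the count columns of the training table by hand, recomputes v and the proportion correct, and picks the tree with the highest v. Whether the package selects its "favorite" tree by exactly this criterion is an assumption. In this output the tree with the highest v is also the one reported in the printed summary:

```r
# Counts copied from the training table above
train_stats <- data.frame(
  tree = 1:8,
  hi = c(22, 26, 39, 46, 58, 58, 61, 63),
  mi = c(42, 38, 25, 18,  6,  6,  3,  1),
  fa = c( 2,  4,  7, 14, 20, 24, 36, 50),
  cr = c(83, 81, 78, 71, 65, 61, 49, 35)
)

# v = HR - FAR, and overall proportion of correct classifications
train_stats$v <- with(train_stats, hi / (hi + mi) - fa / (fa + cr))
train_stats$p_correct <- with(train_stats, (hi + cr) / (hi + mi + fa + cr))

# Tree with the highest v in the training data
best <- train_stats[which.max(train_stats$v), ]
best
```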

decision

The decision list contains the raw classification decisions for each tree for each training (and test) case.

Here is how each tree classified the first five cases in the training data:

heart.fft$decision$train[1:5,]
##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
## [1,] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [2,] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

levelout

The levelout list contains the levels at which each case was classified for each tree.

Here are the levels at which the first 5 test cases were classified:

heart.fft$levelout$test[1:5,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    4    3    2    2    1    1    1    1
## [2,]    1    1    1    1    3    4    2    2
## [3,]    3    3    2    2    1    1    1    1
## [4,]    3    3    2    2    1    1    1    1
## [5,]    1    1    1    1    2    2    3    4
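
One simple use of these levels is to gauge how many cues each tree needs before it reaches a decision. The sketch below re-enters the five rows shown above by hand and averages them per tree. Five cases are far too few for a real comparison, so this only illustrates the calculation:

```r
# The five rows of levelout$test shown above (rows = cases, columns = trees)
levels_shown <- matrix(c(4, 3, 2, 2, 1, 1, 1, 1,
                         1, 1, 1, 1, 3, 4, 2, 2,
                         3, 3, 2, 2, 1, 1, 1, 1,
                         3, 3, 2, 2, 1, 1, 1, 1,
                         1, 1, 1, 1, 2, 2, 3, 4),
                       nrow = 5, byrow = TRUE)

# Mean level at which each tree classified these cases
mean_levels <- colMeans(levels_shown)
mean_levels
```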

Predicting new data with predict()

Once you’ve created an FFTrees object, you can use it to classify new data using predict(). This returns a new FFTrees object with the new data stored as test data. Any existing test data in the object are overwritten, but all training data are kept. In this example, I’ll use the heart.fft object to make predictions for cases 1 through 50 in the heartdisease dataset:

heart.fft <- predict(heart.fft,
                     data.test = heartdisease[1:50,]
                     )

When you look at heart.fft now, you’ll see that the new test data (with 50 cases) are stored as test data:

heart.fft
## [1] "An FFTrees object containing 8 trees using 4 predictors {thal,cp,exang,slope}"
## [1] "FFTrees AUC: (Train = 0.88, Test = 0.87)"
## [1] "My favorite training tree is #5, here is how it performed:"
##                         train  test
## n                      149.00 50.00
## p(Correct)               0.83  0.78
## Hit Rate (HR)            0.91  0.80
## False Alarm Rate (FAR)   0.24  0.23
## d-prime                  2.04  1.57

Visualizing trees

Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 5) applied to the test data:

plot(heart.fft,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease")
     )

You can also visualize the individual cue accuracies with the showcues() function:

showcues(heart.fft)

See the vignette on plotting trees for more details on visualizing trees.

Additional arguments

The FFTrees() function has several additional arguments that change how trees are built. Note: Not all of these arguments have been fully tested yet!