The FFTrees() function is at the heart of the FFTrees package. The function takes a training dataset as an argument and generates several fast-and-frugal trees that attempt to classify cases into one of two classes based on cues (also known as features).
Let’s start with an example: we’ll create FFTrees fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:
head(heartdisease)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 63 1 ta 145 233 1 hypertrophy 150 0 2.3 down 0
## 2 67 1 a 160 286 0 hypertrophy 108 1 1.5 flat 3
## 3 67 1 a 120 229 0 hypertrophy 129 1 2.6 flat 2
## 4 37 1 np 130 250 0 normal 187 0 3.5 down 0
## 5 41 0 aa 130 204 0 hypertrophy 172 0 1.4 up 0
## 6 56 1 aa 120 236 0 normal 178 0 0.8 up 0
## thal diagnosis
## 1 fd 0
## 2 normal 1
## 3 rd 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
The critical dependent variable is diagnosis, which indicates whether a patient has heart disease or not. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.
Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance on the test set:
set.seed(100) # For replication
heart.rand <- heartdisease[sample(nrow(heartdisease)), ] # Shuffle the rows
heart.train <- heart.rand[1:150, ]   # First 150 cases for training
heart.test <- heart.rand[151:303, ]  # Remaining 153 cases for testing
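As a quick sanity check (a minimal sketch, assuming the code above has been run), we can confirm the size of each subset and the base rate of heart disease in the training data:
nrow(heart.train)            # 150 training cases
nrow(heart.test)             # 153 test cases
mean(heart.train$diagnosis)  # base rate of heart disease in the training data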
We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable and include all independent variables with formula = diagnosis ~ .:
heart.fft <- FFTrees(formula = diagnosis ~ .,
data = heart.train,
data.test = heart.test)
FFTrees() returns an object of class FFTrees. There are many elements in an FFTrees object; here are their names:
names(heart.fft)
## [1] "formula" "data.desc" "cue.accuracies"
## [4] "tree.definitions" "tree.stats" "level.stats"
## [7] "decision" "levelout" "auc"
## [10] "params" "comp" "data"
You can view basic information about the FFTrees object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the tree(s) use, and how well they performed.
heart.fft
## [1] "7 FFTs predicting diagnosis"
## [1] "FFT #4 {thal,cp,ca} maximizes training wacc:"
## train test
## cases :n 150.00 153.00
## speed :mcu 1.74 1.73
## frugality :pci 0.88 0.88
## accuracy :acc 0.80 0.82
## balanced :bacc 0.80 0.82
## weighted :wacc 0.80 0.82
## sensitivity :sens 0.82 0.88
## specificity :spec 0.79 0.76
You can obtain marginal cue accuracy statistics from the cue.accuracies list, which contains one dataframe of cue-level statistics per dataset. For each cue, the threshold that maximizes the v-statistic (HR - FAR) in the training data is chosen. If the object has test data, you can also see the cue accuracies in the test dataset (using the thresholds calculated from the training data):
heart.fft$cue.accuracies
## $train
## cue class threshold direction n hi mi fa cr
## 1 age numeric 54 > 150 47 19 31 53
## 2 sex numeric 0 > 150 53 13 48 36
## 3 cp character a = 150 48 18 18 66
## 4 trestbps numeric 138 > 150 26 40 21 63
## 5 chol numeric 223 > 150 49 17 51 33
## 6 fbs numeric 0 > 150 10 56 9 75
## 7 restecg character hypertrophy,abnormal = 150 40 26 34 50
## 8 thalach numeric 156 < 150 45 21 29 55
## 9 exang numeric 0 > 150 31 35 14 70
## 10 oldpeak numeric 0.9 > 150 41 25 21 63
## 11 slope character flat,down = 150 45 21 27 57
## 12 ca numeric 0 > 150 47 19 19 65
## 13 thal character rd,fd = 150 47 19 16 68
## sens spec far acc bacc wacc dprime
## 1 0.7121212 0.6309524 0.3690476 0.6666667 0.6715368 0.6715368 0.8939691
## 2 0.8030303 0.4285714 0.5714286 0.5933333 0.6158009 0.6158009 0.6724827
## 3 0.7272727 0.7857143 0.2142857 0.7600000 0.7564935 0.7564935 1.3962240
## 4 0.3939394 0.7500000 0.2500000 0.5933333 0.5719697 0.5719697 0.4054236
## 5 0.7424242 0.3928571 0.6071429 0.5466667 0.5676407 0.5676407 0.3789573
## 6 0.1515152 0.8928571 0.1071429 0.5666667 0.5221861 0.5221861 0.2119100
## 7 0.6060606 0.5952381 0.4047619 0.6000000 0.6006494 0.6006494 0.5101065
## 8 0.6818182 0.6547619 0.3452381 0.6666667 0.6682900 0.6682900 0.8709980
## 9 0.4696970 0.8333333 0.1666667 0.6733333 0.6515152 0.6515152 0.8913899
## 10 0.6212121 0.7500000 0.2500000 0.6933333 0.6856061 0.6856061 0.9831556
## 11 0.6818182 0.6785714 0.3214286 0.6800000 0.6801948 0.6801948 0.9364969
## 12 0.7121212 0.7738095 0.2261905 0.7466667 0.7429654 0.7429654 1.3110438
## 13 0.7121212 0.8095238 0.1904762 0.7666667 0.7608225 0.7608225 1.4357351
##
## $test
## cue class threshold direction n hi mi fa cr
## 1 age numeric 54 > 153 48 25 34 46
## 2 sex numeric 0 > 153 61 12 44 36
## 3 cp character a = 153 57 16 21 59
## 4 trestbps numeric 138 > 153 28 45 23 57
## 5 chol numeric 223 > 153 51 22 47 33
## 6 fbs numeric 0 > 153 12 61 14 66
## 7 restecg character hypertrophy,abnormal = 153 0 73 0 80
## 8 thalach numeric 156 < 153 56 17 33 47
## 9 exang numeric 0 > 153 45 28 9 71
## 10 oldpeak numeric 0.9 > 153 51 22 24 56
## 11 slope character flat,down = 153 0 73 0 80
## 12 ca numeric 0 > 153 46 27 15 65
## 13 thal character rd,fd = 153 0 73 0 80
## sens spec far acc bacc wacc dprime
## 1 0.6575342 0.5750 0.4250 0.6143791 0.6162671 0.6162671 0.59486134
## 2 0.8356164 0.4500 0.5500 0.6339869 0.6428082 0.6428082 0.85093885
## 3 0.7808219 0.7375 0.2625 0.7581699 0.7591610 0.7591610 1.41062908
## 4 0.3835616 0.7125 0.2875 0.5555556 0.5480308 0.5480308 0.26456319
## 5 0.6986301 0.4125 0.5875 0.5490196 0.5555651 0.5555651 0.29934599
## 6 0.1643836 0.8250 0.1750 0.5098039 0.4946918 0.4946918 -0.04201090
## 7 0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000 0.03216494
## 8 0.7671233 0.5875 0.4125 0.6732026 0.6773116 0.6773116 0.95052459
## 9 0.6164384 0.8875 0.1125 0.7581699 0.7519692 0.7519692 1.50947947
## 10 0.6986301 0.7000 0.3000 0.6993464 0.6993151 0.6993151 1.04486521
## 11 0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000 0.03216494
## 12 0.6301370 0.8125 0.1875 0.7254902 0.7213185 0.7213185 1.21936274
## 13 0.0000000 1.0000 0.0000 0.5228758 0.5000000 0.5000000 0.03216494
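To make these statistics concrete, here is how the training entries for the thal cue (row 13 of the $train table above) can be recomputed by hand; a minimal sketch using the printed counts:
hi <- 47; mi <- 19; fa <- 16; cr <- 68  # counts for thal from the $train table
sens <- hi / (hi + mi)  # sensitivity (HR): ~0.712
spec <- cr / (cr + fa)  # specificity: ~0.810
far  <- 1 - spec        # false alarm rate (FAR): ~0.190
sens - far              # the v-statistic (HR - FAR) that the threshold maximizes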
You can also view the cue accuracies in an ROC-type plot with plot() combined with the what = "cues" argument:
plot(heart.fft,
main = "Heartdisease Cue Accuracy",
what = "cues")
The tree.definitions dataframe contains the definitions (cues, classes, exits, thresholds, and directions) of all trees in the object:
heart.fft$tree.definitions
## tree cues nodes classes exits thresholds directions
## 1 1 thal;cp;ca;oldpeak 4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;0.9 =;=;>;>
## 5 2 thal;cp;ca 3 c;c;n 0;0;0.5 rd,fd;a;0 =;=;>
## 3 3 thal;cp;ca 3 c;c;n 0;1;0.5 rd,fd;a;0 =;=;>
## 2 4 thal;cp;ca 3 c;c;n 1;0;0.5 rd,fd;a;0 =;=;>
## 6 5 thal;cp;ca;oldpeak 4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;0.9 =;=;>;>
## 4 6 thal;cp;ca;oldpeak 4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;0.9 =;=;>;>
## 7 7 thal;cp;ca;oldpeak 4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;0.9 =;=;>;>
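Because the cues, exits, thresholds, and directions are stored as semicolon-separated strings, you may want to unpack a single definition into one row per node. Here is a minimal sketch (assuming the heart.fft object created above):
tree.def <- lapply(heart.fft$tree.definitions[1, ], as.character) # tree 1's definition
data.frame(cue       = strsplit(tree.def$cues, ";")[[1]],
           class     = strsplit(tree.def$classes, ";")[[1]],
           exit      = strsplit(tree.def$exits, ";")[[1]],
           threshold = strsplit(tree.def$thresholds, ";")[[1]],
           direction = strsplit(tree.def$directions, ";")[[1]])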
The tree.stats list contains classification statistics for all trees applied to both the training data (tree.stats$train) and the test data (tree.stats$test). Here are the training statistics:
heart.fft$tree.stats$train
## tree cues nodes classes exits thresholds directions
## 1 1 thal;cp;ca;oldpeak 4 c;c;n;n 0;0;0;0.5 rd,fd;a;0;0.9 =;=;>;>
## 2 2 thal;cp;ca 3 c;c;n 0;0;0.5 rd,fd;a;0 =;=;>
## 3 3 thal;cp;ca 3 c;c;n 0;1;0.5 rd,fd;a;0 =;=;>
## 4 4 thal;cp;ca 3 c;c;n 1;0;0.5 rd,fd;a;0 =;=;>
## 5 5 thal;cp;ca;oldpeak 4 c;c;n;n 1;0;1;0.5 rd,fd;a;0;0.9 =;=;>;>
## 6 6 thal;cp;ca;oldpeak 4 c;c;n;n 1;1;0;0.5 rd,fd;a;0;0.9 =;=;>;>
## 7 7 thal;cp;ca;oldpeak 4 c;c;n;n 1;1;1;0.5 rd,fd;a;0;0.9 =;=;>;>
## n hi mi fa cr sens spec far acc bacc
## 1 150 21 45 0 84 0.3181818 1.0000000 0.00000000 0.7000000 0.6590909
## 2 150 28 38 2 82 0.4242424 0.9761905 0.02380952 0.7333333 0.7002165
## 3 150 44 22 7 77 0.6666667 0.9166667 0.08333333 0.8066667 0.7916667
## 4 150 54 12 18 66 0.8181818 0.7857143 0.21428571 0.8000000 0.8019481
## 5 150 56 10 21 63 0.8484848 0.7500000 0.25000000 0.7933333 0.7992424
## 6 150 59 7 32 52 0.8939394 0.6190476 0.38095238 0.7400000 0.7564935
## 7 150 64 2 52 32 0.9696970 0.3809524 0.61904762 0.6400000 0.6753247
## wacc dprime pci mcu
## 1 0.6590909 2.053928 0.8642857 1.90
## 2 0.7002165 1.789700 0.8785714 1.70
## 3 0.7916667 1.813721 0.8885714 1.56
## 4 0.8019481 1.700096 0.8757143 1.74
## 5 0.7992424 1.704447 0.8685714 1.84
## 6 0.7564935 1.550734 0.8485714 2.12
## 7 0.6753247 1.573378 0.8357143 2.30
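For example, you can recover the tree that the summary printout selected by finding the row with the highest training wacc (a small sketch):
best.tree <- which.max(heart.fft$tree.stats$train$wacc)          # index of best tree
heart.fft$tree.stats$train[best.tree, c("tree", "cues", "wacc")] # should be FFT #4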
The decision list contains the raw classification decisions for each tree for each training (and test) case. Here is how each tree classified the first five cases in the training data:
heart.fft$decision$train[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
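Since each column corresponds to one tree, you can recompute a tree's training accuracy directly from these decisions. A sketch, assuming the rows of decision$train align with the rows of heart.train:
# Proportion of cases where tree 4's decision matches the true diagnosis
mean(heart.fft$decision$train[, 4] == (heart.train$diagnosis == 1))
# should be ~0.80, matching tree 4's acc in tree.stats$train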
The levelout list contains the level at which each case was classified by each tree. Here are the levels at which the first five test cases were classified:
heart.fft$levelout$test[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1 1 1 2 2 3 4
## [2,] 2 2 3 1 1 1 1
## [3,] 4 3 2 1 1 1 1
## [4,] 4 3 2 1 1 1 1
## [5,] 1 1 1 3 4 2 2
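These exit levels are also what the mcu (mean cues used) statistic summarizes: a case classified at level k used k cues. A sketch recomputing mcu for every tree from the test levels:
# Mean exit level (= mean cues used) per tree in the test data
colMeans(heart.fft$levelout$test)  # the 4th value should be ~1.73, FFT #4's test mcu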
Once you’ve created an FFTrees object, you can use it to predict new data using predict(). To specify which tree to use, include the tree argument. In this example, I’ll use the heart.fft object to make predictions for cases 1 through 50 in the heartdisease dataset:
predict(heart.fft,
data = heartdisease[1:50,])
## [1] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
## [12] FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [34] TRUE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE
## [45] FALSE TRUE FALSE TRUE FALSE FALSE
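To check these predictions against the true diagnoses, you can compare them to the criterion recoded as a logical (a minimal sketch):
pred <- predict(heart.fft, data = heartdisease[1:50, ])
truth <- heartdisease$diagnosis[1:50] == 1  # recode the 0/1 criterion as logical
mean(pred == truth)                         # proportion of correct classifications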
Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (here, FFT #4) applied to the test data:
plot(heart.fft,
main = "Heart Disease",
decision.names = c("Healthy", "Disease"))
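plot() also lets you select which tree to display and which dataset (training or test) to show. A sketch, assuming plot.FFTrees's tree and data arguments:
# Show tree 1 applied to the training data instead
plot(heart.fft,
     data = "train",
     tree = 1,
     main = "Heart Disease",
     decision.names = c("Healthy", "Disease"))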
See the vignette on plotting trees for more details on visualizing trees.
The FFTrees() function has several additional arguments that change how trees are built (a sketch combining them follows these descriptions):
max.levels: What is the maximum number of levels the trees should have? The larger max.levels is, the longer the trees will be, and the more trees will be created (because all possible exit structures are used).
train.p: What proportion of the data should be used for training (if data.test is not specified)? train.p = .1 will randomly select 10% of the data for training and leave the remaining 90% for testing. Setting train.p = 1 will fit the trees to the entire dataset (and leave no data for testing).
goal: What accuracy statistic should the trees try to maximize? The default is balanced accuracy (bacc), the average of sensitivity and specificity. Alternatively, acc will maximize overall accuracy (i.e., the absolute percentage of correct decisions).
algorithm: As trees are being built, should cues be selected based on their marginal accuracy (algorithm = "m") applied to the entire dataset, or on their conditional accuracy (algorithm = "c") applied to all cases that have not yet been classified? Each method has potential pros and cons. The marginal method is much faster and may be less prone to overfitting. However, the conditional method can capture important conditional dependencies between cues that the marginal method misses. Additionally, algorithm = "c" allows the same cue to be used multiple times in the tree; when a cue has a strong non-monotonic relationship with the criterion, this can greatly improve performance.
sens.w: How much weight should be given to maximizing sensitivity (i.e., avoiding misses) versus maximizing specificity (i.e., avoiding false alarms)? The default is sens.w = .5, which treats both measures equally. However, if your decision problem strongly favors maximizing hits over avoiding false alarms, you may wish to set sens.w to a higher value such as 0.75. Changing this value does not (currently) affect tree construction; instead, it is used to select the tree with the highest weighted accuracy (wacc) score, where wacc = sensitivity * sens.w + specificity * (1 - sens.w).
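Here is a sketch combining these arguments in a single call; the argument names come from the descriptions above, and the values are purely illustrative:
heart.fft2 <- FFTrees(formula = diagnosis ~ .,
                      data = heartdisease,
                      train.p = .5,     # use 50% of the data for training
                      max.levels = 5,   # allow trees with up to 5 levels
                      goal = "bacc",    # maximize balanced accuracy
                      algorithm = "m",  # marginal cue selection
                      sens.w = .75)     # weight sensitivity more heavily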