The FFTrees() function is at the heart of the FFTrees package. The function takes a training dataset as an argument and generates several fast-and-frugal trees that attempt to classify cases into one of two classes based on cues.
Let’s start with an example: we’ll create FFTrees fitted to the heartdisease dataset. This dataset contains data from 303 patients suspected of having heart disease. Here’s how the dataset looks:
head(heartdisease)
## age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca
## 1 63 1 ta 145 233 1 hypertrophy 150 0 2.3 down 0
## 2 67 1 a 160 286 0 hypertrophy 108 1 1.5 flat 3
## 3 67 1 a 120 229 0 hypertrophy 129 1 2.6 flat 2
## 4 37 1 np 130 250 0 normal 187 0 3.5 down 0
## 5 41 0 aa 130 204 0 hypertrophy 172 0 1.4 up 0
## 6 56 1 aa 120 236 0 normal 178 0 0.8 up 0
## thal diagnosis
## 1 fd 0
## 2 normal 1
## 3 rd 1
## 4 normal 0
## 5 normal 0
## 6 normal 0
The critical dependent variable is diagnosis, which indicates whether or not a patient has heart disease. The other variables in the dataset (e.g., sex, age, and several biological measurements) will be used as predictors.
Now we’ll split the original dataset into a training dataset and a testing dataset. We will create the trees with the training set, then test their performance on the test set:
set.seed(100) # For replication
samples <- sample(c(TRUE, FALSE), size = nrow(heartdisease), replace = TRUE)
heartdisease.train <- heartdisease[samples, ]
heartdisease.test <- heartdisease[!samples, ]
We’ll create a new FFTrees object called heart.fft using the FFTrees() function. We’ll specify diagnosis as the (binary) dependent variable and include all independent variables with formula = diagnosis ~ .:
heart.fft <- FFTrees(formula = diagnosis ~.,
data = heartdisease.train,
data.test = heartdisease.test
)
FFTrees() returns an object with the FFTrees class. There are many elements in an FFTrees object; here are their names:
names(heart.fft)
## [1] "formula" "data" "cue.accuracies"
## [4] "tree.definitions" "tree.stats" "level.stats"
## [7] "decision" "levelout" "auc"
## [10] "lr" "cart"
You can view basic information about the FFTrees object by printing its name. This will give you a quick summary of the object, including how many trees it has, which cues the tree(s) use, and how well they performed.
heart.fft
## [1] "An FFTrees object containing 8 trees using 4 predictors {thal,cp,exang,slope}"
## [1] "FFTrees AUC: (Train = 0.88, Test = 0.85)"
## [1] "My favorite training tree is #5, here is how it performed:"
## train test
## n 149.00 154.00
## p(Correct) 0.83 0.77
## Hit Rate (HR) 0.91 0.79
## False Alarm Rate (FAR) 0.24 0.25
## d-prime 2.04 1.46
You can obtain marginal cue accuracy statistics from the cue.accuracies list, which contains dataframes of marginal cue accuracies. That is, for each cue, the threshold that maximizes the v-statistic (HR - FAR) in the training dataset is chosen. If the object has test data, you can also see the marginal cue accuracies in the test dataset (using the thresholds calculated from the training data):
heart.fft$cue.accuracies
## $train
## cue class threshold direction n hi mi fa cr
## 10 age numeric 53.89 >= 149 43 21 42 43
## 2 sex numeric 1 >= 149 51 13 47 38
## 7 cp character np,aa,ta != 149 50 14 22 63
## 9 trestbps numeric 138.74 > 149 27 37 16 69
## 6 chol numeric 252.32 > 149 36 28 26 59
## 1 fbs numeric 0 > 149 12 52 11 74
## 21 restecg character hypertrophy,abnormal = 149 39 25 34 51
## 11 thalach numeric 144.32 <= 149 36 28 21 64
## 22 exang numeric 1 >= 149 40 24 11 74
## 4 oldpeak numeric 0.98 > 149 42 22 24 61
## 41 slope character up != 149 52 12 31 54
## 12 ca numeric 0 > 149 41 23 21 64
## 42 thal character normal != 149 47 17 17 68
## hr far v dprime
## 10 0.671875 0.4941176 0.17775735 0.4598419
## 2 0.796875 0.5529412 0.24393382 0.6974151
## 7 0.781250 0.2588235 0.52242647 1.4233984
## 9 0.421875 0.1882353 0.23363971 0.6873190
## 6 0.562500 0.3058824 0.25661765 0.6648668
## 1 0.187500 0.1294118 0.05808824 0.2420296
## 21 0.609375 0.4000000 0.20937500 0.5310375
## 11 0.562500 0.2470588 0.31544118 0.8410851
## 22 0.625000 0.1294118 0.49558824 1.4478155
## 4 0.656250 0.2823529 0.37389706 0.9781159
## 41 0.812500 0.3647059 0.44779412 1.2330547
## 12 0.640625 0.2470588 0.39356618 1.0439043
## 42 0.734375 0.2000000 0.53437500 1.4677202
##
## $test
## cue class threshold direction n hi mi fa cr
## 1 age numeric 53.89 >= 154 58 17 33 46
## 2 sex numeric 1 >= 154 63 12 45 34
## 3 cp character np,aa,ta != 154 75 0 79 0
## 4 trestbps numeric 138.74 > 154 27 48 28 51
## 5 chol numeric 252.32 > 154 33 42 31 48
## 6 fbs numeric 0 > 154 10 65 12 67
## 7 restecg character hypertrophy,abnormal = 154 0 75 0 79
## 8 thalach numeric 144.32 <= 154 44 31 13 66
## 9 exang numeric 1 >= 154 36 39 12 67
## 10 oldpeak numeric 0.98 > 154 50 25 21 58
## 11 slope character up != 154 51 24 27 52
## 12 ca numeric 0 > 154 52 23 13 66
## 13 thal character normal != 154 54 21 17 62
## hr far v dprime
## 1 0.7733333 0.4177215 0.35561181 0.95759530
## 2 0.8400000 0.5696203 0.27037975 0.81905044
## 3 1.0000000 1.0000000 0.00000000 NaN
## 4 0.3600000 0.3544304 0.00556962 0.01492777
## 5 0.4400000 0.3924051 0.04759494 0.12208684
## 6 0.1333333 0.1518987 -0.01856540 -0.08244767
## 7 0.0000000 0.0000000 0.00000000 NaN
## 8 0.5866667 0.1645570 0.42210970 1.19487886
## 9 0.4800000 0.1518987 0.32810127 0.97817037
## 10 0.6666667 0.2658228 0.40084388 1.05622330
## 11 0.6800000 0.3417722 0.33822785 0.87533020
## 12 0.6933333 0.1645570 0.52877637 1.48122129
## 13 0.7200000 0.2151899 0.50481013 1.37138350
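The hr, far, v, and dprime columns in these tables follow directly from the raw hi / mi / fa / cr counts. Here is a base-R sketch (cue.stats() is a hypothetical helper name, not a function in the package):

```r
# Sketch: compute marginal accuracy statistics from classification counts.
# hi = hits, mi = misses, fa = false alarms, cr = correct rejections.
cue.stats <- function(hi, mi, fa, cr) {
  hr  <- hi / (hi + mi)             # hit rate
  far <- fa / (fa + cr)             # false-alarm rate
  v   <- hr - far                   # v-statistic (HR - FAR)
  dprime <- qnorm(hr) - qnorm(far)  # d-prime from signal detection theory
  c(hr = hr, far = far, v = v, dprime = dprime)
}

# Example: the 'thal' cue in the training data above
# (47 hits, 17 misses, 17 false alarms, 68 correct rejections):
round(cue.stats(hi = 47, mi = 17, fa = 17, cr = 68), 2)
```

These values reproduce the thal row of the training table (hr = 0.73, far = 0.20, v = 0.53, dprime = 1.47).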
You can also view the cue accuracies in an ROC-type plot with showcues():
showcues(heart.fft,
main = "Heartdisease Cue Accuracy")
The tree.definitions dataframe contains definitions (cues, classes, exits, thresholds, and directions) of all trees in the object:
heart.fft$tree.definitions
## tree cues nodes classes
## 1 1 thal;cp;exang;slope 4 character;character;numeric;character
## 5 2 thal;cp;exang 3 character;character;numeric
## 3 3 thal;cp;exang 3 character;character;numeric
## 7 4 thal;cp;exang;slope 4 character;character;numeric;character
## 2 5 thal;cp;exang 3 character;character;numeric
## 6 6 thal;cp;exang;slope 4 character;character;numeric;character
## 4 7 thal;cp;exang 3 character;character;numeric
## 8 8 thal;cp;exang;slope 4 character;character;numeric;character
## exits thresholds directions
## 1 0;0;0;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 5 0;0;0.5 normal;np,aa,ta;1 !=;!=;>=
## 3 0;1;0.5 normal;np,aa,ta;1 !=;!=;>=
## 7 0;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 2 1;0;0.5 normal;np,aa,ta;1 !=;!=;>=
## 6 1;0;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
## 4 1;1;0.5 normal;np,aa,ta;1 !=;!=;>=
## 8 1;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;=
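Each row of tree.definitions fully specifies a tree, so you can apply one by hand. The sketch below is illustrative (fft.apply() is a hypothetical helper, not a package function); it assumes an exit of 1 means "leave the tree on a positive decision", 0 means "leave on a negative decision", and 0.5 marks the final node, which exits on either decision:

```r
# Sketch: walk one case down a fast-and-frugal tree, node by node.
fft.apply <- function(case, cues, tests, exits) {
  for (i in seq_along(cues)) {
    positive <- tests[[i]](case[[cues[i]]])
    if (exits[i] == 1   && positive)  return(TRUE)   # exit on positive
    if (exits[i] == 0   && !positive) return(FALSE)  # exit on negative
    if (exits[i] == 0.5)              return(positive)  # final node
  }
}

# Tree #5 from tree.definitions above: thal != normal -> TRUE,
# cp in {np, aa, ta} -> FALSE, exang >= 1 -> TRUE else FALSE.
tests <- list(function(x) x != "normal",
              function(x) !(x %in% c("np", "aa", "ta")),
              function(x) x >= 1)

fft.apply(list(thal = "normal", cp = "a", exang = 0),
          c("thal", "cp", "exang"), tests, c(1, 0, 0.5))  # FALSE, at node 3

fft.apply(list(thal = "rd", cp = "a", exang = 0),
          c("thal", "cp", "exang"), tests, c(1, 0, 0.5))  # TRUE, at node 1
```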
The tree.stats list contains classification statistics for all trees applied to both the training (tree.stats$train) and test (tree.stats$test) data:
heart.fft$tree.stats$train
## tree cues nodes classes
## 1 1 thal;cp;exang;slope 4 character;character;numeric;character
## 2 2 thal;cp;exang 3 character;character;numeric
## 3 3 thal;cp;exang 3 character;character;numeric
## 4 4 thal;cp;exang;slope 4 character;character;numeric;character
## 5 5 thal;cp;exang 3 character;character;numeric
## 6 6 thal;cp;exang;slope 4 character;character;numeric;character
## 7 7 thal;cp;exang 3 character;character;numeric
## 8 8 thal;cp;exang;slope 4 character;character;numeric;character
## exits thresholds directions n hi mi fa cr
## 1 0;0;0;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 22 42 2 83
## 2 0;0;0.5 normal;np,aa,ta;1 !=;!=;>= 149 26 38 4 81
## 3 0;1;0.5 normal;np,aa,ta;1 !=;!=;>= 149 39 25 7 78
## 4 0;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 46 18 14 71
## 5 1;0;0.5 normal;np,aa,ta;1 !=;!=;>= 149 58 6 20 65
## 6 1;0;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 58 6 24 61
## 7 1;1;0.5 normal;np,aa,ta;1 !=;!=;>= 149 61 3 36 49
## 8 1;1;1;0.5 normal;np,aa,ta;1;flat,down !=;!=;>=;= 149 63 1 50 35
## hr far v dprime
## 1 0.343750 0.02352941 0.3202206 1.583520
## 2 0.406250 0.04705882 0.3591912 1.436864
## 3 0.609375 0.08235294 0.5270221 1.667108
## 4 0.718750 0.16470588 0.5540441 1.554432
## 5 0.906250 0.23529412 0.6709559 2.039533
## 6 0.906250 0.28235294 0.6238971 1.893877
## 7 0.953125 0.42352941 0.5295956 1.868812
## 8 0.984375 0.58823529 0.3961397 1.930867
The decision list contains the raw classification decisions of each tree for each training (and test) case.
Here is how each tree classified the first five cases in the training data:
heart.fft$decision$train[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The levelout list contains the levels at which each case was classified by each tree.
Here are the levels at which the first five test cases were classified:
heart.fft$levelout$test[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 4 3 2 2 1 1 1 1
## [2,] 1 1 1 1 3 4 2 2
## [3,] 3 3 2 2 1 1 1 1
## [4,] 3 3 2 2 1 1 1 1
## [5,] 1 1 1 1 2 2 3 4
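One thing levelout makes easy to compute is how frugal each tree is in practice, i.e., the mean level at which it classifies cases. For the five test cases shown above:

```r
# The exit levels of the first 5 test cases (rows) under each of the
# 8 trees (columns), copied from heart.fft$levelout$test above:
levels.m <- rbind(c(4, 3, 2, 2, 1, 1, 1, 1),
                  c(1, 1, 1, 1, 3, 4, 2, 2),
                  c(3, 3, 2, 2, 1, 1, 1, 1),
                  c(3, 3, 2, 2, 1, 1, 1, 1),
                  c(1, 1, 1, 1, 2, 2, 3, 4))

# Mean exit level per tree: lower values mean the tree needed fewer
# cues, on average, to classify these cases.
colMeans(levels.m)
# 2.4 2.2 1.6 1.6 1.6 1.8 1.6 1.8
```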
Once you’ve created an FFTrees object, you can use it to predict new data with predict(). This returns a new FFTrees object with the new data used as test data; any existing test data in the object are overwritten, but all training data are preserved. In this example, I’ll use the heart.fft object to make predictions for cases 1 through 50 in the heartdisease dataset:
heart.fft <- predict(heart.fft,
data.test = heartdisease[1:50,]
)
When you look at heart.fft now, you’ll see that the new data (with 50 cases) are stored as test data:
heart.fft
## [1] "An FFTrees object containing 8 trees using 4 predictors {thal,cp,exang,slope}"
## [1] "FFTrees AUC: (Train = 0.88, Test = 0.87)"
## [1] "My favorite training tree is #5, here is how it performed:"
## train test
## n 149.00 50.00
## p(Correct) 0.83 0.78
## Hit Rate (HR) 0.91 0.80
## False Alarm Rate (FAR) 0.24 0.23
## d-prime 2.04 1.57
Once you’ve created an FFTrees object using FFTrees(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (here, tree #5) applied to the test data:
plot(heart.fft,
main = "Heart Disease",
decision.names = c("Healthy", "Disease")
)
You can also visualize the individual cue accuracies with the showcues() function:
showcues(heart.fft)
See the vignette on plotting trees for more details on visualizing trees.
The FFTrees() function has several additional arguments that change how trees are built. Note: not all of these arguments have been fully tested yet!
train.p: What percentage of the data should be used for training? train.p = .1 will randomly select 10% of the data for training and leave the remaining 90% for testing. Setting train.p = 1 will fit the trees to the entire dataset (with no testing).
hr.weight: How much weight should be given to maximizing hits versus avoiding false alarms when building the tree? The default, hr.weight = .5, treats both measures equally. However, if your decision problem strongly favors maximizing hits over avoiding false alarms, you may wish to set hr.weight to a higher value such as 0.75. When you do, the tree-growth algorithm will favor cues that maximize hits over those that minimize false alarms.
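One way to picture the effect of hr.weight is as a weighted version of the v-statistic. The formula below is illustrative only; it is not necessarily the exact criterion FFTrees optimizes internally:

```r
# Sketch: score a cue by weighting its hit rate against its
# false-alarm rate (illustrative, not the package's internal code).
weighted.v <- function(hr, far, hr.weight = 0.5) {
  hr.weight * hr - (1 - hr.weight) * far
}

# With equal weights, a high-HR cue and a low-FAR cue score about
# the same; weighting hits more heavily favors the high-HR cue:
weighted.v(0.78, 0.26)                    # 0.26
weighted.v(0.63, 0.13)                    # 0.25
weighted.v(0.78, 0.26, hr.weight = 0.75)  # 0.52
weighted.v(0.63, 0.13, hr.weight = 0.75)  # 0.44
```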
rank.method: As trees are being built, should cues be selected based on their marginal accuracy (rank.method = "m"), computed on the entire dataset, or on their conditional accuracy (rank.method = "c"), computed on only the cases that have not yet been classified? Each method has potential pros and cons. The marginal method is much faster to implement and may be less prone to overfitting. However, the conditional method can capture important conditional dependencies between cues that the marginal method misses. Additionally, rank.method = "c" allows the same cue to be used multiple times in the tree; when a cue has a strong non-monotonic relationship with the criterion, this can greatly improve performance.
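The difference between the two ranking methods can be sketched in a few lines of base R. The v.stat() helper and the toy cues below are illustrative, not the package's internal code:

```r
# HR - FAR for a binary cue against a logical criterion.
v.stat <- function(cue, crit) {
  mean(cue[crit]) - mean(cue[!crit])
}

# Toy data: cue2 only discriminates among cases where cue1 == 0.
crit <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
cue1 <- c(1, 1, 0, 0, 0, 0, 0, 0)
cue2 <- c(1, 0, 1, 1, 0, 0, 1, 1)

# Marginal ranking scores each cue once, on all cases:
c(cue1 = v.stat(cue1, crit), cue2 = v.stat(cue2, crit))  # 0.50, 0.25

# Conditional ranking re-scores cue2 on only the cases that a first
# node on cue1 leaves unclassified (here, those with cue1 == 0):
left <- cue1 == 0
v.stat(cue2[left], crit[left])  # 0.50
```

Marginally, cue1 looks better (v = .50 vs. .25), but among the cases cue1 leaves unclassified, cue2's conditional v rises to .50, a dependency that marginal ranking misses.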