The fft() function is at the heart of the FFTrees package. It takes a training dataset as an argument and generates several fast-and-frugal trees (FFTs) (more details about the algorithms coming soon…)
Let’s start with an example: we’ll create FFTs fitted to the heartdisease dataset. Here’s how the dataset looks:
head(heartdisease)
##   age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal
## 1  63   1  1      145  233   1       2     150     0     2.3     3  0    6
## 2  67   1  4      160  286   0       2     108     1     1.5     2  3    3
## 3  67   1  4      120  229   0       2     129     1     2.6     2  2    7
## 4  37   1  3      130  250   0       0     187     0     3.5     3  0    3
## 5  41   0  2      130  204   0       2     172     0     1.4     1  0    3
## 6  56   1  2      120  236   0       0     178     0     0.8     1  0    3
##   diagnosis
## 1         0
## 2         1
## 3         1
## 4         0
## 5         0
## 6         0
We’ll create a new fft object called heart.fft using the fft() function. We’ll set the criterion to heartdisease$diagnosis and use all other columns (heartdisease[,names(heartdisease) != "diagnosis"]) as potential predictors. Additionally, we’ll define two parameters:
train.p = .5: Train the trees on a random sample of 50% of the original training dataset, and test the trees on the remaining 50%.
max.levels = 4: The maximum number of levels (i.e., cues) the trees will consider is 4. Because each of the max.levels - 1 levels can have two exit structures, this will lead to \(2^{3} = 8\) possible trees.
set.seed(100) # For reproducibility
heart.fft <- fft(
  train.cue.df = heartdisease[,names(heartdisease) != "diagnosis"],
  train.criterion.v = heartdisease$diagnosis,
  train.p = .5,
  max.levels = 4
  )
As you can see, fft() returns an object with the fft class:
class(heart.fft)
## [1] "fft"
There are many elements in an fft object:
names(heart.fft)
##  [1] "trees"             "cue.accuracies"    "cart"             
##  [4] "lr"                "train.cue"         "train.crit"       
##  [7] "test.cue"          "test.crit"         "train.decision.df"
## [10] "test.decision.df"  "train.levelout.df" "test.levelout.df" 
## [13] "best.train.tree"   "best.test.tree"
The cue.accuracies dataframe contains the original, marginal cue accuracies. That is, for each cue, the threshold that maximizes v (hr - far, the hit rate minus the false-alarm rate) is chosen (this is done using the cuerank() function):
heart.fft$cue.accuracies
##     cue.name cue.class level.threshold level.sigdirection hi mi fa cr
## 12       age   numeric              57                 >= 44 22 26 59
## 2        sex   numeric               1                 >= 53 13 48 37
## 4         cp   numeric               4                 >= 48 18 18 67
## 9   trestbps   numeric             139                 >= 26 40 21 64
## 7       chol   numeric             218                 >= 51 15 56 29
## 21       fbs   numeric               1                 >= 10 56  9 76
## 1    restecg   numeric               0                  > 40 26 35 50
## 121  thalach   numeric             154                  < 43 23 27 58
## 11     exang   numeric               0                  > 31 35 14 71
## 22   oldpeak   numeric               1                 >= 41 25 21 64
## 13     slope   numeric               1                  > 45 21 27 58
## 14        ca   numeric               0                  > 47 19 19 66
## 15      thal   numeric               3                  > 47 19 16 69
##            hr       far         v    dprime correction hr.weight
## 12  0.6666667 0.3058824 0.3607843 0.4691417       0.25       0.5
## 2   0.8030303 0.5647059 0.2383244 0.3447918       0.25       0.5
## 4   0.7272727 0.2117647 0.5155080 0.7024492       0.25       0.5
## 9   0.3939394 0.2470588 0.1468806 0.2073541       0.25       0.5
## 7   0.7727273 0.6588235 0.1139037 0.1693021       0.25       0.5
## 21  0.1515152 0.1058824 0.0456328 0.1093854       0.25       0.5
## 1   0.6060606 0.4117647 0.1942959 0.2460370       0.25       0.5
## 121 0.6515152 0.3176471 0.3338681 0.4318515       0.25       0.5
## 11  0.4696970 0.1647059 0.3049911 0.4496339       0.25       0.5
## 22  0.6212121 0.2470588 0.3741533 0.4962201       0.25       0.5
## 13  0.6818182 0.3176471 0.3641711 0.4735389       0.25       0.5
## 14  0.7121212 0.2235294 0.4885918 0.6599599       0.25       0.5
## 15  0.7121212 0.1882353 0.5238859 0.7220052       0.25       0.5
Here, we can see that the thal cue had the highest v value of 0.5239 while cp had the second highest v value of 0.5155.
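The threshold search described for cue.accuracies can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions, not the code inside cuerank(): the helper best_threshold() and the toy vectors are hypothetical, only the ">=" direction is tried, and the correction column shown in the output is ignored.

```r
# Hypothetical helper (not part of the package): try every observed value of a
# numeric cue as a ">=" threshold and keep the one with the largest v = hr - far.
best_threshold <- function(cue, criterion) {
  candidates <- sort(unique(cue))
  stats <- lapply(candidates, function(th) {
    decision <- cue >= th
    hi <- sum(decision & criterion == 1)    # hits
    mi <- sum(!decision & criterion == 1)   # misses
    fa <- sum(decision & criterion == 0)    # false alarms
    cr <- sum(!decision & criterion == 0)   # correct rejections
    hr <- hi / (hi + mi)
    far <- fa / (fa + cr)
    data.frame(threshold = th, hi, mi, fa, cr, hr, far, v = hr - far)
  })
  stats <- do.call(rbind, stats)
  stats[which.max(stats$v), ]
}

# Toy data, invented for illustration only
toy.cue <- c(35, 41, 48, 52, 57, 60, 63, 67)
toy.criterion <- c(0, 0, 0, 1, 1, 1, 0, 1)
best <- best_threshold(toy.cue, toy.criterion)
best

# Recompute v for the age cue from the counts printed in cue.accuracies
age.v <- 44 / (44 + 22) - 26 / (26 + 59)
round(age.v, 7)
```

On the toy data the search settles on a threshold of 52 with v = 0.75. The last two lines recompute v for the age cue from its hi, mi, fa, and cr counts and give 0.3607843, the value shown in the table.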
The trees dataframe contains all tree definitions and training (and possibly test) statistics for all (\(2^{max.levels - 1}\)) trees. For our heart.fft example, there are \(2^{4 - 1} = 8\) trees.
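To see where the count of 8 comes from, the exit structures can be enumerated directly. The sketch below assumes that each of the first max.levels - 1 levels exits on either 0 or 1 and that the final level exits in both directions (this seems to be what the value 0.5 in the level.exit column of heart.fft$trees stands for). The object name exits and its column names are made up for illustration.

```r
max.levels <- 4

# One column per non-final level; each level can exit on 0 or on 1
exits <- expand.grid(rep(list(c(0, 1)), max.levels - 1))
names(exits) <- paste0("level.", seq_len(max.levels - 1))

# Assumption: the final level exits in both directions, coded here as 0.5
exits$level.final <- 0.5

exits
nrow(exits)   # 2^(max.levels - 1)
```

Each row is one candidate exit structure, so nrow(exits) is 8 for max.levels = 4.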
Tree definitions (exit directions, cue order, and cue thresholds) are contained in columns 1 through 6:
heart.fft$trees[,1:6]   # Tree info is in columns 1:6
##   tree.num         level.name                     level.class level.exit
## 1        1 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  0;0;0;0.5
## 2        2         thal;cp;ca         numeric;numeric;numeric    1;0;0.5
## 3        3         thal;cp;ca         numeric;numeric;numeric    0;1;0.5
## 4        4 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  1;1;0;0.5
## 5        5         thal;cp;ca         numeric;numeric;numeric    0;0;0.5
## 6        6         thal;cp;ca         numeric;numeric;numeric    1;0;0.5
## 7        7         thal;cp;ca         numeric;numeric;numeric    0;1;0.5
## 8        8 thal;cp;ca;oldpeak numeric;numeric;numeric;numeric  1;1;1;0.5
##   level.threshold level.sigdirection
## 1         3;4;0;1          >;>=;>;>=
## 2           3;4;0             >;>=;>
## 3           3;4;0             >;>=;>
## 4         3;4;0;1          >;>=;>;>=
## 5           3;4;0             >;>=;>
## 6           3;4;0             >;>=;>
## 7           3;4;0             >;>=;>
## 8         3;4;0;1          >;>=;>;>=
Training statistics are contained in columns 7:15 and have the .train suffix.
heart.fft$trees[,7:15]   # Training stats are in columns 7:15
##   n.train hi.train mi.train fa.train cr.train  hr.train  far.train
## 1     151       21       45        0       85 0.3181818 0.00000000
## 2     151       54       12       18       67 0.8181818 0.21176471
## 3     151       44       22        7       78 0.6666667 0.08235294
## 4     151       59        7       32       53 0.8939394 0.37647059
## 5     151       28       38        2       83 0.4242424 0.02352941
## 6     151       54       12       18       67 0.8181818 0.21176471
## 7     151       44       22        7       78 0.6666667 0.08235294
## 8     151       64        2       52       33 0.9696970 0.61176471
##     v.train dprime.train
## 1 0.3166249    1.1436132
## 2 0.6064171    0.8543855
## 3 0.5843137    0.9100723
## 4 0.5174688    0.7812587
## 5 0.4007130    0.8973592
## 6 0.6064171    0.8543855
## 7 0.5843137    0.9100723
## 8 0.3579323    0.7962186
For our heart disease dataset, it looks like trees 2 and 6 had the highest training v (HR - FAR) values.
Test statistics are contained in columns 16:24 and have the .test suffix.
heart.fft$trees[,16:24]   # Test stats are in columns 16:24
##   n.test hi.test mi.test fa.test cr.test   hr.test  far.test    v.test
## 1    152      23      50       0      79 0.3150685 0.0000000 0.3131819
## 2    152      64       9      19      60 0.8767123 0.2405063 0.6362060
## 3    152      49      24       8      71 0.6712329 0.1012658 0.5699671
## 4    152      69       4      35      44 0.9452055 0.4430380 0.5021675
## 5    152      28      45       0      79 0.3835616 0.0000000 0.3812091
## 6    152      64       9      19      60 0.8767123 0.2405063 0.6362060
## 7    152      49      24       8      71 0.6712329 0.1012658 0.5699671
## 8    152      72       1      56      23 0.9863014 0.7088608 0.2774406
##   dprime.test
## 1   1.1271540
## 2   0.9316912
## 3   0.8588460
## 4   0.8716571
## 5   1.2191190
## 6   0.9316912
## 7   0.8588460
## 8   0.8278755
It looks like trees 2 and 6 also had the highest test v (HR - FAR) values.
The best trees for training and testing are in best.train.tree and best.test.tree. That is, which of the trees had the best performance (in terms of v (HR - FAR)) in the training dataset and which had the best performance in the test dataset? We want these two values to be the same. If they are different, then the tree algorithm might be over-fitting to the training dataset.
# Which tree had the best training statistics?
heart.fft$best.train.tree
## [1] 2
# Which tree had the best testing statistics?
heart.fft$best.test.tree
## [1] 2
This is a good sign for our heartdisease dataset. It means that tree 2 did the best for both training and test.
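The same comparison can be reproduced by hand from the v.train and v.test columns printed earlier. The sketch below copies those values into plain vectors (the vector names are illustrative) and uses which.max(), which returns the first maximum when there are ties. Whether fft() breaks ties in the same way is an assumption.

```r
# v values copied from heart.fft$trees (trees 1 to 8)
v.train <- c(0.3166249, 0.6064171, 0.5843137, 0.5174688,
             0.4007130, 0.6064171, 0.5843137, 0.3579323)
v.test  <- c(0.3131819, 0.6362060, 0.5699671, 0.5021675,
             0.3812091, 0.6362060, 0.5699671, 0.2774406)

which.max(v.train)               # 2
which.max(v.test)                # 2

# Trees 2 and 6 are tied in training
which(v.train == max(v.train))   # 2 6
```

Trees 2 and 6 have identical definitions in this example, so the tie does not change the conclusion.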
The train.decision.df and test.decision.df contain the raw classification decisions for each tree for each training (and test) case.
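How a single tree turns cue values into a decision and an exit level can be sketched as follows. This is a hedged reading of the tree definitions in heart.fft$trees, not the package's own code: apply_fft() is a hypothetical helper and the four toy cases are invented. The sketch assumes that a cue signals 1 when the comparison in level.sigdirection holds, that a case leaves the tree at a level when the cue's signal equals that level's exit, and that an exit of 0.5 classifies every remaining case.

```r
# Hypothetical helper (not part of the package): apply one tree to a data frame
apply_fft <- function(cases, cues, directions, thresholds, exits) {
  decision <- rep(NA_real_, nrow(cases))
  levelout <- rep(NA_integer_, nrow(cases))
  for (i in seq_along(cues)) {
    open <- is.na(decision)                  # cases not yet classified
    compare <- match.fun(directions[i])      # e.g. ">" or ">="
    signal <- as.numeric(compare(cases[[cues[i]]], thresholds[i]))
    # Assumption: 0.5 means the level exits in both directions
    leaves <- if (exits[i] == 0.5) open else open & signal == exits[i]
    decision[leaves] <- signal[leaves]
    levelout[leaves] <- i
  }
  data.frame(decision, levelout)
}

# Toy cases, invented for illustration only
toy <- data.frame(thal = c(3, 7, 3, 3),
                  cp   = c(4, 1, 2, 4),
                  ca   = c(0, 0, 2, 1))

# Definition of tree 2 as printed in heart.fft$trees
res <- apply_fft(toy,
                 cues = c("thal", "cp", "ca"),
                 directions = c(">", ">=", ">"),
                 thresholds = c(3, 4, 0),
                 exits = c(1, 0, 0.5))
res
```

For the toy cases this gives decisions 0, 1, 0, 1 at levels 3, 1, 2, 3: the second case exits positively on thal at level 1, the third exits negatively on cp at level 2, and the other two are only classified at the final level.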
Here are the decisions of each of the 8 trees for the first 5 training cases:
heart.fft$train.decision.df[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      0      0      0      0      0      0      0      0
## 2      0      0      0      1      0      0      0      1
## 3      0      0      0      0      0      0      0      1
## 4      0      1      0      1      0      1      0      1
## 5      0      0      0      0      0      0      0      0
The train.levelout.df and test.levelout.df contain the levels at which each case was classified for each tree.
Here are the levels at which the first 5 training cases were classified:
heart.fft$train.levelout.df[1:5,]
##   tree.1 tree.2 tree.3 tree.4 tree.5 tree.6 tree.7 tree.8
## 1      1      2      1      3      1      2      1      4
## 2      1      2      1      4      1      2      1      3
## 3      1      2      1      4      1      2      1      3
## 4      2      1      3      1      2      1      3      1
## 5      1      2      1      3      1      2      1      4
The cart and lr dataframes contain information about how CART (using the rpart package) and Logistic Regression performed on the same data.
The cart dataframe shows training and test statistics using different miss and false alarm costs (the standard tree is in the first row where the miss and false alarm costs are both set to 1).
heart.fft$cart
##    miss.cost fa.cost  hr.train  far.train   v.train dprime.train   hr.test
## 1          1       1 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 2          2       1 0.8030303 0.11764706 0.6853832    1.0196632 0.6438356
## 3          3       1 0.5303030 0.04705882 0.4832442    0.8750488 0.4657534
## 4          4       1 0.3181818 0.00000000 0.3166249    1.1436132 0.3561644
## 5          5       1 0.4090909 0.01176471 0.3973262    1.0174217 0.3972603
## 6          1       2 0.9242424 0.24705882 0.6771836    1.0589873 0.7945205
## 7          3       2 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 8          4       2 0.8030303 0.11764706 0.6853832    1.0196632 0.6438356
## 9          5       2 0.8030303 0.11764706 0.6853832    1.0196632 0.7123288
## 10         1       3 0.9545455 0.28235294 0.6721925    1.1332437 0.8493151
## 11         2       3 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 14         1       4 0.9696970 0.35294118 0.6167558    1.1268753 0.8630137
## 15         2       4 0.9242424 0.24705882 0.6771836    1.0589873 0.7945205
## 16         3       4 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
## 18         1       5 1.0000000 0.44705882 0.5488722    1.4026303 0.9315068
## 19         2       5 0.9545455 0.28235294 0.6721925    1.1332437 0.8493151
## 20         3       5 0.9545455 0.27058824 0.6839572    1.1508283 0.7671233
## 21         4       5 0.8333333 0.14117647 0.6921569    1.0212351 0.6849315
##      far.test    v.test dprime.test               cart.cues.vec
## 1  0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 2  0.13924051 0.5045951   0.7262341              thal;ca;ca;age
## 3  0.03797468 0.4277787   0.8443696                     thal;ca
## 4  0.01265823 0.3435062   0.9339046             thal;ca;oldpeak
## 5  0.05063291 0.3466274   0.6891513     thal;oldpeak;ca;oldpeak
## 6  0.39240506 0.4021155   0.5476317   thal;ca;thalach;age;cp;ca
## 7  0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 8  0.13924051 0.5045951   0.7262341              thal;ca;ca;age
## 9  0.16455696 0.5477718   0.7680505               thal;ca;ca;cp
## 10 0.50632911 0.3429860   0.5088174 thal;ca;thalach;age;cp;chol
## 11 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 14 0.51898734 0.3440264   0.5231738         thal;ca;thalach;age
## 15 0.39240506 0.4021155   0.5476317   thal;ca;thalach;age;cp;ca
## 16 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
## 18 0.62025316 0.3112537   0.5904811     thal;ca;thalach;age;age
## 19 0.50632911 0.3429860   0.5088174 thal;ca;thalach;age;cp;chol
## 20 0.46835443 0.2987689   0.4044065 thal;ca;thalach;age;ca;chol
## 21 0.24050633 0.4444252   0.5931044             thal;ca;ca;chol
The lr data frame shows training and test statistics using different probability thresholds for decisions. A threshold value of 0.5 is the standard logistic regression model.
heart.fft$lr
##   threshold  hr.train  far.train   hr.test   far.test
## 1       0.9 0.4545455 0.01176471 0.4520548 0.01265823
## 2       0.8 0.6212121 0.03529412 0.6164384 0.01265823
## 3       0.7 0.6969697 0.05882353 0.6575342 0.03797468
## 4       0.6 0.7727273 0.08235294 0.7534247 0.06329114
## 5       0.5 0.8181818 0.09411765 0.7808219 0.16455696
## 6       0.4 0.8636364 0.14117647 0.8082192 0.20253165
## 7       0.3 0.8787879 0.21176471 0.8356164 0.26582278
## 8       0.2 0.9090909 0.25882353 0.9041096 0.40506329
## 9       0.1 0.9696970 0.51764706 0.9726027 0.58227848
Once you’ve created an fft object using fft(), you can visualize the tree (and ROC curves) using plot(). The following code will visualize the best training tree (tree 2) applied to the test data:
plot(heart.fft,
     which.tree = "best.train",
     which.data = "test",
     description = "Heart Disease",
     decision.names = c("Healthy", "Disease")
     )
See the vignette on plot.fft (vignette("fft_plot", package = "fft")) for more details.
The fft() function has several additional arguments that change how trees are built. Note: Not all of these arguments have been fully tested yet!
train.p: What proportion of the data should be used for training? train.p = .1 will randomly select 10% of the data for training and leave the remaining 90% for testing. Setting train.p = 1 will fit the trees to the entire dataset (with no testing).
test.cue.df, test.criterion.v: If you have a specific set of data that you want to test the trees on, you can specify it here. If you do, the function will use the entire training data (train.cue.df, train.criterion.v) for training and will then apply the trained trees to the test data you specify. This bypasses the train.p argument.
rank.method: As trees are being built, should cues be selected based on their marginal accuracy (rank.method = "m") applied to the entire dataset, or on their conditional accuracy (rank.method = "c") applied to all cases that have not yet been classified? Each method has potential pros and cons. The marginal method is much faster to run and may be less prone to over-fitting. However, the conditional method could capture important conditional dependencies between cues that the marginal method misses.
stopping.rule, stopping.par: When should trees stop growing? While all trees will (currently) stop if the number of levels hits max.levels, you can also stop trees using additional criteria.
stopping.rule = "levels" will always stop the tree at the level indicated by stopping.par (in this case, it makes more sense to just set max.levels to the number of levels you want to stop at).
stopping.rule = "exemplars" will stop the tree if only a small percentage of cases remain unclassified. This percentage is indicated by stopping.par. For example, stopping.par = .05 will stop the tree if less than 5% of cases remain.
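Two of these arguments are easy to illustrate in base R. The sketch below is not taken from the package: split_rows() and stop_exemplars() are hypothetical helpers, and the use of floor() for the training size and of the total number of training cases as the denominator are assumptions.

```r
set.seed(100)  # For reproducibility

# Hypothetical helper: split row indices into a training and a test set
split_rows <- function(n, train.p) {
  train <- sort(sample(n, size = floor(n * train.p)))
  list(train = train, test = setdiff(seq_len(n), train))
}

# Hypothetical helper: TRUE when the share of unclassified cases
# has dropped below stopping.par
stop_exemplars <- function(n.remaining, n.total, stopping.par) {
  (n.remaining / n.total) < stopping.par
}

idx <- split_rows(303, train.p = .5)
length(idx$train)   # 151
length(idx$test)    # 152

stop_exemplars(7, 151, stopping.par = .05)   # TRUE, about 4.6% remain
stop_exemplars(8, 151, stopping.par = .05)   # FALSE, about 5.3% remain
```

With 303 rows and train.p = .5 this yields 151 training and 152 test rows, the same sizes as n.train and n.test in the example output.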