We usually create fast-and-frugal trees (FFTs) from data by using the
FFTrees() function (see the Main
guide: FFTrees overview and the vignette on Creating FFTs with FFTrees() for
details). However, we occasionally want to design and test a specific
FFT (e.g., to check a hypothesis or use some variables based on
theoretical considerations).
There are two ways to define fast-and-frugal trees manually when
using the FFTrees() function:

1. as a sentence, using the my.tree argument (the easier way), or
2. as a data frame, using the tree.definitions argument (the harder way).
Both of these methods require some data to evaluate the performance of FFTs, but bypass the tree construction algorithms built into the FFTrees package. As manually created FFTs are not optimized for the data, the conceptual distinction between fitting data and predicting data disappears for such FFTs. Although we can still distinguish between two sets of ‘train’ vs. ‘test’ data, a manually defined FFT should not be expected to perform systematically better on ‘train’ data than on ‘test’ data.
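To illustrate this point, here is a minimal sketch (using the heart.train and heart.test datasets included in the FFTrees package and the my.tree argument introduced below; the specific tree is arbitrary). The same manually defined FFT is simply evaluated on both datasets, rather than being fitted to the ‘train’ data:

library(FFTrees)

# A manually defined FFT is evaluated on both datasets (not fitted to 'train'):
fft_by_hand <- FFTrees(diagnosis ~ .,
                       data = heart.train,       # 'train' data (evaluation only)
                       data.test = heart.test,   # 'test' data (evaluation only)
                       my.tree = "If thal = {rd, fd}, predict True.
                                  If cp != {a}, predict False.
                                  If ca > 0, predict True.
                                  Otherwise, predict False.")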
my.tree

The first method is to use the my.tree argument, where
my.tree is a sentence describing a (single) FFT. When this
argument is specified in FFTrees(), the function
(specifically, an auxiliary fftrees_wordstofftrees()
function) will try to convert the verbal description into the definition
of an FFT (of an FFTrees object).
For example, let’s look at the heartdisease data to find
out how some predictor variables (e.g., sex,
age, etc.) predict the criterion variable
(diagnosis):
| sex | age | thal | cp | ca | diagnosis |
|---|---|---|---|---|---|
| 1 | 63 | fd | ta | 0 | FALSE |
| 1 | 67 | normal | a | 3 | TRUE |
| 1 | 67 | rd | a | 2 | TRUE |
| 1 | 37 | normal | np | 0 | FALSE |
| 0 | 41 | normal | aa | 0 | FALSE |
| 1 | 56 | normal | aa | 0 | FALSE |
Here’s how we could verbally describe an FFT by using the first three cues in conditional sentences:
in_words <- "If sex = 1, predict True.
If age < 45, predict False.
If thal = {fd, normal}, predict True.
Otherwise, predict False."As we will see shortly, the FFTrees() function accepts
such descriptions (assigned here to a character string
in_words) as its my.tree argument, create a
corresponding FFT, and evaluate it on a corresponding dataset.
Here are some instructions for manually specifying trees (a brief syntax example follows this list):

- Each node must start with the word “If” and should correspond to the form:
  If <CUE> <DIRECTION> <THRESHOLD>, predict <EXIT>.
- Numeric thresholds should be specified directly (without brackets), like age > 21.
- For categorical variables, factor thresholds must be specified within curly braces, like sex = {male}. For factors with sets of values, categories within a threshold should be separated by commas, like eyecolor = {blue,brown}.
- To specify cue directions, the standard logical comparisons =, !=, <, >= (etc.) are valid. For numeric cues, only use >, >=, <, or <=. For factors, only use = or !=.
- Positive exits are indicated by True, while negative exits are specified by False.
- The final node of an FFT is always bi-directional (i.e., has both a positive and a negative exit). The description of the final node always mentions its positive (True) exit first. The text Otherwise, predict EXIT that we have included in the example above is actually not necessary (and ignored).
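For instance, the following (hypothetical) description is consistent with these rules: it combines a factor cue with !=, numeric cues with >= and >, and a bi-directional final node whose positive (True) exit is mentioned first (cue names are taken from the heartdisease data; the tree itself is arbitrary):

# A hypothetical description that follows the syntax rules above:
syntax_demo <- "If cp != {a}, predict False.
                If age >= 60, predict True.
                If ca > 0, predict True.
                Otherwise, predict False."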
Now, let’s use our verbal description of an FFT (assigned to
in_words above) as the my.tree argument of the
FFTrees() function. This creates a corresponding FFT and
applies it to the heartdisease data:
# Create FFTrees from a verbal FFT description (as my.tree):
my_fft <- FFTrees(diagnosis ~ .,
                  data = heartdisease,
                  main = "My 1st FFT",
                  my.tree = in_words)

#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
Let’s see how well our manually constructed FFT (my_fft)
did:
# Inspect FFTrees object:
plot(my_fft)
Figure 1: An FFT manually constructed using the
my.tree argument of FFTrees().
When manually constructing a tree, the resulting FFTrees
object only contains a single FFT. Hence, the ROC plot (in the
bottom-right panel of Figure 1) cannot show a range of FFTs,
but locates the constructed FFT in ROC space.
As it turns out, the performance of our first FFT created from a verbal description is a mixed affair: The tree has a rather high sensitivity (of 91%), but its low specificity (of only 10%) allows for many false alarms. Consequently, its accuracy measures are only around baseline level.
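Beyond the plot, the corresponding numbers can be read from the FFTrees object itself. Here is a brief sketch, using the same object components (summary() and the trees$stats element) that appear later in this vignette:

# Inspect the performance of the manually defined FFT numerically:
summary(my_fft)             # summary of tree definition and statistics
my_fft$trees$stats$train    # accuracy statistics (e.g., sens, spec, bacc) on 'train' data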
Let’s see if we can come up with a better FFT. The following example
uses the cues thal, cp, and ca in
the my.tree argument:
# Create 2nd FFTrees from an alternative FFT description (as my.tree):
my_fft_2 <- FFTrees(diagnosis ~ .,
                    data = heartdisease,
                    main = "My 2nd FFT",
                    my.tree = "If thal = {rd,fd}, predict True.
                               If cp != {a}, predict False.
                               If ca > 1, predict True.
                               Otherwise, predict False.")

#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Aiming to create an FFT from 'my.tree' description:
#> Successfully created an FFT from 'my.tree' description.
#> Successfully defined 1 FFT.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
# Inspect FFTrees object:
plot(my_fft_2)
Figure 2: Another FFT manually constructed using the
my.tree argument of FFTrees().
This alternative FFT balances sensitivity and specificity nicely and performs much better overall. Nevertheless, it is still far from perfect, so check whether you can create even better ones!
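To back up this impression with numbers, we could contrast the training statistics of both manually defined FFTs (a sketch relying on the object structure used throughout this vignette):

# Compare balanced accuracy (bacc) of both manual FFTs on 'train' data:
my_fft$trees$stats$train$bacc      # 1st FFT (from in_words)
my_fft_2$trees$stats$train$bacc    # 2nd FFT (using thal, cp, and ca)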
tree.definitions

More experienced users may want to define and evaluate more than one
FFT at a time. To achieve this, the FFTrees() function
allows providing a set of tree definitions (as a data frame) via its
tree.definitions argument. However, as questions regarding specific trees usually arise
late in an exploration of FFTs, the tree.definitions
argument is mostly used in combination with an existing
FFTrees object x. In this case, the parameters
(e.g., regarding the formula, data, and goals
to be used) from x are retained, but its tree definitions
(stored in x$trees$definitions) are replaced by those in
tree.definitions and the object is re-evaluated for those
FFTs.
We illustrate a typical workflow by redefining some FFTs that were
built in the Tutorial: FFTs for heart
disease and evaluating them on the (full) heartdisease
data.
First, we use the default algorithms to create an
FFTrees object x:
# Create an FFTrees object x:
x <- FFTrees(formula = diagnosis ~ .,                       # criterion and (all) predictors
             data = heart.train,                            # training data
             data.test = heart.test,                        # testing data
             main = "Heart Disease 1",                      # initial label
             decision.labels = c("low risk", "high risk"),  # exit labels
             quiet = TRUE)                                  # hide user feedback

As we have seen in the Tutorial,
evaluating this expression yields a set of 7 FFTs. Rather than
evaluating them individually (by issuing print(x) or
plot(x) commands to inspect specific trees), we can obtain
both their definitions and their performance characteristics on a
variety of measures either by running summary(x) or by
inspecting corresponding parts of the FFTrees object. For
instance, the following alternatives would both show the current
definitions of the generated FFTs:
# Tree definitions of x:
# summary(x)$definitions # from summary()
x$trees$definitions     # from FFTrees object x

#> # A tibble: 7 × 7
#> tree nodes classes cues directions thresholds exits
#> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1;0;0.5
#> 2 2 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;0;1;0.5
#> 3 3 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;1;0.5
#> 4 4 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;1;0;0.5
#> 5 5 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;0;0.5
#> 6 6 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 0;0;0;0.5
#> 7 7 4 c;c;n;c thal;cp;ca;slope =;=;>;= rd,fd;a;0;flat,down 1;1;1;0.5
Each line in these tree definitions defines an FFT in the context of
our current FFTrees object x (see the vignette
on Creating FFTs with FFTrees() for
help on interpreting tree definitions). As the “ifan” algorithm
responsible for creating these trees yields a family of highly similar
FFTs (the FFTs vary only in their exit structures, and some truncate the last
cue), we may want to examine alternative versions of these trees.
To demonstrate how to create and evaluate manual FFT definitions, we copy the existing tree definitions (as a data frame), select three FFTs (rows), and then create a 4th definition (with a different exit structure):
# 0. Copy and choose some existing FFT definitions:
tree_df <- x$trees$definitions # get FFT definitions (as df)
tree_df <- tree_df[c(1, 3, 5), ] # filter 3 particular FFTs
# 1. Add a tree with 1;1;0.5 exit structure (a "rake" tree with Signal bias):
tree_df[4, ] <- tree_df[1, ] # initialize new FFT #4 (as copy of FFT #1)
tree_df$exits[4] <- c("1; 1; 0.5") # modify exits of FFT #4
tree_df$tree <- 1:nrow(tree_df) # adjust tree numbers
# tree_df

Moreover, let’s define four additional FFTs that reverse the order of
the 1st and 2nd cues. As both cues are categorical (i.e., of
class c) and have the same direction (i.e.,
=), we only need to reverse the thresholds (so
that they correspond to the new cue order):
# 2. Change cue orders:
tree_df[5:8, ] <- tree_df[1:4, ] # 4 new FFTs (as copies of existing ones)
tree_df$cues[5:8] <- "cp; thal; ca" # modify order of cues
tree_df$thresholds[5:8] <- "a; rd,fd; 0" # modify order of thresholds accordingly
tree_df$tree <- 1:nrow(tree_df) # adjust tree numbers
# tree_df

The resulting data frame tree_df contains the
definitions of eight FFTs. The first three are copies of trees
in x, but the other five are new.
tree.definitions

We can evaluate this set by running the FFTrees()
function with the previous FFTrees object x
(i.e., with its formula and data settings) and
specifying tree_df in the tree.definitions
argument:
# Create a modified FFTrees object y:
y <- FFTrees(object = x,                   # use previous FFTrees object x
             tree.definitions = tree_df,   # but with new tree definitions
             main = "Heart Disease 2")     # revised label

#> Aiming to create a new FFTrees object:
#> — Setting 'goal = bacc'
#> — Setting 'goal.chase = bacc'
#> — Setting 'goal.threshold = bacc'
#> — Setting 'max.levels = 4'
#> — Setting 'cost.outcomes = list(hi = 0, mi = 1, fa = 1, cr = 0)'
#> Successfully created a new FFTrees object.
#> Aiming to define FFTs:
#> Using 8 FFTs from 'tree.definitions' as current trees.
#> Successfully defined 8 FFTs.
#> Aiming to apply FFTs to 'train' data:
#> Successfully applied FFTs to 'train' data.
#> Aiming to apply FFTs to 'test' data:
#> Successfully applied FFTs to 'test' data.
#> Aiming to fit comparative algorithms (disable by do.comp = FALSE):
#> Successfully fitted comparative algorithms.
The resulting FFTrees object y contains the
trees and summary statistics of all eight FFTs. Although it is unlikely
that one of the newly created trees beats the automatically created
FFTs, we find that reversing the order of the first two cues has only
minimal effects on training accuracy (as measured by
bacc):
y$trees$definitions    # tree definitions

#> # A tibble: 8 × 7
#> tree nodes classes cues directions thresholds exits
#> <int> <int> <chr> <chr> <chr> <chr> <chr>
#> 1 1 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1;0;0.5
#> 2 2 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 1;0;0.5
#> 3 3 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;1;0.5
#> 4 4 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 0;1;0.5
#> 5 5 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 1; 1; 0.5
#> 6 6 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 1; 1; 0.5
#> 7 7 3 c;c;n thal;cp;ca =;=;> rd,fd;a;0 0;0;0.5
#> 8 8 3 c;c;n cp; thal; ca =;=;> a; rd,fd; 0 0;0;0.5
y$trees$stats$train    # training statistics

#> # A tibble: 8 × 20
#> tree n hi fa mi cr sens spec far ppv npv dprime
#> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 150 54 18 12 66 0.818 0.786 0.214 0.75 0.846 1.69
#> 2 2 150 55 20 11 64 0.833 0.762 0.238 0.733 0.853 1.66
#> 3 3 150 44 7 22 77 0.667 0.917 0.0833 0.863 0.778 1.79
#> 4 4 150 44 7 22 77 0.667 0.917 0.0833 0.863 0.778 1.79
#> 5 5 150 63 42 3 42 0.955 0.5 0.5 0.6 0.933 1.66
#> 6 6 150 63 42 3 42 0.955 0.5 0.5 0.6 0.933 1.66
#> 7 7 150 28 2 38 82 0.424 0.976 0.0238 0.933 0.683 1.74
#> 8 8 150 28 2 38 82 0.424 0.976 0.0238 0.933 0.683 1.74
#> # … with 8 more variables: acc <dbl>, bacc <dbl>, wacc <dbl>, cost_dec <dbl>,
#> # cost_cue <dbl>, cost <dbl>, pci <dbl>, mcu <dbl>
Note that the trees in y were sorted by their
performance on the current goal (here bacc).
For instance, the new rake tree with cue order cp; thal; ca
and exits 1; 1; 0.5 is now FFT #6. When examining its
performance on "test" data (i.e., for prediction):
# Print and plot FFT #6:
print(y, tree = 6, data = "test")
plot(y, tree = 6, data = "test")we see that it has a balanced accuracy bacc of 70%. More
precisely, its bias for predicting disease (i.e., signal or
True) yields near-perfect sensitivity (96%), but very poor specificity
(44%).
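The corresponding values can also be extracted directly from y (a sketch; row 6 of the ‘test’ statistics corresponds to FFT #6):

# Extract the 'test' statistics of FFT #6:
y$trees$stats$test[6, c("sens", "spec", "bacc")]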
If we wanted to change more aspects of x (e.g., use
different data or goal settings), we could
also create a new FFTrees object without supplying the
previous object x, as long as the FFTs defined in
tree.definitions fit the settings of
formula and data.
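For instance, a sketch of such a call (assuming that the criterion and cues referenced in tree_df also exist in the data provided) could look as follows:

# Define FFTs from scratch (without a previous FFTrees object):
z <- FFTrees(formula = diagnosis ~ .,       # criterion and predictors
             data = heartdisease,           # a different (full) dataset
             tree.definitions = tree_df,    # manual tree definitions
             main = "Heart Disease 3")      # new label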
Here is a complete list of the vignettes available in the FFTrees package:
|   | Vignette | Description |
|---|----------|-------------|
|   | Main guide: FFTrees overview | An overview of the FFTrees package |
| 1 | Tutorial: FFTs for heart disease | An example of using FFTrees() to model heart disease diagnosis |
| 2 | Accuracy statistics | Definitions of accuracy statistics used throughout the package |
| 3 | Creating FFTs with FFTrees() | Details on the main function FFTrees() |
| 4 | Manually specifying FFTs | How to directly create FFTs with my.tree without using the built-in algorithms |
| 5 | Visualizing FFTs with plot() | Plotting FFTrees objects, from full trees to icon arrays |
| 6 | Examples of FFTs | Examples of FFTs from different datasets contained in the package |