An introduction to CART
Sudheesh K Kattumannil
Indian Statistical Institute, Chennai, India.
ISI Chennai, 30 January, 2024
Examples
- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
  Data: demographic, diet and clinical measurements.
- Predict the amount of glucose in the blood of a diabetic.
  Data: infrared absorption spectrum of a blood sample.
- In each case, the outcome is predicted on the basis of a set of features (e.g. diet or clinical measurements).
- The prediction model is called a learner.
Example
- Data: 4,601 email messages, each labelled email (+) or spam (-).
- Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
- Prediction goal: label future messages email (+) or spam (-).
- This is a supervised learning problem on categorical data: a classification problem.
Tree versus Linear models
Decision trees
- A graphical representation of possible solutions to a decision, based on certain conditions.
- A decision tree defines a tree-structured hierarchy of rules.
- Node types: root node, decision/internal nodes, leaf/terminal nodes.
- The root and internal nodes contain the rules.
- The leaf nodes define the predictions.
- Decision tree learning is about learning such a tree from labelled training data.
An example: Decision trees
Classification and Regression Tree: CART
- Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known.
- Classification tree: the dependent variable is categorical.
- Regression tree: the dependent variable is continuous.
Classification Tree: Binary classification
- Assume training data in which each input has two features (x1, x2).
- First split: is x1 greater than 3?
- Given x1 > 3: is x2 greater than 3?
- Given x1 ≤ 3: is x2 greater than 1?
- Each question splits the feature space further; the resulting rectangular regions are the leaves of the tree, each labelled with a class.
Decision Trees: Predicting Baseball Players’ Salaries
- We want to predict a baseball player's Salary on the basis of Years (years played) and Hits (hits made in the previous year).
- Overall, the tree stratifies or segments the players into three regions of the predictor space:
  - players who have played for fewer than five years;
  - players who have played for five or more years and who made fewer than 118 hits last year;
  - players who have played for five or more years and who made at least 118 hits last year.
- The predicted salary for the players in each region is the mean response value of the training players in that region.
- R1 = {X | Years < 5},
- R2 = {X | Years ≥ 5, Hits < 117.5} and R3 = {X | Years ≥ 5, Hits ≥ 117.5}.
Predicting Baseball Players' Salaries: Players with Years < 5
- We use the Hitters data set to predict a baseball player's Salary based on Years and Hits.
- Consider the players having fewer than 4.5 years of experience.
- For such players, the mean log salary is 5.107,
- hence the predicted salary of these players is 1,000 × e^5.107 ≈ 165,174 dollars.
- The predicted salaries for the other two groups are 1,000 × e^5.999 ≈ 402,834 dollars and 1,000 × e^6.740 ≈ 845,346 dollars, respectively.
Building a regression tree
- There are two main steps.
- Step 1: Divide the predictor space, that is, the set of possible values of X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
- Step 2: For every observation that falls into region Rj, make the same prediction, namely the mean of the response values of the training observations in Rj.
- Suppose that in Step 1 we obtain two regions, R1 and R2.
- Assume that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20.
- Then, for a given observation X = x, if x ∈ R1 we predict the value 10, and if x ∈ R2 we predict the value 20.
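The two steps can be summarized in one display (a standard restatement of the slide, written out here for clarity): the fitted tree predicts, for any x, the mean response of the training observations in the region that contains x,
\[
\hat{f}(x) \;=\; \sum_{j=1}^{J} \hat{y}_{R_j}\,\mathbf{1}\{x \in R_j\},
\qquad
\hat{y}_{R_j} \;=\; \frac{1}{\#\{i : x_i \in R_j\}} \sum_{i:\, x_i \in R_j} y_i .
\]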
Construction of the regions
- Divide the predictor space into high-dimensional rectangles, or boxes.
- Find boxes R1, ..., RJ that minimize the RSS, given by
\[
\sum_{j=1}^{J} \sum_{i \in R_j} \bigl(y_i - \hat{y}_{R_j}\bigr)^2 ,
\]
where ŷ_{R_j} is the mean response for the training observations within the jth box.
Construction of the regions
- In binary splitting, we first select the predictor Xj and the cut point s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in RSS.
- That is, we consider all predictors X1, ..., Xp and all possible values of the cut point s for each predictor, and then choose the predictor and cut point such that the resulting tree has the lowest RSS.
- That is, for any j and s, we define the pair of half-planes R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}.
- We seek the values of j and s that minimize
\[
\sum_{i:\, x_i \in R_1(j,s)} \bigl(y_i - \hat{y}_{R_1}\bigr)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \bigl(y_i - \hat{y}_{R_2}\bigr)^2 ,
\]
where ŷ_{R_1} and ŷ_{R_2} are the mean responses for the training observations in R1(j, s) and R2(j, s), respectively.
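This greedy search over (j, s) can be sketched directly in code. Below is a minimal NumPy illustration (the function name and structure are mine, not from the slides):

import numpy as np

def best_split(X, y):
    """Greedy search for the (predictor, cut point) pair that minimizes the
    residual sum of squares of the two resulting regions."""
    n, p = X.shape
    best = (None, None, np.inf)          # (j, s, RSS)
    for j in range(p):                   # consider every predictor X_j
        for s in np.unique(X[:, j]):     # and every candidate cut point s
            left = y[X[:, j] < s]        # region R1(j, s) = {X | X_j < s}
            right = y[X[:, j] >= s]      # region R2(j, s) = {X | X_j >= s}
            if len(left) == 0 or len(right) == 0:
                continue                 # skip splits that leave a region empty
            rss = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

Applied recursively to each resulting region, this is exactly recursive binary splitting.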
Tree with five regions
[Figure: a partition of the predictor space into five regions and the corresponding tree.]
Overfitting versus underfitting
- Underfitting occurs when a model is not able to make accurate predictions even on the training data; accordingly, it does not have the capacity to generalize to new data.
- Underfitting models tend to perform poorly on both the training and the test sets.
- Underfitting models usually have high bias and low variance.
- A model is considered overfitting when it does extremely well on the training data but fails to perform at the same level on the validation/test data.
- Overfitting models usually have low bias and high variance.
Tree Pruning
- The process described above (recursive binary splitting) may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance.
- This is because the resulting tree might be too complex.
- A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation at the cost of a little bias.
- A strategy is to grow a very large tree T0 and then prune it back in order to obtain a subtree.
- How do we determine the best way to prune the tree?
- Our goal is to select the subtree that leads to the lowest test error rate.
Tree Pruning
- Cost complexity pruning, also known as weakest link pruning, gives us a way to do just this.
- Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.
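For reference, the criterion used in cost complexity pruning (as given in James et al., 2013) is: for each α ≥ 0, find the subtree T ⊆ T0 that minimizes
\[
\sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} \bigl(y_i - \hat{y}_{R_m}\bigr)^2 \;+\; \alpha\,|T| ,
\]
where |T| is the number of terminal nodes of T and R_m is the region corresponding to the mth terminal node. With α = 0 we recover T0; increasing α penalizes larger trees.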
Tree Pruning: Algorithm
- Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
- Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
- Step 3: Use K-fold cross-validation to choose α (see next slide).
- Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α.
How to choose α
- Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
  - Repeat Steps 1 and 2 on all but the kth fold of the training data.
  - Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
- Average the results for each value of α, and pick α to minimize the average error
\[
\mathrm{CV}(\alpha) \;=\; \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k(\alpha).
\]
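As a concrete illustration, here is a minimal sketch assuming scikit-learn is available; its ccp_alpha parameter plays the role of α, and cost_complexity_pruning_path returns the candidate α sequence (the function name and the 5-fold choice are mine):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def choose_alpha(X, y):
    # Grow a large tree and obtain the sequence of alphas indexing its subtrees.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_mse = []
    for a in path.ccp_alphas:
        tree = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5,
                                 scoring="neg_mean_squared_error")
        cv_mse.append(-scores.mean())        # average MSE over the K folds
    return path.ccp_alphas[int(np.argmin(cv_mse))]   # alpha with lowest CV error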
Classification Trees
- A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
- We predict that each observation belongs to the most commonly occurring class of the training observations in the region to which it belongs.
- A natural alternative to RSS is the classification error rate.
- The Gini index is defined by
\[
G \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}),
\]
where p̂_{mk} represents the proportion of training observations in the mth region that are from the kth class; G is a measure of total variance across the K classes.
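A tiny NumPy sketch (my own illustration, not from the slides) of computing G from the class labels falling in one node:

import numpy as np

def gini_index(labels):
    """Gini index G = sum_k p_k (1 - p_k) of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_hat_mk
    return float(np.sum(p * (1 - p)))

# Example: a pure node has G = 0; an evenly mixed binary node has G = 0.5.
# gini_index([0, 0, 0, 0])  -> 0.0
# gini_index([0, 1, 0, 1])  -> 0.5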
Advantages and Disadvantages of Trees
- Trees are very easy to explain to people.
- Decision trees more closely mirror human decision-making than do classical regression and classification approaches.
- Trees can be displayed graphically, and are easily interpreted even by a non-expert.
- Trees can easily handle qualitative predictors without the need to create dummy variables.
- Trees generally do not have the same level of predictive accuracy as other regression and classification approaches (such as GLMs).
- Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
Bootstrap aggregation or bagging
- Given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is σ²/n.
- A natural way to increase the prediction accuracy of a statistical learning method is therefore: take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.
Bootstrap aggregation or bagging
- We could compute f̂^1(x), f̂^2(x), ..., f̂^B(x) using B separate training sets and average them in order to obtain a single low-variance statistical learning model
\[
\hat{f}_{\mathrm{avg}}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{\,b}(x).
\]
- In practice we do not have B separate training sets, so we generate B different bootstrapped training data sets, fit the model on each, and average:
\[
\hat{f}_{\mathrm{bag}}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{\,*b}(x).
\]
Bagging: Regression trees
- To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets.
- We average the resulting predictions.
- Averaging these B trees reduces the variance.
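A minimal sketch of bagged regression trees, assuming scikit-learn and NumPy arrays for X and y (the function name and defaults are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, X_test, B=100, random_state=0):
    """Fit B regression trees on bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample, with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                       # f_hat_bag(x): average of the B trees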
Bagging: Classification
- For a given test observation, we can record the class predicted by each of the B trees and take a majority vote:
- the overall prediction is the most commonly occurring class among the B predictions.
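The majority vote can be computed, for example, as follows (an illustrative helper, assuming nonnegative integer class labels):

import numpy as np

def majority_vote(class_preds):
    """class_preds: array of shape (B, n_test) with integer class labels.
    Returns, for each test point, the most commonly predicted class."""
    return np.array([np.argmax(np.bincount(col)) for col in class_preds.T])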
Random Forests
- When we build bagged decision trees, every split in every tree considers all p predictors.
- Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
- Then, in the bagged trees, most of the trees will use this strong predictor in the top split.
- Hence the predictions from the bagged trees will be highly correlated, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities.
- Random forests overcome this problem by forcing each split to consider only a subset of the predictors.
- At each split, a random sample of m predictors is chosen as split candidates from the full set of p predictors (typically m ≈ √p in classification and m ≈ p/3 in regression).
Number of parameters in the Random Forest algorithm
- The main tuning parameters are the number of trees used in the forest (ntree) and the number of predictors considered at each split (mtry).
- A tree with a low error rate is a strong classifier.
- As mtry decreases, both the correlation between trees and the strength of the individual trees decrease.
- We therefore need to find an optimal mtry.
Number of parameters in the Random Forest algorithm: choosing ntree
- Set mtry to a default value.
- Try different ntree values.
- Record the OOB (out-of-bag) error rate for each.
- Choose the ntree value at which the OOB error rate stabilizes and reaches its minimum.
- For a visual introduction, see http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
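A hedged scikit-learn sketch of this tuning recipe (in scikit-learn, ntree corresponds to n_estimators and mtry to max_features; the grid of ntree values below is arbitrary):

from sklearn.ensemble import RandomForestClassifier

def oob_curve(X, y, ntree_grid=(50, 100, 200, 400, 800)):
    """For each candidate ntree, fit a forest with a default mtry
    (max_features="sqrt") and record the out-of-bag error rate."""
    errors = {}
    for ntree in ntree_grid:
        rf = RandomForestClassifier(n_estimators=ntree,
                                    max_features="sqrt",   # default mtry ~ sqrt(p)
                                    oob_score=True,
                                    random_state=0)
        rf.fit(X, y)
        errors[ntree] = 1.0 - rf.oob_score_                # OOB error rate
    return errors

Pick the smallest ntree beyond which the recorded OOB error no longer decreases appreciably.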
References
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
- James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.