An introduction to CART
Sudheesh K Kattumannil
Indian Statistical Institute, Chennai, India.
ISI Chennai, 30 January, 2024
Examples
- Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack.
  Data: demographic, diet and clinical measurements.
- Predict the amount of glucose in the blood of a diabetic.
  Data: infrared absorption spectrum of a blood sample.
- In each case, the outcome is predicted on the basis of a set of features (e.g. diet or clinical measurements).
- The prediction model is called a learner.
Example
- Data: 4,601 email messages, each labelled email (+) or spam (-).
- Features: the relative frequencies of the 57 most commonly occurring words and punctuation marks in the message.
- Prediction goal: label future messages email (+) or spam (-).
- This is a supervised learning problem on categorical data: a classification problem.
Tree versus Linear models
Decision trees
- A graphical representation of possible solutions to a decision, based on certain conditions.
- A decision tree defines a tree-structured hierarchy of rules.
- Node types: root node, decision/internal nodes, leaf/terminal nodes.
- The root and internal nodes contain the rules.
- The leaf nodes define the predictions.
- Decision tree learning is about learning such a tree from labelled training data.
An example: Decision trees
Classification and Regression Tree: CART
- Classification: the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of observations whose category membership is known.
- Classification tree: the dependent variable is categorical.
- Regression tree: the dependent variable is continuous.
Classification Tree: Binary classification
- Assume training data in which each input has two features (x1, x2).
- First split: is x1 greater than 3?
- Given x1 > 3: is x2 greater than 3?
- Given x1 ≤ 3: is x2 greater than 1?
- Each question splits the feature space further; the resulting rectangular regions are the leaves of the tree, each labelled with a class.
Decision Trees: Predicting Baseball Players’ Salaries
- We want to predict a baseball player's Salary on the basis of Years (years played) and Hits (hits made in the previous year).
- Overall, the tree stratifies or segments the players into three regions of the predictor space:
  - players who have played for fewer than five years;
  - players who have played for five or more years and who made fewer than 118 hits last year;
  - players who have played for five or more years and who made at least 118 hits last year.
- The predicted salary for the players in each region is the mean response value of the training players in that region.
- R1 = {X | Years < 5},
- R2 = {X | Years ≥ 5, Hits < 117.5} and R3 = {X | Years ≥ 5, Hits ≥ 117.5}.
Predicting Baseball Players' Salaries: Players with Years < 5
- We use the Hitters data set to predict a baseball player's Salary based on Years and Hits.
- Consider the players having fewer than 4.5 years of experience.
- For such players, the mean log salary is 5.107,
- hence the predicted salary of these players is 1,000 × e^5.107 ≈ 165,174 dollars.
- The predicted salaries for the other two groups are 1,000 × e^5.999 ≈ 402,834 dollars and 1,000 × e^6.740 ≈ 845,346 dollars, respectively.
Building a regression tree
- There are two main steps.
- Step 1: Divide the predictor space, that is, the set of possible values of X1, X2, ..., Xp, into J distinct and non-overlapping regions R1, R2, ..., RJ.
- Step 2: For every observation that falls into region Rj, make the same prediction, namely the mean of the response values of the training observations in Rj.
- Suppose that in Step 1 we obtain two regions, R1 and R2.
- Assume that the response mean of the training observations in the first region is 10, while the response mean of the training observations in the second region is 20.
- Then, for a given observation X = x, if x ∈ R1 we predict the value 10, and if x ∈ R2 we predict the value 20.
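The two steps can be summarized in one display (a standard restatement of the slide, written out here for clarity): the fitted tree predicts, for any x, the mean response of the training observations in the region that contains x,
\[
\hat{f}(x) \;=\; \sum_{j=1}^{J} \hat{y}_{R_j}\,\mathbf{1}\{x \in R_j\},
\qquad
\hat{y}_{R_j} \;=\; \frac{1}{\#\{i : x_i \in R_j\}} \sum_{i:\, x_i \in R_j} y_i .
\]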
Construction of the regions
- Divide the predictor space into high-dimensional rectangles, or boxes.
- Find boxes R1, ..., RJ that minimize the RSS, given by
\[
\sum_{j=1}^{J} \sum_{i \in R_j} \bigl(y_i - \hat{y}_{R_j}\bigr)^2 ,
\]
where ŷ_{R_j} is the mean response for the training observations within the jth box.
Construction of the regions
- In binary splitting, we first select the predictor Xj and the cut point s such that splitting the predictor space into the regions {X | Xj < s} and {X | Xj ≥ s} leads to the greatest possible reduction in RSS.
- That is, we consider all predictors X1, ..., Xp and all possible values of the cut point s for each predictor, and then choose the predictor and cut point such that the resulting tree has the lowest RSS.
- That is, for any j and s, we define the pair of half-planes R1(j, s) = {X | Xj < s} and R2(j, s) = {X | Xj ≥ s}.
- We seek the values of j and s that minimize
\[
\sum_{i:\, x_i \in R_1(j,s)} \bigl(y_i - \hat{y}_{R_1}\bigr)^2 \;+\; \sum_{i:\, x_i \in R_2(j,s)} \bigl(y_i - \hat{y}_{R_2}\bigr)^2 ,
\]
where ŷ_{R_1} and ŷ_{R_2} are the mean responses for the training observations in R1(j, s) and R2(j, s), respectively.
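This greedy search over (j, s) can be sketched directly in code. Below is a minimal NumPy illustration (the function name and structure are mine, not from the slides):

import numpy as np

def best_split(X, y):
    """Greedy search for the (predictor, cut point) pair that minimizes the
    residual sum of squares of the two resulting regions."""
    n, p = X.shape
    best = (None, None, np.inf)          # (j, s, RSS)
    for j in range(p):                   # consider every predictor X_j
        for s in np.unique(X[:, j]):     # and every candidate cut point s
            left = y[X[:, j] < s]        # region R1(j, s) = {X | X_j < s}
            right = y[X[:, j] >= s]      # region R2(j, s) = {X | X_j >= s}
            if len(left) == 0 or len(right) == 0:
                continue                 # skip splits that leave a region empty
            rss = ((left - left.mean()) ** 2).sum() + \
                  ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

Applied recursively to each resulting region, this is exactly recursive binary splitting.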
Tree with five regions
[Figure: a partition of the predictor space into five regions and the corresponding tree.]
Overfitting versus underfitting
- Underfitting occurs when a model is not able to make accurate predictions even on the training data; accordingly, it does not have the capacity to generalize to new data.
- Underfitting models tend to perform poorly on both the training and the test sets.
- Underfitting models usually have high bias and low variance.
- A model is considered overfitting when it does extremely well on the training data but fails to perform at the same level on the validation/test data.
- Overfitting models usually have low bias and high variance.
Tree Pruning
- The process described above (recursive binary splitting) may produce good predictions on the training set, but it is likely to overfit the data, leading to poor test set performance.
- This is because the resulting tree might be too complex.
- A smaller tree with fewer splits (that is, fewer regions R1, ..., RJ) might lead to lower variance and better interpretation at the cost of a little bias.
- A strategy is to grow a very large tree T0 and then prune it back in order to obtain a subtree.
- How do we determine the best way to prune the tree?
- Our goal is to select the subtree that leads to the lowest test error rate.
Tree Pruning
- Cost complexity pruning, also known as weakest link pruning, gives us a way to do just this.
- Rather than considering every possible subtree, we consider a sequence of trees indexed by a nonnegative tuning parameter α.
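For reference, the criterion used in cost complexity pruning (as given in James et al., 2013) is: for each α ≥ 0, find the subtree T ⊆ T0 that minimizes
\[
\sum_{m=1}^{|T|} \; \sum_{i:\, x_i \in R_m} \bigl(y_i - \hat{y}_{R_m}\bigr)^2 \;+\; \alpha\,|T| ,
\]
where |T| is the number of terminal nodes of T and R_m is the region corresponding to the mth terminal node. With α = 0 we recover T0; increasing α penalizes larger trees.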
Tree Pruning: Algorithm
- Step 1: Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
- Step 2: Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
- Step 3: Use K-fold cross-validation to choose α (see next slide).
- Step 4: Return the subtree from Step 2 that corresponds to the chosen value of α.
How to choose α
- Use K-fold cross-validation to choose α. That is, divide the training observations into K folds. For each k = 1, ..., K:
  - Repeat Steps 1 and 2 on all but the kth fold of the training data.
  - Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.
- Average the results for each value of α, and pick α to minimize the average error
\[
\mathrm{CV}(\alpha) \;=\; \frac{1}{K} \sum_{k=1}^{K} \mathrm{MSE}_k(\alpha).
\]
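As a concrete illustration, here is a minimal sketch assuming scikit-learn is available; its ccp_alpha parameter plays the role of α, and cost_complexity_pruning_path returns the candidate α sequence (the function name and the 5-fold choice are mine):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

def choose_alpha(X, y):
    # Grow a large tree and obtain the sequence of alphas indexing its subtrees.
    path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
    cv_mse = []
    for a in path.ccp_alphas:
        tree = DecisionTreeRegressor(ccp_alpha=a, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5,
                                 scoring="neg_mean_squared_error")
        cv_mse.append(-scores.mean())        # average MSE over the K folds
    return path.ccp_alphas[int(np.argmin(cv_mse))]   # alpha with lowest CV error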
Classification Trees
- A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.
- We predict that each observation belongs to the most commonly occurring class of the training observations in the region to which it belongs.
- A natural alternative to RSS is the classification error rate.
- The Gini index is defined by
\[
G \;=\; \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}),
\]
where p̂_{mk} represents the proportion of training observations in the mth region that are from the kth class; G is a measure of total variance across the K classes.
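A tiny NumPy sketch (my own illustration, not from the slides) of computing G from the class labels falling in one node:

import numpy as np

def gini_index(labels):
    """Gini index G = sum_k p_k (1 - p_k) of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # class proportions p_hat_mk
    return float(np.sum(p * (1 - p)))

# Example: a pure node has G = 0; an evenly mixed binary node has G = 0.5.
# gini_index([0, 0, 0, 0])  -> 0.0
# gini_index([0, 1, 0, 1])  -> 0.5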
Advantages and Disadvantages of Trees
- Trees are very easy to explain to people.
- Decision trees more closely mirror human decision-making than do classical regression and classification approaches.
- Trees can be displayed graphically, and are easily interpreted even by a non-expert.
- Trees can easily handle qualitative predictors without the need to create dummy variables.
- Trees generally do not have the same level of predictive accuracy as other regression and classification approaches (such as GLMs).
- Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
Bootstrap aggregation or bagging
- Given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean Z̄ of the observations is σ²/n.
- A natural way to increase the prediction accuracy of a statistical learning method is therefore: take many training sets from the population, build a separate prediction model on each, and average the resulting predictions.
Bootstrap aggregation or bagging
- We could compute f̂^1(x), f̂^2(x), ..., f̂^B(x) using B separate training sets and average them in order to obtain a single low-variance statistical learning model
\[
\hat{f}_{\mathrm{avg}}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{\,b}(x).
\]
- In practice we do not have B separate training sets, so we generate B different bootstrapped training data sets, fit the model on each, and average:
\[
\hat{f}_{\mathrm{bag}}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{\,*b}(x).
\]
Bagging: Regression trees
- To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets.
- We average the resulting predictions.
- Averaging these B trees reduces the variance.
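A minimal sketch of bagged regression trees, assuming scikit-learn and NumPy arrays for X and y (the function name and defaults are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, X_test, B=100, random_state=0):
    """Fit B regression trees on bootstrap resamples and average their predictions."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap sample, with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                       # f_hat_bag(x): average of the B trees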
Bagging: Classification
- For a given test observation, we can record the class predicted by each of the B trees and take a majority vote:
- the overall prediction is the most commonly occurring class among the B predictions.
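The majority vote can be computed, for example, as follows (an illustrative helper, assuming nonnegative integer class labels):

import numpy as np

def majority_vote(class_preds):
    """class_preds: array of shape (B, n_test) with integer class labels.
    Returns, for each test point, the most commonly predicted class."""
    return np.array([np.argmax(np.bincount(col)) for col in class_preds.T])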
Random Forests
- When we build bagged decision trees, every split in every tree considers all p predictors.
- Suppose that there is one very strong predictor in the data set, along with a number of other moderately strong predictors.
- Then, in the bagged trees, most of the trees will use this strong predictor in the top split.
- Hence the predictions from the bagged trees will be highly correlated, and averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging uncorrelated quantities.
- Random forests overcome this problem by forcing each split to consider only a subset of the predictors.
- At each split, a random sample of m predictors is chosen as split candidates from the full set of p predictors (typically m ≈ √p in classification and m ≈ p/3 in regression).
Number of parameters in the Random Forest algorithm
- The main tuning parameters are the number of trees used in the forest (ntree) and the number of predictors considered at each split (mtry).
- A tree with a low error rate is a strong classifier.
- As mtry decreases, both the correlation between trees and the strength of the individual trees decrease.
- We therefore need to find an optimal mtry.
Number of parameters in the Random Forest algorithm: choosing ntree
- Set mtry to a default value.
- Try different ntree values.
- Record the OOB (out-of-bag) error rate for each.
- Choose the ntree value at which the OOB error rate stabilizes and reaches its minimum.
- For a visual introduction, see http://www.r2d3.us/visual-intro-to-machine-learning-part-1/
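A hedged scikit-learn sketch of this tuning recipe (in scikit-learn, ntree corresponds to n_estimators and mtry to max_features; the grid of ntree values below is arbitrary):

from sklearn.ensemble import RandomForestClassifier

def oob_curve(X, y, ntree_grid=(50, 100, 200, 400, 800)):
    """For each candidate ntree, fit a forest with a default mtry
    (max_features="sqrt") and record the out-of-bag error rate."""
    errors = {}
    for ntree in ntree_grid:
        rf = RandomForestClassifier(n_estimators=ntree,
                                    max_features="sqrt",   # default mtry ~ sqrt(p)
                                    oob_score=True,
                                    random_state=0)
        rf.fit(X, y)
        errors[ntree] = 1.0 - rf.oob_score_                # OOB error rate
    return errors

Pick the smallest ntree beyond which the recorded OOB error no longer decreases appreciably.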
References
- Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
- Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer.
- James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Springer.