Decision Trees

An Early Classifier

Jason Corso

SUNY at Buffalo

J. Corso (SUNY at Buffalo) Trees 1/1

Introduction to Non-Metric Methods

This chapter covers problems involving nominal data, that is, data that are discrete and without any natural notion of similarity or even ordering.
For example (DHS), some teeth are small and fine (as in baleen
whales) for straining tiny prey from the sea; others (as in sharks) come
in multiple rows; other sea creatures have tusks (as in walruses), yet
others lack teeth altogether (as in squid). There is no clear notion of
similarity for this information about teeth.
Most of the other methods we study will involve real-valued feature
vectors with clear metrics.
We may also consider problems involving data tuples and data strings, and for recognizing these, decision trees and string grammars, respectively.

Decision Trees

20 Questions

I am thinking of a person. Ask me up to 20 yes/no questions to determine who this person is.
Consider your questions wisely...
How did you ask the questions?
What underlying measure led you to the questions, if any?
Most importantly, iterative yes/no questions of this sort require no
metric and are well suited for nominal data.

This sequence of questions is a decision tree...

[Figure 8.1 (DHS): a decision tree for classifying fruit.]

Color? (root, level 0)
  green: Size? (level 1)
    big: Watermelon
    medium: Apple
    small: Grape
  yellow: Shape? (level 1)
    round: Size? (level 2)
      big: Grapefruit
      small: Lemon
    thin: Banana
  red: Size? (level 1)
    medium: Apple
    small: Taste? (level 2)
      sweet: Cherry
      sour: Grape

FIGURE 8.1. Classification in a basic decision tree proceeds from top to bottom. The questions asked at each node concern a particular property of the pattern, and the downward links correspond to the possible values. Successive nodes are visited until a terminal or leaf node is reached, where the category label is read. Note that the same question, Size?, appears in different places in the tree and that different questions ...

Decision Trees 101

The root node of the tree, displayed at the top, is connected by successive branches to the other nodes.
The connections continue until the leaf nodes are reached, implying
a decision.
The classification of a particular pattern begins at the root node,
which queries a particular property (selected during tree learning).
The links off of the root node correspond to different possible values
of the property.
We follow the link corresponding to the appropriate value of the
pattern and continue to a new node, at which we check the next
property. And so on.
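
The traversal just described can be sketched in a few lines of Python. This is only an illustration: the nested-dict encoding, the names `FRUIT_TREE` and `classify`, and the lowercase property names are assumptions made here, and the tree structure is a reading of the fruit tree in Figure 8.1.

```python
# Hypothetical nested-dict encoding of the fruit tree from Figure 8.1.
# Internal node: {"query": property name, "branches": {value: subtree}}
# Leaf node: a plain string holding the category label.
FRUIT_TREE = {
    "query": "color",
    "branches": {
        "green": {
            "query": "size",
            "branches": {"big": "Watermelon", "medium": "Apple", "small": "Grape"},
        },
        "yellow": {
            "query": "shape",
            "branches": {
                "round": {
                    "query": "size",
                    "branches": {"big": "Grapefruit", "small": "Lemon"},
                },
                "thin": "Banana",
            },
        },
        "red": {
            "query": "size",
            "branches": {
                "medium": "Apple",
                "small": {
                    "query": "taste",
                    "branches": {"sweet": "Cherry", "sour": "Grape"},
                },
            },
        },
    },
}


def classify(tree, pattern):
    """Follow links from the root until a leaf (a string label) is reached."""
    node = tree
    while isinstance(node, dict):
        value = pattern[node["query"]]   # check the property queried at this node
        node = node["branches"][value]   # follow the link for the pattern's value
    return node


print(classify(FRUIT_TREE, {"color": "red", "size": "small", "taste": "sweet"}))
```

Note that a pattern only needs values for the properties queried along its own path; a yellow, thin fruit is labeled without ever being asked about taste.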

When to Consider Decision Trees

Instances are wholly or partly described by attribute-value pairs.

Target function is discrete valued.
Disjunctive hypothesis may be required.
Possibly noisy training data.
Examples
Equipment or medical diagnosis.
Credit risk analysis.
Modeling calendar scheduling preferences.

Decision Trees CART

CART for Decision Tree Learning

Assume we have a set D of labeled training data and we have decided on a set of properties that can be used to discriminate patterns.
Now, we want to learn how to organize these properties into a
decision tree to maximize accuracy.
Any decision tree will progressively split the data into subsets.
If at any point all of the elements of a particular subset are of the
same category, then we say this node is pure and we can stop
splitting.
Unfortunately, this rarely happens, and we have to decide whether to stop splitting and accept an imperfect decision or instead to select another property and grow the tree further.

The basic CART strategy for recursively defining the tree is the following: given the data represented at a node, either declare that node to be a leaf or find another property to use to split the data into subsets.
There are 6 general kinds of questions that arise:
1. How many branches will be selected from a node?
2. Which property should be tested at a node?
3. When should a node be declared a leaf?
4. How can we prune a tree once it has become too large?
5. If a leaf node is impure, how should the category label be assigned?
6. How should missing data be handled?

Number of Splits

The number of splits at a node, or its branching factor B, is generally set by the designer (as a function of the way the test is selected) and can vary throughout the tree.
Note that any split with a factor greater than 2 can easily be
converted into a sequence of binary splits.
So, DHS focuses on only binary tree learning.
However, in certain circumstances the selection of a test at a node, or its evaluation during inference, may be computationally expensive, and a 3- or 4-way split may then be more desirable for computational reasons.
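
To make the conversion of a multiway split into binary splits concrete, here is a minimal sketch. The function name `to_binary` and the dict layout with `"test"`, `"yes"`, and `"no"` keys are assumptions made for illustration, not part of CART or DHS.

```python
def to_binary(query, branches):
    """Turn one B-way split {value: subtree} into a chain of B-1 yes/no tests."""
    items = list(branches.items())
    _, last_subtree = items[-1]
    node = last_subtree  # the final "no" branch needs no further test
    # Wrap the remaining values around it, last to first, so the first value
    # ends up being tested at the top of the chain.
    for value, subtree in reversed(items[:-1]):
        node = {"test": (query, value), "yes": subtree, "no": node}
    return node


three_way = {"green": "G-subtree", "yellow": "Y-subtree", "red": "R-subtree"}
print(to_binary("color", three_way))
```

A 3-way split on color becomes two binary tests: "is color green?" and, on the "no" branch, "is color yellow?".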

Query Selection and Node Impurity

The fundamental principle underlying tree creation is that of
simplicity: we prefer decisions that lead to a simple, compact
tree with few nodes.
We seek a property query T at each node N that makes the data
reaching the immediate descendant nodes as “pure” as possible.
Let i(N) denote the impurity of a node N.
In all cases, we want i(N) to be 0 if all of the patterns that reach the node bear the same category, and to be large if the categories are equally represented.

One such measure is the entropy impurity:

$$ i(N) = -\sum_j P(\omega_j) \log_2 P(\omega_j) \tag{1} $$

where $P(\omega_j)$ is the fraction of patterns at node N that belong to category $\omega_j$.
It will be minimized for a node that has elements of only one class (pure).

For the two-category case, a useful definition of impurity is the variance impurity:

$$ i(N) = P(\omega_1) P(\omega_2) \tag{2} $$

Its generalization to the multi-class case is the Gini impurity:

$$ i(N) = \sum_{i \neq j} P(\omega_i) P(\omega_j) = \frac{1}{2} \left[ 1 - \sum_j P^2(\omega_j) \right] \tag{3} $$

which is the expected error rate at node N if the category is selected randomly from the class distribution present at the node.
The misclassification impurity measures the minimum probability that a training pattern would be misclassified at N:

$$ i(N) = 1 - \max_j P(\omega_j) \tag{4} $$
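
The impurity measures can be compared numerically with a short sketch. The function names are assumptions made here. The entropy function uses the standard definition $-\sum_j P(\omega_j) \log_2 P(\omega_j)$, and the Gini function follows the right-hand side of Eq. (3) including the factor 1/2, so that it reduces to Eq. (2) for two classes. Many references define Gini impurity without that factor.

```python
import math


def entropy_impurity(probs):
    """Standard entropy in bits; terms with P = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gini_impurity(probs):
    """Right-hand side of Eq. (3), including the factor 1/2."""
    return 0.5 * (1.0 - sum(p * p for p in probs))


def misclassification_impurity(probs):
    """Eq. (4): one minus the probability of the most common class."""
    return 1.0 - max(probs)


for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(
        probs,
        round(entropy_impurity(probs), 3),
        round(gini_impurity(probs), 3),
        round(misclassification_impurity(probs), 3),
    )
```

All three are zero for a pure node and largest at equal class frequencies, which is the behavior required of i(N).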

[Figure: impurity i(P) plotted against P from 0 to 1 for the two-category case, with curves for the entropy, Gini/variance, and misclassification impurities.]
For the two-category case, the impurity functions peak at equal class frequencies, and the variance and the Gini impurity functions are identical.


Query Selection

Key Question: Given a partial tree down to node N, what feature s should we choose for the property test T?
The obvious heuristic is to choose the feature that yields as big a decrease in the impurity as possible.
The impurity gradient is

$$ \Delta i(N) = i(N) - P_L \, i(N_L) - (1 - P_L) \, i(N_R) \tag{5} $$

where $N_L$ and $N_R$ are the left and right descendants, respectively, and $P_L$ is the fraction of data that will go to the left sub-tree when property T is used.
The strategy is then to choose the feature that maximizes $\Delta i(N)$.
If the entropy impurity is used, this corresponds to choosing the feature that yields the highest information gain.
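
As a rough sketch of this selection rule, the code below evaluates Eq. (5) with the entropy impurity and searches over candidate tests of the form "feature == value". That family of tests, the names `impurity_drop` and `best_split`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_drop(parent, left, right, impurity=entropy):
    """Eq. (5): i(N) - P_L i(N_L) - (1 - P_L) i(N_R)."""
    p_left = len(left) / len(parent)
    return impurity(parent) - p_left * impurity(left) - (1 - p_left) * impurity(right)


def best_split(rows, labels):
    """Try every binary test 'feature == value' and keep the largest drop."""
    best_drop, best_test = 0.0, None
    for feature in rows[0]:
        for value in {r[feature] for r in rows}:
            left = [y for r, y in zip(rows, labels) if r[feature] == value]
            right = [y for r, y in zip(rows, labels) if r[feature] != value]
            if not left or not right:
                continue  # not a real split
            drop = impurity_drop(labels, left, right)
            if drop > best_drop:
                best_drop, best_test = drop, (feature, value)
    return best_drop, best_test


rows = [
    {"color": "red", "size": "big"},
    {"color": "red", "size": "small"},
    {"color": "green", "size": "big"},
    {"color": "green", "size": "small"},
]
labels = ["a", "a", "b", "b"]
print(best_split(rows, labels))
```

In this toy data, color separates the classes perfectly (a drop of 1 bit), while size leaves both children as mixed as the parent (a drop of 0), so a test on color is chosen.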

What can we say about this strategy?

For the binary case, it yields a one-dimensional optimization problem (which may have non-unique optima).
In the higher branching factor case, it would yield a
higher-dimensional optimization problem.
In multi-class binary tree creation, we would want to use the twoing
criterion. The goal is to find the split that best separates groups of
the c categories. A candidate “supercategory” C1 consists of all
patterns in some subset of the categories and C2 has the remainder.
When searching for the feature s, we also need to search over possible
category groupings.
This is a local, greedy optimization strategy.
Hence, there is no guarantee that we have either the global optimum
(in classification accuracy) or the smallest tree.

A Note About Multiway Splits

In the case of selecting a multiway split with branching factor B, the following is the direct generalization of the impurity gradient function:

$$ \Delta i(s) = i(N) - \sum_{k=1}^{B} P_k \, i(N_k) \tag{6} $$

This direct generalization is biased toward higher branching factors.
To see this, consider the uniform splitting case.
So, we need to normalize each:

$$ \Delta i_B(s) = \frac{\Delta i(s)}{-\sum_{k=1}^{B} P_k \log P_k} \tag{7} $$

And then we can again choose the feature that maximizes this normalized criterion.
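
The bias and its correction can be seen in a small numerical sketch. It assumes the entropy impurity and base-2 logarithms in the denominator of Eq. (7); the function names and the toy data are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def multiway_drop(parent, children, impurity=entropy):
    """Eq. (6): i(N) - sum_k P_k i(N_k), with P_k the fraction sent to child k."""
    n = len(parent)
    return impurity(parent) - sum(len(c) / n * impurity(c) for c in children)


def normalized_drop(parent, children, impurity=entropy):
    """Eq. (7): Eq. (6) divided by -sum_k P_k log P_k (log base 2 assumed)."""
    n = len(parent)
    split_info = -sum(len(c) / n * math.log2(len(c) / n) for c in children if c)
    if split_info == 0:
        return 0.0  # everything went to one child: not a real split
    return multiway_drop(parent, children, impurity) / split_info


parent = ["a"] * 4 + ["b"] * 4
binary = [["a"] * 4, ["b"] * 4]      # one clean 2-way split
eight_way = [[y] for y in parent]    # every sample in its own child

print(multiway_drop(parent, binary), multiway_drop(parent, eight_way))
print(normalized_drop(parent, binary), normalized_drop(parent, eight_way))
```

Both splits reach pure children, so Eq. (6) rates them equally at 1 bit. After normalization, the 2-way split scores 1 and the 8-way split scores 1/3, so the simpler split is preferred.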
When to Stop Splitting?

If we continue to grow the tree until each leaf node has its lowest impurity (in the extreme, just one sample datum per leaf), then we will likely have overfit the training data. Such a tree will most likely not generalize well.
Conversely, if we stop growing the tree too early, the error on the training data will not be sufficiently low and performance will again suffer.
So, how do we decide when to stop splitting?
1. Cross-validation...
2. Threshold on the impurity gradient.
3. Incorporate a tree-complexity term and minimize.
4. Statistical significance of the impurity gradient.

Stopping by Thresholding the Impurity Gradient

Splitting is stopped if the best candidate split at a node reduces the impurity by less than the preset amount, β:

$$ \max_s \Delta i(s) \le \beta \tag{8} $$

Benefit 1: Unlike cross-validation, the tree is trained on the complete training data set.
Benefit 2: Leaf nodes can lie in different levels of the tree, which is desirable whenever the complexity of the data varies throughout the range of values.
Drawback: But, how do we set the value of the threshold β?

Stopping with a Complexity Term

Define a new global criterion function

$$ \alpha \cdot \text{size} + \sum_{\text{leaf nodes}} i(N) \tag{9} $$

which trades complexity for accuracy. Here, size could represent the number of nodes or links and α is some positive constant.
The strategy is then to split until a minimum of this global criterion function has been reached.
Given the entropy impurity, this global measure is related to the minimum description length principle.
The sum of the impurities at the leaf nodes is a measure of uncertainty in the training data given the model represented by the tree.
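
A minimal sketch of evaluating this criterion follows. It assumes that size counts all nodes, that the entropy impurity is used, and that a tree is encoded as either a list of training labels (a leaf) or a dict with a `"children"` list. These encodings and names are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def count_nodes(tree):
    """Size term: total number of nodes (a leaf is a plain list of labels)."""
    if isinstance(tree, list):
        return 1
    return 1 + sum(count_nodes(child) for child in tree["children"])


def leaf_impurity_sum(tree, impurity=entropy):
    """Sum of i(N) over the leaf nodes."""
    if isinstance(tree, list):
        return impurity(tree)
    return sum(leaf_impurity_sum(child, impurity) for child in tree["children"])


def global_criterion(tree, alpha, impurity=entropy):
    """Eq. (9): alpha * size + sum of leaf impurities."""
    return alpha * count_nodes(tree) + leaf_impurity_sum(tree, impurity)


stump = ["a", "a", "b", "b"]                       # a single impure leaf
split = {"children": [["a", "a"], ["b", "b"]]}     # one split, two pure leaves

for alpha in (0.1, 0.6):
    print(alpha, global_criterion(stump, alpha), global_criterion(split, alpha))
```

With a small penalty (α = 0.1) the split tree has the lower criterion value, while with a large penalty (α = 0.6) the unsplit leaf wins. This is the sense in which α trades complexity for accuracy.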

Stopping by Testing the Statistical Significance

During construction, estimate the distribution of the impurity gradients Δi for the current collection of nodes.
For any candidate split, estimate whether it is statistically different from zero.
One possibility is the chi-squared test.
More generally, we can consider a hypothesis testing approach to stopping: we seek to determine whether a candidate split differs significantly from a random split.
Suppose we have n samples at node N. A particular split s sends Pn patterns to the left branch and (1 − P)n patterns to the right branch.
A random split would place $P n_1$ of the $\omega_1$ samples to the left, $P n_2$ of the $\omega_2$ samples to the left, and corresponding amounts to the right.

The chi-squared statistic calculates the deviation of a particular split s from this random one:

$$ \chi^2 = \sum_{i=1}^{2} \frac{(n_{iL} - n_{ie})^2}{n_{ie}} \tag{10} $$

where $n_{iL}$ is the number of $\omega_i$ patterns sent to the left under s, and $n_{ie} = P n_i$ is the number expected by the random rule.
The larger the chi-squared statistic, the more the candidate split deviates from a random one.
When it is greater than a critical value (based on desired significance bounds), we reject the null hypothesis (the random split) and proceed with s.
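
The statistic in Eq. (10) is simple to compute directly. The sketch below is illustrative: the function name, the example counts, and the choice of a 0.05 significance level with one degree of freedom (critical value 3.841) are assumptions made here.

```python
def chi_squared(n_left_by_class, n_by_class):
    """Eq. (10): deviation of split s from a random split with the same P.

    n_left_by_class[i] is n_iL, the number of class-i patterns sent left.
    n_by_class[i] is n_i, the number of class-i patterns at the node.
    """
    n = sum(n_by_class)
    p = sum(n_left_by_class) / n   # fraction P of all patterns sent left
    stat = 0.0
    for n_il, n_i in zip(n_left_by_class, n_by_class):
        n_ie = p * n_i             # expected count under the random rule
        if n_ie > 0:               # skip empty classes or degenerate splits
            stat += (n_il - n_ie) ** 2 / n_ie
    return stat


CRITICAL_VALUE = 3.841  # assumed: 0.05 significance level, one degree of freedom

informative = chi_squared([9, 1], [10, 10])   # sends most of class 1 left
uninformative = chi_squared([5, 5], [10, 10]) # looks exactly like a random split
print(informative, informative > CRITICAL_VALUE)
print(uninformative, uninformative > CRITICAL_VALUE)
```

With 10 patterns per class and P = 0.5, each class is expected to send 5 patterns left. The first split gives (16 + 16) / 5 = 6.4, which exceeds the critical value, so the split is kept. The second gives 0, and splitting stops.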

Pruning

Tree construction based on “when to stop splitting” biases the learning algorithm toward trees in which the greatest impurity reduction occurs near the root. It makes no attempt to look ahead at what splits may occur in the leaf and beyond.
Pruning is the principal alternative strategy for tree construction.
In pruning, we exhaustively build the tree. Then, all pairs of neighboring leaf nodes are considered for elimination.
Any pair whose elimination yields a satisfactory (small) increase in impurity is eliminated and the common ancestor node is declared a leaf.
Unbalanced trees often result from this style of pruning/merging.
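
A bottom-up merging pass of this kind might be sketched as follows. This is not the exact procedure from DHS. The tree encoding (a leaf is a list of training labels, an internal node is a dict with `"query"` and `"children"`), the weighted-entropy measure of the impurity increase, and the threshold name `max_increase` are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def prune(tree, max_increase, impurity=entropy):
    """Merge sibling leaves whenever doing so raises impurity only slightly."""
    if isinstance(tree, list):
        return tree  # already a leaf
    children = [prune(child, max_increase, impurity) for child in tree["children"]]
    if all(isinstance(child, list) for child in children):
        merged = [label for child in children for label in child]
        n = len(merged)
        before = sum(len(child) / n * impurity(child) for child in children)
        if impurity(merged) - before <= max_increase:
            return merged  # the common ancestor is declared a leaf
    return {**tree, "children": children}


full_tree = {
    "query": "q0",
    "children": [
        {"query": "q1", "children": [["a"] * 4, ["a", "a", "a", "b"]]},
        {"query": "q2", "children": [["b"] * 4, ["a"] * 4]},
    ],
}
pruned = prune(full_tree, max_increase=0.2)
print(pruned)
```

Here the leaves under `q1` are merged (the increase is about 0.14 bits), while those under `q2` are kept (merging would cost a full bit). The result is an unbalanced tree, as noted above.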

Assignment of Leaf Node Labels

This part is easy...a particular leaf node should make the label
assignment based on the distribution of samples in it during training.
Take the label of the maximally represented class.
We will see clear justification for this in the next chapter on Decision
Theory.
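
In code, the majority rule is a one-liner. The function name `leaf_label` is an assumption, and ties are broken here by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def leaf_label(training_labels):
    """Return the most frequent class among the samples that reached the leaf."""
    return Counter(training_labels).most_common(1)[0][0]


print(leaf_label(["Yes", "No", "Yes", "Yes"]))
```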

Instability of the Tree Construction

Importance of Feature Choice

The selection of features will ultimately play a major role in accuracy,
generalization, and complexity.
This is an instance of the Ugly Duckling principle.
[Figure 8.5 (DHS): two panels, each showing regions R1 and R2 in the (x1, x2) unit square next to the corresponding tree. Top: a tree of axis-parallel tests (x1 < 0.27, x2 < 0.32, x2 < 0.6, x1 < 0.07, x1 < 0.55, x2 < 0.86, x1 < 0.81). Bottom: a single test, -1.2 x1 + x2 < 0.1.]
FIGURE 8.5. If the class of node decisions does not match the form of the training data, a very complicated decision tree will result, as shown at the top. Here decisions are ...


Furthermore, the use of multiple variables in selecting a decision rule may greatly improve the accuracy and generalization.
[Figure 8.6 (DHS): two panels, each showing regions R1 and R2 in the (x1, x2) unit square next to the corresponding tree. Top: a tree of axis-parallel tests (x2 < 0.5, x1 < 0.95, x2 < 0.56, x2 < 0.54). Bottom: a tree of general linear tests (0.04 x1 + 0.16 x2 < 0.11, 0.27 x1 - 0.44 x2 < -0.02, 0.96 x1 - 1.77 x2 < -0.45, 5.43 x1 - 13.33 x2 < -6.03).]
FIGURE 8.6. One form of multivariate tree employs general linear decisions at each ...

Decision Trees ID3

ID3 Method

ID3 is another tree growing method.

It assumes nominal inputs.
Every split has a branching factor Bj , where Bj is the number of
discrete attribute bins of the variable j chosen for splitting.
These are, hence, seldom binary.
The number of levels in the tree is at most the number of input variables.
The algorithm continues until all nodes are pure or there are no more
variables on which to split.

Decision Trees C4.5

C4.5 Method (in brief)

This is a successor to the ID3 method.

It handles real-valued variables like CART and uses the ID3 multiway splits for nominal data.
Pruning is performed based on statistical significance tests.

Decision Trees Example

Example from T. Mitchell Book: PlayTennis

Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Which attribute is the best classifier?

S: [9+,5-], E = 0.940

Humidity
  High: [3+,4-], E = 0.985
  Normal: [6+,1-], E = 0.592
Gain(S, Humidity) = .940 - (7/14).985 - (7/14).592 = .151

Wind
  Weak: [6+,2-], E = 0.811
  Strong: [3+,3-], E = 1.00
Gain(S, Wind) = .940 - (8/14).811 - (6/14)1.0 = .048
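
These gains can be checked with a short script over the PlayTennis table. The helper names (`entropy`, `gain`, `ROWS`, `GAINS`) are assumptions made here, and the data are transcribed from the table above.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# Rows D1..D14 of the PlayTennis table; the last column is the class label.
TABLE = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [dict(zip(ATTRIBUTES + ["PlayTennis"], line.split()))
        for line in TABLE.splitlines()]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attribute, target="PlayTennis"):
    """Information gain of a multiway split on one nominal attribute."""
    labels = [r[target] for r in rows]
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


GAINS = {a: gain(ROWS, a) for a in ATTRIBUTES}
for attribute, value in GAINS.items():
    print(f"{attribute:12s} {value:.3f}")
```

Without rounding the intermediate entropies, the gain for Humidity comes out near 0.152 (the .151 above results from the rounded values) and the gain for Wind near 0.048. Outlook has the largest gain of the four attributes, so it is tested at the root.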

{D1, D2, ..., D14} [9+,5−]

Outlook
  Sunny: {D1,D2,D8,D9,D11} [2+,3−] → ?
  Overcast: {D3,D7,D12,D13} [4+,0−] → Yes
  Rain: {D4,D5,D6,D10,D14} [3+,2−] → ?

Which attribute should be tested here?

Ssunny = {D1,D2,D8,D9,D11}

Gain (Ssunny , Humidity) = .970 − (3/5) 0.0 − (2/5) 0.0 = .970

Gain (Ssunny , Temperature) = .970 − (2/5) 0.0 − (2/5) 1.0 − (1/5) 0.0 = .570
Gain (Ssunny , Wind) = .970 − (2/5) 1.0 − (3/5) .918 = .019

Hypothesis Space Search by ID3

[Figure: hypothesis space search by ID3, showing successively larger candidate trees built from tests on attributes A1, A2, A3, A4 over positive (+) and negative (–) examples.]

Learned Tree

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes
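
The learned tree can be reproduced with a compact ID3 sketch. This is a minimal illustration rather than Quinlan's or Mitchell's exact formulation: the nested-dict output format, the helper names, and the majority-vote fallback when attributes run out are assumptions made here.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"
# Rows D1..D14 of the PlayTennis table; the last column is the class label.
TABLE = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
ROWS = [dict(zip(ATTRIBUTES + [TARGET], line.split())) for line in TABLE.splitlines()]


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, attribute):
    """Information gain of a multiway split on one nominal attribute."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[TARGET] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[TARGET] for r in rows]) - remainder


def id3(rows, attributes):
    """Return a label (leaf) or {attribute: {value: subtree}} (internal node)."""
    labels = [r[TARGET] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: stop
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # no variables left
    best = max(attributes, key=lambda a: gain(rows, a))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining)
                   for value in sorted({r[best] for r in rows})}}


TREE = id3(ROWS, ATTRIBUTES)
print(TREE)
```

Each split is multiway, with one branch per observed attribute value, and an attribute is never reused below the node that tests it. On this table, Temperature is never selected.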

Overfitting Instance
Consider adding a new, noisy training example #15:
Sunny, Hot, Normal, Strong, PlayTennis = No
What effect would it have on the earlier tree?
[Figure: accuracy (vertical axis, 0.5 to 0.9) on training data and on test data; the horizontal axis runs from 0 to 100.]

57 pages
Unit 15
No ratings yet
Unit 15
12 pages
Decision Tree
No ratings yet
Decision Tree
74 pages
14 2 DT
No ratings yet
14 2 DT
40 pages
23 Ens RandomForests
No ratings yet
23 Ens RandomForests
27 pages
Decision Tree Classification Guide
No ratings yet
Decision Tree Classification Guide
23 pages
Pks Machine Learning Module 3 2
No ratings yet
Pks Machine Learning Module 3 2
80 pages
Decision Trees
100% (6)
Decision Trees
28 pages
Decision Trees
No ratings yet
Decision Trees
15 pages
Lecture 5a
No ratings yet
Lecture 5a
24 pages
Aiml M4 C1
No ratings yet
Aiml M4 C1
101 pages
ML Chapter 4 Part2
No ratings yet
ML Chapter 4 Part2
75 pages
Understanding Decision Trees
No ratings yet
Understanding Decision Trees
33 pages
Decision Tree
No ratings yet
Decision Tree
2 pages
Decision Trees and Probabilistic Models
No ratings yet
Decision Trees and Probabilistic Models
32 pages
5 Intro To Tree Methods LT
No ratings yet
5 Intro To Tree Methods LT
15 pages
Decision Trees
No ratings yet
Decision Trees
26 pages
UNIT-IV - Decision Tree Induction
No ratings yet
UNIT-IV - Decision Tree Induction
19 pages
Decision Tree Structure and Algorithms
No ratings yet
Decision Tree Structure and Algorithms
5 pages
Non-Metric Classification & Decision Trees
No ratings yet
Non-Metric Classification & Decision Trees
35 pages
Dmi Unit 4
No ratings yet
Dmi Unit 4
34 pages
21cs54 Aiml Module4
No ratings yet
21cs54 Aiml Module4
128 pages
Decision Tree
0% (1)
Decision Tree
24 pages
Flower Business in Bangladesh A Study On Jashore District
No ratings yet
Flower Business in Bangladesh A Study On Jashore District
9 pages
Family, Lawyers Sue Apartment Owners After 22-Year-Old Killed in Attempted Dognapping
No ratings yet
Family, Lawyers Sue Apartment Owners After 22-Year-Old Killed in Attempted Dognapping
11 pages
2 Structure
No ratings yet
2 Structure
6 pages
LESSON 2.2a - Organizing Data in Excel
No ratings yet
LESSON 2.2a - Organizing Data in Excel
3 pages
SVMG Svmi
No ratings yet
SVMG Svmi
2 pages
Quotation: Pos. Quantity Description Price/unit Total Price
No ratings yet
Quotation: Pos. Quantity Description Price/unit Total Price
4 pages
G Online Inspire ReadWriteData DocTemp 20241028185502677567575
No ratings yet
G Online Inspire ReadWriteData DocTemp 20241028185502677567575
6 pages
Globalization's Business Impact
No ratings yet
Globalization's Business Impact
6 pages
Applied Economics: Module No. 5: Week 5: First Quarter
No ratings yet
Applied Economics: Module No. 5: Week 5: First Quarter
9 pages
Oxford Companion To Childrens Literature Review
No ratings yet
Oxford Companion To Childrens Literature Review
6 pages
Dealers
100% (2)
Dealers
2 pages
Solved Assignment Information Security
No ratings yet
Solved Assignment Information Security
3 pages
ICDF2025: Digital Forensics Conference Invite
No ratings yet
ICDF2025: Digital Forensics Conference Invite
1 page
2011 Annual Report - English 2011
No ratings yet
2011 Annual Report - English 2011
81 pages
Projectile Simulation Lab
No ratings yet
Projectile Simulation Lab
4 pages
4 We Iot in Der Praxis
No ratings yet
4 We Iot in Der Praxis
42 pages
Grade 6 Detailed Lesson Plan: II. Content Iii. Learning Resources
No ratings yet
Grade 6 Detailed Lesson Plan: II. Content Iii. Learning Resources
5 pages
Carregadores e Baterias para Notebooks
No ratings yet
Carregadores e Baterias para Notebooks
15 pages
AAJ Report: Ten Worst Insurance Companies FINAL
No ratings yet
AAJ Report: Ten Worst Insurance Companies FINAL
29 pages
QAM-SU-6086-A Full Length Automated Ultrasonic Test Equipment Qualification For Line Pipe, Drill Pipe, and Oil Country Tubular Goods
No ratings yet
QAM-SU-6086-A Full Length Automated Ultrasonic Test Equipment Qualification For Line Pipe, Drill Pipe, and Oil Country Tubular Goods
24 pages
wph12 01 Que 20240516
100% (1)
wph12 01 Que 20240516
28 pages
Integration Pathways For Traditional and Digital M
No ratings yet
Integration Pathways For Traditional and Digital M
4 pages
Liquiloans Statement 2022-04-01 To 2022-04-15
No ratings yet
Liquiloans Statement 2022-04-01 To 2022-04-15
1 page
Labor Standards: Atty. Nelson T. Bandoles, LPT
100% (1)
Labor Standards: Atty. Nelson T. Bandoles, LPT
374 pages
Tolentino v. Secretary of Finance - 249 SCRA 635
No ratings yet
Tolentino v. Secretary of Finance - 249 SCRA 635
2 pages
Recent Progress and Future Prospects of Silicon Solar Module Recycling
No ratings yet
Recent Progress and Future Prospects of Silicon Solar Module Recycling
9 pages
Mechanical Design Basics
No ratings yet
Mechanical Design Basics
18 pages
1989 P Cr. L J 291 (Lahore) Before Abdul Majeed Tiwana, J MUHAMMAD SHAFI Appellant Versus THE STATE Respondent
No ratings yet
1989 P Cr. L J 291 (Lahore) Before Abdul Majeed Tiwana, J MUHAMMAD SHAFI Appellant Versus THE STATE Respondent
3 pages
I M Lab Report LabVIEW 1
No ratings yet
I M Lab Report LabVIEW 1
11 pages