Decision Tree
It is a tree-structured learning technique that is used for classification as well as regression.
Function type
Non-linear
Two types of nodes
Decision nodes
o A test is performed to choose between two or more paths
o The test is done based on the values of a feature or attribute of the instance
Leaf nodes
o Indicates the class of an example
o May also carry a value for the example, e.g., a probability or a score
Fig. 1 An example decision tree – Approval of loan
Generating a decision tree
Issues
• Many different decision trees can fit the given data
• They fit it to different extents (different amounts of error)
• With noisy examples, no decision tree fits the data exactly
Given some training examples, what decision tree should be generated?
One proposal:
Prefer the smallest tree that is consistent with the data (Bias)
Possible method:
Search the space of decision trees for the smallest decision tree that fits the data
Features of a smaller tree
• Low depth
• Fewer nodes
Searching algorithm for the smallest decision tree
Searching the decision tree space for the smallest decision tree that fits the data is a
computationally hard problem that would require extremely high computational resources
One proposal
Greedy algorithm
Fig. 2 A snapshot of training data
Building a decision tree
Searching for a good tree
The decision tree space is too large for a systematic search
At a particular point, two choices to make (a sketch of this recursion is given below):
• STOP and
o return a value for the target feature, or
o return a distribution over target feature values
• CONTINUE and
o choose an input feature to split on
o for each value of that feature, build a subtree from the examples that have this value
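To make the STOP/CONTINUE choice concrete, here is a minimal Python sketch of the greedy, recursive procedure. The representation of examples as a list of dicts, the names build_tree and choose_attribute, and the majority-label stopping rule are assumptions made for illustration, not part of these notes.

from collections import Counter

def choose_attribute(examples, attributes, target):
    # Placeholder choice: take the first available attribute.
    # A real learner would pick the attribute with the highest information gain,
    # as developed later in these notes.
    return attributes[0]

def build_tree(examples, attributes, target):
    """Greedy, recursive decision-tree construction (illustrative sketch).

    examples   : list of dicts mapping feature names to values
    attributes : feature names still available for splitting
    target     : name of the target feature
    """
    labels = [e[target] for e in examples]

    # STOP: all examples agree on the target, or nothing is left to split on;
    # return the majority value of the target feature (a leaf).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # CONTINUE: choose an input feature and build one subtree per value.
    a = choose_attribute(examples, attributes, target)
    tree = {a: {}}
    for v in {e[a] for e in examples}:
        subset = [e for e in examples if e[a] == v]
        remaining = [x for x in attributes if x != a]
        tree[a][v] = build_tree(subset, remaining, target)
    return tree

The placeholder choose_attribute keeps the sketch self-contained; the remainder of these notes develops information gain as the proper selection criterion.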
Choosing attribute tests
→ Greedy search to minimize the depth of the final tree
→ Choosing the attribute
→ A “perfect” attribute
→ divides the examples into subsets, each of which is entirely positive or entirely negative
Entropy
→ Measure of the uncertainty of a random variable
→ Acquisition of information corresponds to a reduction in entropy
→ Entropy of a random variable V with values v_k, each with probability P(v_k):
   H(V) = Σ_k P(v_k) log2 (1 / P(v_k)) = −Σ_k P(v_k) log2 P(v_k)
→ A flip of a fair coin is equally likely to come up heads or tails, 0 or 1
→ This counts as “1 bit” entropy
→ Fair coin flip:
   H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5)
   = 1 bit
→ If the coin is loaded to give 99% heads:
   H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01)
   ≈ 0.08 bits
B(q): Entropy of a Boolean random variable
→ True with probability q
→ B(q) = −(q log2 q + (1 − q) log2 (1 − q))
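As a quick numeric check of the coin-flip values above, a short Python snippet (the name boolean_entropy is just a convenient label for B(q)):

import math

def boolean_entropy(q):
    # B(q): entropy in bits of a Boolean variable that is true with probability q
    if q in (0.0, 1.0):              # treat 0 * log2(0) as 0
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

print(boolean_entropy(0.5))          # fair coin   -> 1.0 bit
print(boolean_entropy(0.99))         # loaded coin -> about 0.0808 bits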
Decision Tree Learning
→ A training set (E) has p positive and n negative examples
→ The entropy of the goal attribute on the whole set:
   H(Goal) = B(p / (p + n))
Choosing the attribute tests
→ The greedy search used to approximately minimize the depth of the final tree.
An attribute A
→ has d distinct values
→ The values divide the training set E into subsets E_1, ..., E_d
Each subset E_k has
→ p_k positive examples and n_k negative examples
→ In that branch
→ We need an additional B(p_k / (p_k + n_k)) bits of information to answer the
question
→ A randomly chosen example has the k-th value of the attribute A
with probability (p_k + n_k) / (p + n)
For attribute A
→ The expected entropy remaining after testing attribute A:
→ Remainder(A) = Σ_{k=1..d} (p_k + n_k) / (p + n) · B(p_k / (p_k + n_k))
Information gain from the attribute test
→ Expected reduction in entropy
→ Gain(A) = B(p / (p + n)) − Remainder(A)
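Both quantities can be computed directly from the per-branch counts (p_k, n_k). The sketch below assumes the counts are supplied as a list of pairs; the sample call uses the Wind split from the table below, so the printed value should match the worked example at the end of this section.

import math

def B(q):
    # Entropy (bits) of a Boolean variable that is true with probability q
    return 0.0 if q in (0.0, 1.0) else -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def remainder(branches):
    # branches = [(p_k, n_k), ...]; expected entropy left after testing the attribute
    total = sum(p + n for p, n in branches)
    return sum((p + n) / total * B(p / (p + n)) for p, n in branches)

def gain(branches):
    # Gain(A) = B(p / (p + n)) - Remainder(A)
    p = sum(pk for pk, _ in branches)
    n = sum(nk for _, nk in branches)
    return B(p / (p + n)) - remainder(branches)

# Wind from the Play Tennis table below: Weak branch (6+, 2-), Strong branch (3+, 3-)
print(round(gain([(6, 2), (3, 3)]), 3))   # -> 0.048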
Example:
Day Outlook Temperature Humidity Wind Play Tennis
X1 Sunny Hot High Weak No
X2 Sunny Hot High Strong No
X3 Overcast Hot High Weak Yes
X4 Rain Mild High Weak Yes
X5 Rain Cool Normal Weak Yes
X6 Rain Cool Normal Strong No
X7 Overcast Cool Normal Strong Yes
X8 Sunny Mild High Weak No
X9 Sunny Cool Normal Weak Yes
X10 Rain Mild Normal Weak Yes
X11 Sunny Mild Normal Strong Yes
X12 Overcast Mild High Strong Yes
X13 Overcast Hot Normal Weak Yes
X14 Rain Mild High Strong No
A decision tree
Entropy measures homogeneity of examples
→ 9 positive and 5 negative examples: p = 9, n = 5
→ Entropy: Entropy(S) = −(9/14) log2 (9/14) − (5/14) log2 (5/14)
   = 0.940
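A one-line check of this value in Python:

import math

p, n = 9, 5
entropy_S = -(p / (p + n)) * math.log2(p / (p + n)) - (n / (p + n)) * math.log2(n / (p + n))
print(f"{entropy_S:.3f}")   # -> 0.940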
Boolean Classification
[Plot: entropy of a Boolean classification as a function of the proportion of positive examples]
→ Entropy is 0 if all examples belong to the same class
→ Entropy is 1 if half of the examples belong to each class
→ Otherwise it is between 0 and 1
Entropy
→ If the target attribute can take on c different values, then the entropy of S
(the set of examples) relative to this c-wise classification is defined as
   Entropy(S) = Σ_{i=1..c} −p_i log2 p_i
   where p_i is the proportion of S belonging to class i
Information Gain Measures
→ The expected reduction in Entropy
→ Caused by partitioning the examples according to this attribute
→ S = set of examples and A = attribute
→ Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
→ Values(A)
→ Set of all possible values for attribute A
→ S_v
→ Subset of S for which attribute A has value v
Entropy(S)
→ Entropy of the original collection S
→ The second term, Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)
→ Expected value of the entropy after S is partitioned using attribute A
→ Sum of the entropies of each subset S_v, weighted by the fraction |S_v| / |S| of examples
that belong to S_v
Gain(S, A)
→ Expected reduction in entropy caused by knowing the value of attribute A
→ The information provided about the target function value, given the value of some
other attribute A
→ The value of Gain(S, A) is the number of bits saved when encoding the target value of
an arbitrary member of S, by knowing the value of attribute A
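A compact sketch of Gain(S, A) computed from a list of labelled examples; the entropy function is the multi-class form given above, and the dict-based representation of examples (attribute names as keys) is an assumption made for illustration.

import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = sum over classes i of -p_i * log2(p_i)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    # Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)
    labels = [e[target] for e in examples]
    expected_after = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == v]
        expected_after += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - expected_after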
→ Application
→ Values(Wind) = {Weak, Strong}
→ S: (p = 9, n = 5)
→ S_weak: (p = 6, n = 2)
→ S_strong: (p = 3, n = 3)
→ Gain(S, Wind)
   = Entropy(S) − Σ_{v ∈ {Weak, Strong}} (|S_v| / |S|) Entropy(S_v)
   = Entropy(S) − (8/14) Entropy(S_weak) − (6/14) Entropy(S_strong)
   = 0.940 − (8/14)(0.811) − (6/14)(1.00)
   = 0.048
Values(Humidity) = {High, Normal}
→ S: (p = 9, n = 5)
→ S_high: (p = 3, n = 4)
→ S_normal: (p = 6, n = 1)
Gain(S, Humidity)
   = Entropy(S) − Σ_{v ∈ {High, Normal}} (|S_v| / |S|) Entropy(S_v)
   = Entropy(S) − (7/14) Entropy(S_high) − (7/14) Entropy(S_normal)
   = 0.940 − (7/14)(0.985) − (7/14)(0.592)
   = 0.151
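A self-contained check of both gains against the counts read off the Play Tennis table. Note that full-precision arithmetic gives 0.152 for Humidity; the 0.151 above comes from rounding the intermediate entropies to three decimals.

import math

def B(q):
    # Entropy (bits) of a Boolean variable that is true with probability q
    return 0.0 if q in (0.0, 1.0) else -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def gain(branches):
    # branches = [(p_k, n_k), ...] for each value of the attribute
    p = sum(pk for pk, _ in branches)
    n = sum(nk for _, nk in branches)
    rem = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in branches)
    return B(p / (p + n)) - rem

print(f"Gain(S, Wind)     = {gain([(6, 2), (3, 3)]):.3f}")   # -> 0.048
print(f"Gain(S, Humidity) = {gain([(3, 4), (6, 1)]):.3f}")   # -> 0.152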