NPTEL
Video Course on Machine Learning
Professor Carl Gustaf Jansson, KTH
Week 4 Inductive Learning based on
Symbolic Representations
and Weak Theories
Video 4.3 Decision Tree Learning Algorithms Part 1
Agenda for the lecture
• Decision Trees in general
• TDIDT algorithms
• Information theoretical measures
• ID3 algorithm
• Treating overfitting through pruning
• Alternative algorithms
Decision Trees
The challenge is to design a decision tree that both fits the considered
data-items well and gives good predictive performance on still unseen
data-items (minimal prediction error).
Decision tree analysis is of two main types:
- Classification tree analysis when the leaves are labelled according to
the k target classes included in the data-set
- Regression tree analysis when the leaves are real numbers or intervals.
Decision trees represent a disjunction of conjunctions of constraints on
the feature values of instances, i.e. (... ∧ ... ∧ ...) ∨ (... ∧ ... ∧ ...) ∨ ...
A decision tree can also be seen as equivalent to a set of if-then rules,
where each branch represents one if-then rule: the if-part corresponds to the
conjunction of feature tests on the nodes along the branch, and the then-part
corresponds to the class label or numerical range of the branch.
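As an illustration (a minimal sketch, not taken from the lecture slides), the Python fragment below encodes one possible tree over the Outlook, Humidity and Wind features that appear in the worked example later in this lecture; each root-to-leaf branch is written as one if-then rule whose if-part is a conjunction of feature tests.

```python
def classify(instance):
    """Each root-to-leaf branch corresponds to one if-then rule:
    a conjunction of feature tests (if-part) ending in a class label (then-part)."""
    if instance["Outlook"] == "Sunny" and instance["Humidity"] == "Normal":
        return "Yes"   # (Outlook = Sunny ∧ Humidity = Normal)
    if instance["Outlook"] == "Overcast":
        return "Yes"   # (Outlook = Overcast)
    if instance["Outlook"] == "Rain" and instance["Wind"] == "Weak":
        return "Yes"   # (Outlook = Rain ∧ Wind = Weak)
    return "No"        # all remaining branches

print(classify({"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Weak"}))  # -> Yes
```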
Decision Trees and Learning Algorithms
Pros
• Easily interpretable by humans
• Very compact formalism
• Easy handling of irrelevant attributes
• Reasonable handling of missing data and noise
• Very fast at testing time
Cons
• Only axis-aligned splits of data-items
• Greedy and may not find the globally optimal tree.
Learning for this Representation
We will now focus on a particular category of learning techniques called
Top-Down Induction of Decision Trees (TDIDT).
The scenario for learning is supervised non-incremental data-driven learning from examples.
The systems are presented with a set of instances and develop a decision tree from the top down, guided
by frequency information in the examples. The trees are constructed beginning with the root of the tree
and proceeding down to its leaves.
The order in which instances are handled is not supposed to influence the build-up of the trees.
The systems typically examine and re-examine all of the instances at many stages during learning.
Building the tree from the top downward, the issue is to choose and order features that
discriminate the data-items (instances) in an optimal way; a minimal skeleton of this procedure is sketched after the subtopics below.
Subtopics:
- Use of information theoretic measures to guide the selection and ordering of features
- Avoiding underfitting and overfitting by pruning of the tree
- Generation of several decision trees in parallel (e.g. random forest).
- Introduction of some kind of Inductive Bias (e.g. Occam's razor)
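As a hedged sketch of the TDIDT scheme described above, assuming categorical features, training examples given as (feature_dict, label) pairs, and a caller-supplied choose_best_feature function (for instance one based on the Information Gain measure introduced below):

```python
from collections import Counter

def tdidt(examples, features, choose_best_feature):
    """Builds a tree top-down; examples is a list of (feature_dict, label) pairs."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:                      # pure node -> leaf with that class
        return labels[0]
    if not features:                               # no features left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = choose_best_feature(examples, features) # greedy, frequency-guided choice
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:   # one branch per observed value (subsets are non-empty)
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [f for f in features if f != best]
        tree[best][value] = tdidt(subset, remaining, choose_best_feature)
    return tree
```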
Purity or Homogeneity
The entire Data-set (all training instances) is associated
with the tree as a whole (the Root).
For every decision split based on a chosen feature and its
values, the Data-set is partitioned and the sub-sets
become associated with the nodes.
This is repeated recursively down to the leaves.
Purity or homogeneity refers to the distribution of data-items over the k
target classes, both for the root and for each of the nodes. A lower degree
of class mixing implies higher purity.
Most algorithms aim to maximize the purity of all nodes.
The purity or impurity of nodes is measured by one of several
alternative information-theoretic metrics.
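As a small illustration using the class counts that appear in the worked example later in this lecture (a [9+, 5−] root split into [6+, 2−] and [3+, 3−] by the Wind feature), the sketch below prints the class distribution of each node; the majority-class fraction is used here only as a rough purity indicator, while the lecture's actual metrics follow next.

```python
from collections import Counter

root   = ["+"] * 9 + ["-"] * 5        # all 14 training instances
weak   = ["+"] * 6 + ["-"] * 2        # subset with Wind = Weak
strong = ["+"] * 3 + ["-"] * 3        # subset with Wind = Strong

for name, node in [("root", root), ("Wind=Weak", weak), ("Wind=Strong", strong)]:
    dist = Counter(node)
    purity = max(dist.values()) / len(node)   # fraction of the majority class
    print(name, dict(dist), f"majority-class purity = {purity:.2f}")
```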
Information Theoretic Measures
Information Gain based on Entropy
Information Gain based on Gini measure
Variance reduction
Information Gain and Entropy measures
Information Gain is a statistical measure that indicates how well
a given feature F separates (discriminates) the instances of an arbitrary collection
of examples S according to the target classes. |S| = cardinality of S.
Gain(S, F) = Entropy(S) − Σ_{v ∈ values(F)} (|Sv| / |S|) * Entropy(Sv)
Sv = the subset of S whose instances have value v for feature F.
Entropy is a statistical measure from information theory that characterizes
impurity of an arbitrary collection of examples = S.
For binary classification: H(S) = −p(+) log2 p(+) − p(−) log2 p(−)
For n-ary classification: H(S) = − Σ_{c ∈ target classes} p(c) * log2 p(c)
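A minimal Python sketch of these two formulas, assuming (as a representation choice for this example, not taken from the lecture) that a data set is a list of (feature_dict, class_label) pairs:

```python
from collections import Counter
from math import log2

def entropy(examples):
    """H(S) = - sum over classes c of p(c) * log2 p(c)."""
    labels = [y for _, y in examples]
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(S, F) = H(S) - sum over values v of (|Sv|/|S|) * H(Sv)."""
    n = len(examples)
    gain = entropy(examples)
    for value in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == value]
        gain -= (len(subset) / n) * entropy(subset)   # weighted entropy of the subset Sv
    return gain
```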
Gini Gain and Gini Impurity
Gini Gain is a statistical measure that indicates how well a given feature F
separates the instances in a given set S. |S|= cardinality of S.
Gini Gain(S, F) = Gini(S) − Σ_{v ∈ values(F)} (|Sv| / |S|) * Gini(Sv)
Sv = the subset of S whose instances have value v for feature F.
Gini impurity measures the likelihood that a new instance would be incorrectly
classified if it were randomly labelled according to the distribution of class
labels in the data-set.
Gini impurity: Gini(S) = 1 − Σ_{c ∈ target classes} p(c)^2
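The corresponding sketch for the Gini measures, under the same assumed (feature_dict, class_label) representation as in the entropy example:

```python
from collections import Counter

def gini(examples):
    """Gini(S) = 1 - sum over classes c of p(c)^2."""
    labels = [y for _, y in examples]
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(examples, feature):
    """Gini Gain(S, F) = Gini(S) - sum over values v of (|Sv|/|S|) * Gini(Sv)."""
    n = len(examples)
    gain = gini(examples)
    for value in {x[feature] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[feature] == value]
        gain -= (len(subset) / n) * gini(subset)      # weighted Gini impurity of Sv
    return gain
```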
Example
Example cont.
Entropy H of S (whole Data-set)
S={D1,...,D14}=[9+,5−]
H(S) = −(9/14) * log2(9/14) − (5/14) * log2(5/14) = 0.940
Information Gain for Wind feature
S for Weak value ={D1,D3,D4,D5,D8,D9,D10,D13}=[6+,2−]
S for Strong value = {D2,D6,D7,D11,D12,D14} = [3+,3−]
Gain(S, Wind) = H(S) − Σ_{v ∈ values(Wind)} (|Sv| / |S|) * H(Sv)
= H(S) − (8/14) * H(S_Weak) − (6/14) * H(S_Strong) = 0.940 − (8/14) * 0.811 − (6/14) * 1.0 = 0.048
Information gains for the four features:
Gain(S,Outlook)=0.246 Gain(S,Humidity)=0.151
Gain(S,Wind)=0.048 Gain(S,Temperature)=0.029
Outlook has the highest Information Gain and is the preferred feature to discriminate
among data-items.
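The numbers above can be checked directly from the class counts [9+, 5−], [6+, 2−] and [3+, 3−] given in the example (the full table of days is not reproduced here):

```python
from math import log2

def H(pos, neg):
    """Binary entropy from class counts."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c > 0)

H_S      = H(9, 5)   # whole data set, about 0.940
H_weak   = H(6, 2)   # Wind = Weak,   about 0.811
H_strong = H(3, 3)   # Wind = Strong, exactly 1.0
gain_wind = H_S - (8 / 14) * H_weak - (6 / 14) * H_strong
print(round(H_S, 3), round(gain_wind, 3))   # -> 0.94 0.048
```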
Example cont.
Gini impurity of S (whole Data-set)
S={D1,...,D14}=[9+,5−]
Gini(S) = 1 − (9/14)^2 − (5/14)^2 = ...
Gini impurity for Wind feature
S for Weak value ={D1,D3,D4,D5,D8,D9,D10,D13}=[6+,2−]
S for Strong value = {D2,D6,D7,D11,D12,D14} = [3+,3−]
Gini(S_Weak) = 1 − (6/8)^2 − (2/8)^2 = ...
Gini(S_Strong) = 1 − (3/6)^2 − (3/6)^2 = ...
Gini Gain for Wind feature
Gini Gain(S, Wind) = Gini(S) − Σ_{v ∈ values(Wind)} (|Sv| / |S|) * Gini(Sv)
= Gini(S) − (8/14) * Gini(S_Weak) − (6/14) * Gini(S_Strong) = ...
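The Gini values left open above can be filled in the same way from the class counts:

```python
def gini_counts(pos, neg):
    """Binary Gini impurity from class counts."""
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

G_S      = gini_counts(9, 5)   # about 0.459
G_weak   = gini_counts(6, 2)   # 0.375
G_strong = gini_counts(3, 3)   # 0.5
gini_gain_wind = G_S - (8 / 14) * G_weak - (6 / 14) * G_strong
print(round(G_S, 3), round(gini_gain_wind, 3))   # -> 0.459 0.031
```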
To be continued in Part 2