
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 4: Inductive Learning based on Symbolic Representations and Weak Theories

Video 4.3 Decision Tree Learning Algorithms Part 1


Agenda for the lecture

• Decision Trees in general
• TDIDT algorithms
• Information theoretical measures
• ID3 algorithm
• Treating overfitting through pruning
• Alternative algorithms
Decision Trees
The challenge is to design a decision tree that optimizes both the fit to the considered data-items and the predictive performance on as-yet unseen data-items (minimal prediction error).

Decision tree analysis is of two main types:

- Classification tree analysis, when the leaves are labelled according to the k target classes included in the data-set
- Regression tree analysis, when the leaves are real numbers or intervals.

Decision trees represent a disjunction of conjunctions of constraints on the feature values of instances, i.e. (... ∧ ... ∧ ...) ∨ (... ∧ ... ∧ ...) ∨ ...

A decision tree can also be seen as equivalent to a set of if-then rules, where each branch represents one if-then rule: the if-part corresponds to the conjunction of feature tests on the nodes and the then-part corresponds to the class label or numerical range of the branch.
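To make this equivalence concrete, the sketch below writes a small tree directly as nested if-then rules in Python. The tree shown (over the Outlook, Humidity and Wind features used in the example later in this lecture) is only illustrative:

```python
def classify(instance):
    """Classify an instance with a hand-written decision tree.

    Each root-to-leaf branch is one if-then rule, e.g.
    (Outlook = Sunny ∧ Humidity = Normal) -> Yes.
    """
    if instance["Outlook"] == "Sunny":
        return "Yes" if instance["Humidity"] == "Normal" else "No"
    elif instance["Outlook"] == "Overcast":
        return "Yes"
    else:  # Outlook = Rain
        return "Yes" if instance["Wind"] == "Weak" else "No"

print(classify({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No
```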

Decision Trees and Learning Algorithms
Pros

• Easily interpretable by humans
• Very compact formalism
• Easy handling of irrelevant attributes
• Reasonable handling of missing data and noise
• Very fast at testing time

Cons

• Only axis-aligned splits of data-items
• Greedy and may not find the globally optimal tree.
Learning for this Representation
We will now focus on a particular category of learning techniques called Top-Down Induction of Decision Trees (TDIDT).

The scenario for learning is supervised, non-incremental, data-driven learning from examples.

The systems are presented with a set of instances and develop a decision tree from the top down, guided by frequency information in the examples. The trees are constructed beginning with the root of the tree and proceeding down to its leaves.

The order in which instances are handled is not supposed to influence the build-up of the trees.
The systems typically examine and re-examine all of the instances at many stages during learning.

Building the tree from the top downward, the issue is to choose and order the features that discriminate the data-items (instances) in an optimal way.
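A minimal sketch of the TDIDT recursion, assuming examples are given as (feature-dict, label) pairs and that a scoring function for splits (e.g. one of the information-theoretic measures introduced below) is supplied; the function and parameter names are illustrative, not the exact pseudocode of any particular system:

```python
from collections import Counter

def tdidt(examples, features, score):
    """Top-Down Induction of Decision Trees (illustrative sketch).

    examples: list of (feature_dict, label) pairs
    features: list of feature names still available for splitting
    score:    function(examples, feature) -> goodness of the split
    """
    labels = [label for _, label in examples]
    # Stop if the node is pure or no features remain: return the majority label.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedily pick the feature that best discriminates the examples.
    best = max(features, key=lambda f: score(examples, f))
    tree = {"feature": best, "children": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        remaining = [f for f in features if f != best]
        tree["children"][v] = tdidt(subset, remaining, score)
    return tree
```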

Subtopics:
- Use of information theoretic measures to guide the selection and ordering of features
- Avoiding underfitting and overfitting by pruning of the tree
- Generation of several decision trees in parallel (e.g. random forests)
- Introduction of some kind of inductive bias (e.g. Occam's razor)
Purity or Homogeneity
The entire data-set (all training instances) is associated with the tree as a whole (the root).

For every decision split based on a chosen feature and its values, the data-set is partitioned and the sub-sets become associated with the nodes.

This is repeated recursively down to the leaves.

Purity or homogeneity refers to the distribution of data-items over the k target classes, both for the root and for each of the nodes. A lower degree of class mixing implies higher purity.

Most algorithms aim to maximize the purity of all nodes.

The purity or impurity of nodes is measured by a set of alternative information theoretic metrics.
Information Theoretic Measures

Information Gain based on Entropy

Information Gain based on Gini measure

Variance reduction
Information Gain and Entropy measures

Information Gain is a statistical measure that indicates how well a given feature F separates (discriminates) the instances of an arbitrary collection of examples S according to the target classes. |S| = the cardinality of S.

Gain(S, F) = Entropy(S) − Σ (v ∈ values(F)) (|S_v| / |S|) · Entropy(S_v)

S_v = the subset of S whose instances have value v for feature F.

Entropy is a statistical measure from information theory that characterizes the impurity of an arbitrary collection of examples S.

For binary classification: H(S) = −p(+) log2 p(+) − p(−) log2 p(−)

For n-ary classification: H(S) = − Σ (c ∈ target classes) p(c) log2 p(c)
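A minimal Python sketch of these two definitions, assuming the same (feature-dict, label) representation of examples as in the TDIDT sketch above:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = - sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())

def information_gain(examples, feature):
    """Gain(S, F) = H(S) - sum over values v of (|S_v|/|S|) * H(S_v)."""
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for v in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain
```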
Gini Gain and Gini Impurity

Gini Gain is a statistical measure that indicates how well a given feature F separates the instances in a given set S. |S| = the cardinality of S.

Gini Gain(S, F) = Gini(S) − Σ (v ∈ values(F)) (|S_v| / |S|) · Gini(S_v)

S_v = the subset of S whose instances have value v for feature F.

Gini impurity measures the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels in the data-set.

Gini impurity: Gini(S) = 1 − Σ (c ∈ target classes) p(c)²
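The corresponding sketch for the Gini measures, under the same assumptions as above:

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes c of p(c)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(examples, feature):
    """Gini Gain(S, F) = Gini(S) - sum over values v of (|S_v|/|S|) * Gini(S_v)."""
    labels = [y for _, y in examples]
    gain = gini(labels)
    for v in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == v]
        gain -= (len(subset) / len(examples)) * gini(subset)
    return gain
```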


Example: the PlayTennis data-set (days D1–D14, features Outlook, Temperature, Humidity and Wind; class counts [9+, 5−])
Example cont.
Entropy H of S (the whole data-set)
S = {D1, ..., D14} = [9+, 5−]
H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940

Information Gain for the Wind feature

S_Weak = {D1, D3, D4, D5, D8, D9, D10, D13} = [6+, 2−]
S_Strong = {D2, D6, D7, D11, D12, D14} = [3+, 3−]
Gain(S, Wind) = H(S) − Σ (v ∈ values(Wind)) (|S_v| / |S|) · H(S_v)
= H(S) − (8/14) · H(S_Weak) − (6/14) · H(S_Strong) = 0.940 − (8/14) · 0.811 − (6/14) · 1.0 = 0.048

Information gains for the four features:

Gain(S, Outlook) = 0.246    Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048    Gain(S, Temperature) = 0.029

Outlook has the highest Information Gain and is the preferred feature to discriminate
among the data-items.
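These figures can be reproduced directly from the class counts stated on the slide ([9+, 5−] overall, [6+, 2−] for Weak, [3+, 3−] for Strong); a small sketch:

```python
import math

def entropy_from_counts(pos, neg):
    """Entropy of a two-class node given its positive/negative counts."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

h_s = entropy_from_counts(9, 5)        # ≈ 0.940
h_weak = entropy_from_counts(6, 2)     # ≈ 0.811
h_strong = entropy_from_counts(3, 3)   # = 1.0
gain_wind = h_s - (8 / 14) * h_weak - (6 / 14) * h_strong
print(round(gain_wind, 3))             # 0.048
```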
Example cont.
Gini impurity of S (the whole data-set)
S = {D1, ..., D14} = [9+, 5−]
Gini(S) = 1 − (9/14)² − (5/14)² = ...

Gini impurity for the Wind feature

S_Weak = {D1, D3, D4, D5, D8, D9, D10, D13} = [6+, 2−]
S_Strong = {D2, D6, D7, D11, D12, D14} = [3+, 3−]
Gini(S_Weak) = 1 − (6/8)² − (2/8)² = ...
Gini(S_Strong) = 1 − (3/6)² − (3/6)² = ...

Gini Gain for the Wind feature

Gini Gain(S, Wind) = Gini(S) − Σ (v ∈ values(Wind)) (|S_v| / |S|) · Gini(S_v)
= Gini(S) − (8/14) · Gini(S_Weak) − (6/14) · Gini(S_Strong) = ...
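The blanks left on the slide follow from the same class counts; a small sketch (values rounded to three decimals):

```python
def gini_from_counts(pos, neg):
    """Gini impurity of a two-class node given its positive/negative counts."""
    total = pos + neg
    return 1.0 - (pos / total) ** 2 - (neg / total) ** 2

g_s = gini_from_counts(9, 5)        # ≈ 0.459
g_weak = gini_from_counts(6, 2)     # = 0.375
g_strong = gini_from_counts(3, 3)   # = 0.5
gini_gain_wind = g_s - (8 / 14) * g_weak - (6 / 14) * g_strong
print(round(gini_gain_wind, 3))     # ≈ 0.031
```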
To be continued in Part 2
