Classification
Prof. Satanik Mitra
BITS Pilani
Decision Tree
• Decision trees are a classification methodology, wherein the classification process is modeled with the
use of a set of hierarchical decisions on the feature variables, arranged in a tree-like structure.
• The decision at a particular node of the tree, which is referred to as the split criterion, is typically a
condition on one or more feature variables in the training data.
• For example, consider the case where Age is an attribute, and the split criterion is Age ≤ 30. In this case,
the left branch of the decision tree contains all training examples with age at most 30, whereas the right
branch contains all examples with age greater than 30.
• The goal is to identify a split criterion so that the level of mixing of the class variable in each branch is reduced as much as possible (see the sketch after this list).
• The main difference from clustering is that the partitioning criterion in the decision tree is supervised
with the class label in the training instances.
• Some classical decision tree algorithms include C4.5, ID3, and CART.
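The following is a minimal sketch, not taken from the slides: the data, the attribute name, and the class labels are invented. It shows how a single univariate split such as Age ≤ 30 partitions the training examples, with each resulting branch labeled by its dominant class.

```python
from collections import Counter

# Hypothetical (Age, class label) training examples.
training = [(23, "buys"), (29, "buys"), (31, "ignores"), (45, "ignores"),
            (27, "buys"), (52, "ignores"), (30, "buys"), (41, "buys")]

# Apply the split criterion Age <= 30.
left = [label for age, label in training if age <= 30]    # left branch
right = [label for age, label in training if age > 30]    # right branch

# Each leaf node would be labeled with the dominant class of the examples it holds.
for name, branch in [("Age <= 30", left), ("Age > 30", right)]:
    counts = Counter(branch)
    dominant = counts.most_common(1)[0][0]
    print(f"{name}: {dict(counts)} -> leaf label '{dominant}'")
```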
Decision Tree
• Splits can be univariate (a condition on a single attribute) or multivariate (a condition on a combination of attributes).
• The tree built by a decision tree induction algorithm has two types of nodes, referred to as
internal nodes and leaf nodes. Each leaf node is labeled with the dominant class
at that node.
• Eventually, the decision tree algorithm stops the growth of the tree based on a
stopping criterion.
• To avoid the degradation in accuracy associated with overfitting, the classifier
uses a postpruning mechanism.
• The goal of the split criterion is to maximize the separation of the different
classes among the child nodes.
• The design of the split criterion depends on the nature of the underlying
attribute.
• Binary attribute: Only one type of split is possible, and the tree is always binary.
• Categorical attribute: If a categorical attribute has r different values, there is more than one way
to split it: an r-way split with one branch per value, or a binary split that groups the values into two sets.
• Numeric attribute: If the numeric attribute contains a small number r of ordered values (e.g.,
integers in a small range [1, r]), an r-way split can be created with one branch per distinct value; for
continuous numeric attributes, a binary condition of the form x ≤ a is typically used (a small sketch of these split types follows this list).
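As a rough illustration of the attribute types above, the sketch below lists candidate splits for each kind. The attribute names, the values, and the candidate_splits helper are all hypothetical and not part of any standard library.

```python
def candidate_splits(name, values, kind):
    """List candidate splits for one attribute (illustration only)."""
    distinct = sorted(set(values))
    if kind == "binary":
        # Only one split is possible, so the tree is binary on this attribute.
        return [[f"{name} = {v}" for v in distinct]]
    if kind == "categorical":
        # r-way split with one branch per distinct value; grouping the r values
        # into two sets to obtain a binary split is the other common option.
        return [[f"{name} = {v}" for v in distinct]]
    if kind == "numeric":
        # Binary splits of the form  attribute <= threshold.
        return [[f"{name} <= {t}", f"{name} > {t}"] for t in distinct[:-1]]
    raise ValueError(f"unknown attribute kind: {kind}")

print(candidate_splits("Smoker", ["Y", "N", "Y"], "binary"))
print(candidate_splits("Colour", ["red", "green", "blue"], "categorical"))
print(candidate_splits("Age", [23, 29, 31, 45], "numeric"))
```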
Decision Tree
• The split criterion needs to be evaluated, and the quality of a split needs to be quantified.
• Error rate: Let p be the fraction of the instances in a set of data points S belonging to the dominant class.
Then, the error rate is simply 1−p.
• Gini index: The Gini index G(S) for a set S of data points may be computed on the class distribution p1 . . . pk
of the training data points in S as G(S) = 1 − Σ_{i=1..k} p_i². Lower values of the Gini index are more desirable. The CART algorithm uses the Gini index as the split criterion.
• Entropy: The entropy measure is used in one of the earliest classification algorithms, referred to as ID3.
The entropy E(S) for a set S may be computed on the class distribution p1 . . . pk of the training data
points in the node as E(S) = − Σ_{i=1..k} p_i · log2(p_i). (A short sketch of these three measures follows this list.)
• Lower values of the entropy are more desirable.
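The three split-quality measures above can be written directly from their formulas. The sketch below is only an illustration (the function names are mine, not from any particular decision tree library); it is evaluated on a class distribution with 9 positive and 7 negative examples, the same proportions used in the worked example later on.

```python
from math import log2

def error_rate(p):
    """1 minus the fraction of the dominant class."""
    return 1.0 - max(p)

def gini_index(p):
    """G(S) = 1 - sum of p_i^2 (split criterion used by CART); lower is purer."""
    return 1.0 - sum(pi ** 2 for pi in p)

def entropy(p):
    """E(S) = -sum of p_i * log2(p_i) (used by ID3); lower is purer."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

dist = [9 / 16, 7 / 16]      # class distribution p1, p2
print(error_rate(dist))      # 0.4375
print(gini_index(dist))      # ≈ 0.4922
print(entropy(dist))         # ≈ 0.9887
```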
Information Gain and Entropy
• The entropy (very common in Information Theory) characterizes the (im)purity of an arbitrary collection
of examples.
• Information Gain is the expected reduction in entropy caused by partitioning the examples according to a
given attribute.
• If we have a set with k different values in it, we can calculate the entropy as follows:
Entropy(S) = − Σ_{i=1..k} P(value_i) · log2 P(value_i)
• where P(value_i) is the probability of getting the i-th value when randomly selecting one from the set.
• So, for the set R = {a, a, a, b, b, b, b, b}:
Entropy(R) = −(3/8)·log2(3/8) − (5/8)·log2(5/8) ≈ 0.9544 (a quick numeric check follows).
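A quick, self-contained numeric check of the R = {a, a, a, b, b, b, b, b} example, assuming base-2 logarithms as in the rest of the slides:

```python
from collections import Counter
from math import log2

R = list("aaabbbbb")
probs = [count / len(R) for count in Counter(R).values()]   # [3/8, 5/8]
print(-sum(p * log2(p) for p in probs))                     # ≈ 0.9544
```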
Dataset
• The training data consists of 16 examples (9 positive, 7 negative), each described by categorical attributes such as Size ∈ {Small, Large} together with a class label.
Information Gain and Entropy
• 16 instances: 9 positive, 7 negative.
• The entropy of the full set is I(S) = −(9/16)·log2(9/16) − (7/16)·log2(7/16) ≈ 0.9887.
• For the attribute “Size”, the left branch (“Small”) contains 8 examples (6 positive, 2 negative) and has entropy 0.8113.
• The right branch (“Large”) contains 8 examples (3 positive, 5 negative) and has entropy 0.9544.
• I(S_Size) = (8/16)·0.8113 + (8/16)·0.9544 = 0.8828.
• We want to calculate the information gain (or entropy reduction), i.e., the reduction in ‘uncertainty’ obtained by
choosing ‘Size’ as our first branch. We will represent information gain as “G”.
• G(Size) = I(S) − I(S_Size)
• G(Size) = 0.9887 − 0.8828 = 0.1059
• We have gained roughly 0.106 bits of information about the dataset by choosing ‘Size’ as the first branch of our
decision tree (the short sketch below reproduces this computation).
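The sketch below reproduces the whole worked example in code. Note one assumption: the per-branch class counts (6 positive / 2 negative for “Small”, 3 positive / 5 negative for “Large”) are inferred from the branch entropies quoted above rather than read from the original data table.

```python
from math import log2

def entropy(counts):
    """Entropy of a class-count vector, in bits."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

whole = [9, 7]   # full dataset: 9 positive, 7 negative
small = [6, 2]   # "Size = Small" branch (assumed breakdown)
large = [3, 5]   # "Size = Large" branch (assumed breakdown)

I_S = entropy(whole)                                               # ≈ 0.9887
I_S_size = (8 / 16) * entropy(small) + (8 / 16) * entropy(large)   # ≈ 0.8828
print(f"G(Size) = {I_S - I_S_size:.4f}")                           # ≈ 0.1058
```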