Classification
Naïve Bayesian classification, Decision trees, Decision rules,
Instance-based methods
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data.
Prediction Problems: Classification vs. Numeric Prediction
Classification
• Predicts categorical class labels (discrete or nominal)
• Classifies data (constructs a model) based on the training set and the values
(class labels) in a classifying attribute and uses it in classifying new data
Numeric Prediction
• Models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
• Credit/loan approval: whether a loan applicant is safe or risky
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Classification
Data classification is a two-step process: a learning step and a classification step.
i. Learning step (or Training phase or Model construction): where a
classification model is constructed.
• The model describes a set of predetermined classes.
• It is built by “learning from” a training set made up of database tuples and their
associated class labels.
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• The class label attribute is categorical (or nominal) in that each value
serves as a category or class.
Classification
ii. Classification step (Model usage): where the model is used to predict
class labels for given data.
• Classifying future or unknown objects
• If the accuracy is acceptable, use the model to classify new data
Note: If the test set is used to select models, it is called a validation (test) set.
Classification
• Estimate accuracy of the model
• The known label of each test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set (otherwise overfitting)
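The two-step process above can be summarized in a few lines of code. The sketch below is illustrative only: it uses scikit-learn and its bundled iris data (neither is mentioned in these slides) to stand in for a labeled training set, and the two steps plus the accuracy estimate map directly onto the bullets above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled tuples: each observation is accompanied by a class label.
X, y = load_iris(return_X_y=True)

# Keep the test set independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Step 1 (learning / model construction): build the classifier from the training set.
model = DecisionTreeClassifier(criterion="entropy", random_state=1)
model.fit(X_train, y_train)

# Step 2 (classification / model usage): predict class labels for new data.
y_pred = model.predict(X_test)

# Accuracy rate: percentage of test-set samples correctly classified by the model.
print(f"accuracy = {accuracy_score(y_test, y_pred):.3f}")
```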
Decision Tree
• A decision tree is a flowchart-like tree structure:
• Each internal node (nonleaf node) denotes a test on an attribute
• Each branch represents an outcome of the test.
• Each leaf node (or terminal node) holds a class label.
• Internal nodes are denoted by rectangles; leaf nodes are denoted by ovals.
• Some decision tree algorithms produce only binary trees; others can produce nonbinary trees.
Decision Tree
How are decision trees used for classification?
• Given a tuple, X, for which the associated class label is unknown, the
attribute values of the tuple are tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
• Decision trees can easily be converted to classification rules (see the sketch below).
• Decision tree classifiers have good accuracy. However, successful use may depend on the data at hand.
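As an illustration of how a tuple is traced from the root to a leaf, and of how each root-to-leaf path reads as a rule, here is a hand-written sketch of the buys_computer tree used later in these slides. The subtrees under “youth” (student) and “senior” (credit_rating) are assumed from the standard textbook example; only the age split and the pure “middle_aged” branch are shown explicitly in the slides.

```python
def classify_buys_computer(age: str, student: str, credit_rating: str) -> str:
    """Trace a tuple from the root to a leaf and return the predicted class label."""
    if age == "youth":                      # internal node: test on age
        return "yes" if student == "yes" else "no"   # leaf nodes hold class labels
    elif age == "middle_aged":              # pure partition -> leaf labeled "yes"
        return "yes"
    else:                                   # age == "senior": test on credit_rating
        return "yes" if credit_rating == "fair" else "no"

# Each root-to-leaf path is one classification rule, e.g.:
# IF age = youth AND student = yes THEN buys_computer = yes
print(classify_buys_computer("youth", "yes", "fair"))   # -> yes
```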
Decision Tree
• Popular traditional Decision Tree Algorithms:
✔ ID3 (Iterative Dichotomiser)
✔ C4.5 (a successor of ID3)
✔ CART (Classification And Regression Trees)
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in
which decision trees are constructed in a top-down recursive
divide-and-conquer manner.
Decision Tree
The decision tree algorithm takes three parameters:
i. D: a data partition; initially, it is the complete set of training tuples and their
associated class labels.
ii. attribute_list: the set of candidate attributes.
iii. Attribute selection method: specifies a heuristic procedure for selecting
the attribute that “best” discriminates the given tuples according to class.
The criterion consists of a splitting attribute and, possibly, either a
split-point or splitting subset.
An attribute selection measure, such as information gain or the Gini index, is used:
Gini index: the resulting tree is strictly binary.
Information gain: allows multiway splits (i.e., two or more branches to be grown
from a node).
Decision Tree
• If the tuples in D are all of the same class, then node N becomes a leaf
and is labeled with that class (a stopping criterion).
• Otherwise, the algorithm calls the attribute selection method to determine
the splitting criterion.
Stopping criteria: Recursive partitioning stops only when any one of the
following terminating conditions is true :
✔ All the tuples in partition D (represented at node N) belong to the
same class.
✔ There are no remaining attributes on which the tuples may be further
partitioned.
✔ There are no tuples for a given branch, that is, a partition Dj is empty.
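A minimal sketch of this top-down, recursive divide-and-conquer procedure is given below, assuming discrete-valued attributes and multiway splits. The helper attribute_selection_method (e.g., an information-gain or Gini-based chooser) is a placeholder supplied by the caller.

```python
from collections import Counter

def majority_class(D):
    """Majority voting over the class labels of a partition."""
    return Counter(label for _, label in D).most_common(1)[0][0]

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """D is a list of (tuple_as_dict, class_label) pairs."""
    labels = {label for _, label in D}
    if len(labels) == 1:                    # all tuples in D belong to the same class
        return labels.pop()                 # leaf labeled with that class
    if not attribute_list:                  # no remaining attributes to split on
        return majority_class(D)            # leaf labeled with the majority class
    A = attribute_selection_method(D, attribute_list)    # "best" splitting attribute
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    for value in {x[A] for x, _ in D}:      # one branch per known value a_j of A
        Dj = [(x, label) for x, label in D if x[A] == value]
        # An empty partition Dj would instead receive a majority-class leaf of D.
        node["branches"][value] = generate_decision_tree(
            Dj, remaining, attribute_selection_method)
    return node
```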
Decision Tree
• Let A be the splitting attribute. A has v distinct values, {a1, a2, ..., av},
based on the training data.
• A is discrete-valued: In this case, the outcomes of the test at node N
correspond directly to the known values of A.
• A branch is created for each known value, aj, of A and labeled with that
value.
• A is continuous-valued: In this case, the test at node N has two possible
outcomes, corresponding to the conditions A ≤ split point and A > split
point, respectively.
Decision Tree: Attribute Selection Measures
Three popular attribute selection measures—
a) Information gain (ID3)
b) Gain ratio (C4.5)
c) Gini index (CART)
Decision Tree: Attribute Selection Measures- Information gain
Information Gain:
• ID3 uses information gain as its attribute selection measure.
• The attribute with the highest information gain is chosen as the
splitting attribute for node N.
• This attribute minimizes the information needed to classify the tuples in the
resulting partitions and reflects the least randomness or “impurity” in these partitions.
Decision Tree: Attribute Selection Measures- Information gain
Entropy (Information Theory)
• A measure of uncertainty associated with a random variable.
• High entropy -> higher uncertainty
• Lower entropy -> lower uncertainty
• The expected information (or entropy) needed to classify a tuple in D is:
    Info(D) = −∑ pi log2(pi), with the sum taken over the m classes
• where pi is the nonzero probability that an arbitrary tuple in D belongs
to class Ci and is estimated by |Ci,D|/|D|.
• A log function to the base 2 is used, because the information is encoded
in bits.
• Info(D) is just the average amount of information needed to identify the
class label of a tuple in D.
Decision Tree: Attribute Selection Measures- Information gain
• Information needed (after using A to split D into v partitions) to classify D:
    InfoA(D) = ∑ (|Dj| / |D|) × Info(Dj), with the sum taken over the v partitions
• The term |Dj| / |D| acts as the weight of the jth partition.
• InfoA(D) is the expected information required to classify a tuple from D
based on the partitioning by A.
• The smaller the expected information (still) required, the greater the
purity of the partitions.
Decision Tree: Attribute Selection Measures- Information gain
• Information gain is defined as the difference between the original
information requirement (i.e., based on just the proportion of classes)
and the new requirement (i.e., obtained after partitioning on A).
• Information gained by branching on attribute A:
    Gain(A) = Info(D) − InfoA(D)
Decision Tree: Attribute Selection Measures- Information gain
• In this example, class label attribute, buys_computer, has two distinct
values (namely, yes, no).
• There are nine tuples of class yes and five tuples of class no.
• A (root) node N is created for the tuples in D.
• To find the splitting criterion for these tuples, we first need to compute the
expected information to classify a tuple in D:
    Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits
Decision Tree: Attribute Selection Measures- Information gain
• Next, we need to compute the expected information requirement for
each attribute.
• Let’s start with the attribute age. We need to look at the distribution of
yes and no tuples for each category of age.
Decision Tree: Attribute Selection Measures- Information gain
• For the age category “youth,” there are two yes tuples and three no
tuples.
• For the category “middle aged,” there are four yes tuples and zero no
tuples.
• For the category “senior,” there are three yes tuples and two no tuples.
• Hence Infoage(D) = 0.694 bits, and Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits.
Decision Tree: Attribute Selection Measures- Information gain
• Similarly, we can compute Gain(income) =0.029 bits, Gain(student) =
0.151 bits, and Gain(credit_rating) = 0.048 bits.
• Because age has the highest information gain among the attributes, it is selected as the splitting attribute.
Decision Tree: Attribute Selection Measures- Information gain
• Note that the tuples falling into the partition for age = middle_aged all belong to the same class.
• Therefore, a leaf is created at the end of this branch and labeled “yes.”
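As a quick numeric check of this example, the sketch below recomputes Info(D), Infoage(D), and Gain(age) from the class counts given above (9 yes / 5 no overall; youth 2/3, middle_aged 4/0, senior 3/2).

```python
from math import log2

def info(counts):
    """Expected information (entropy) of a partition, from its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info_D = info([9, 5])                                       # Info(D) ~ 0.940 bits
partitions = [[2, 3], [4, 0], [3, 2]]                       # youth, middle_aged, senior
info_age = sum(sum(p) / 14 * info(p) for p in partitions)   # Info_age(D) ~ 0.694 bits
print(f"Gain(age) = {info_D - info_age:.3f} bits")          # ~ 0.246 bits
```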
Decision Tree: Attribute Selection Measures
• Computing Information-Gain for Continuous-Valued Attributes:
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
(ai+ai+1)/2 is the midpoint between the values of ai and ai+1
The point with the minimum expected information requirement for
A is selected as the split-point for A
• Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set
of tuples in D satisfying A > split-point
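A sketch of this split-point search is shown below; the attribute values and class labels used in the demo call are hypothetical.

```python
from math import log2

def info(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Return (expected information, split point) minimizing Info_A(D)."""
    pairs = sorted(zip(values, labels))                  # sort values of A ascending
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                     # equal neighbours: no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2      # midpoint (a_i + a_{i+1}) / 2
        d1 = [lab for v, lab in pairs if v <= split]     # D1: A <= split_point
        d2 = [lab for v, lab in pairs if v > split]      # D2: A > split_point
        expected = (len(d1) * info(d1) + len(d2) * info(d2)) / len(pairs)
        best = min(best, (expected, split))
    return best

# Hypothetical ages and class labels, just to exercise the search.
print(best_split_point([25, 32, 41, 47, 52], ["no", "no", "yes", "yes", "no"]))
```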
Decision Tree: Attribute Selection Measures- Gain Ratio
• The information gain measure is biased toward tests with many outcomes; that is,
it prefers to select attributes having a large number of values.
• For example, consider an attribute that acts as a unique identifier such
as product ID.
• A split on product ID would result in a large number of partitions (as
many as there are values), each one containing just one tuple.
• Because each partition is pure, the information required to classify data set D
based on this partitioning would be Infoproduct_ID(D) = 0.
• Therefore, the information gained by partitioning on this attribute is maximal,
yet such a split is useless for classification.
Decision Tree: Attribute Selection Measures- Gain Ratio
• C4.5, a successor of ID3, uses an extension to information gain known
as gain ratio, which attempts to overcome this bias.
• It normalizes information gain with a “split information” value: for each outcome,
it considers the number of tuples having that outcome with respect to the total
number of tuples in D:
    SplitInfoA(D) = −∑ (|Dj| / |D|) × log2(|Dj| / |D|), with the sum taken over the v partitions
    GainRatio(A) = Gain(A) / SplitInfoA(D)
• The attribute with the maximum gain ratio is selected as the splitting attribute.
Decision Tree: Attribute Selection Measures- Gain Ratio
• Computation of the gain ratio for the attribute income:
• We have Gain(income) = 0.029 and SplitInfoincome(D) = 1.557.
• Therefore,
    GainRatio(income) = 0.029 / 1.557 = 0.019
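The gain-ratio figures can be reproduced with the short sketch below. The partition sizes for income (4 low, 6 medium, 4 high tuples) are assumed from the AllElectronics training table and are not stated in these slides.

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum((|Dj|/|D|) * log2(|Dj|/|D|)) over the v partitions."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes)

split_info_income = split_info([4, 6, 4])        # ~ 1.557 (assumed partition sizes)
gain_income = 0.029                              # Gain(income) from the slides
print(f"GainRatio(income) = {gain_income / split_info_income:.3f}")   # ~ 0.019
```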
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index measures the impurity of a data partition or set of training tuples D:
    Gini(D) = 1 − ∑ pi², with the sum taken over the m classes
• where pi is the probability that a tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|.
• Gini index considers a binary split for each attribute.
• To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A.
• For example, if income has three possible values, namely {low, medium,
high}, then the possible subsets are {low, medium, high}, {low, medium},
{low, high}, {medium, high}, {low}, {medium}, {high}, and {}.
Decision Tree: Attribute Selection Measures- Gini Index
• We exclude the full set, {low, medium, high}, and the empty set {}
from consideration since, conceptually, they do not represent a split.
• Therefore, there are 2^v − 2 possible ways to form two partitions of the
data D, based on a binary split on A.
• When considering a binary split, we compute a weighted sum of the
impurity of each resulting partition.
• For example, if a binary split on A partitions D into D1 and D2, the Gini
index of D given that partitioning is:
    GiniA(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)
• For a discrete-valued attribute, the subset that gives the minimum Gini
index for that attribute is selected as its splitting subset.
Decision Tree: Attribute Selection Measures- Gini Index
• The reduction in impurity that would be incurred by a binary split on a
discrete- or continuous-valued attribute A is:
    ΔGini(A) = Gini(D) − GiniA(D)
• The attribute that maximizes the reduction in impurity (or, equivalently,
has the minimum Gini index) is selected as the splitting attribute.
Decision Tree: Attribute Selection Measures- Gini Index
• Using the Gini index to compute the impurity of D:
    Gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• To find the splitting criterion for the tuples in D, we need to compute
the Gini index for each attribute.
• Consider the attribute income and each of its possible splitting subsets.
• The subset {low, medium} results in 10 tuples in partition D1 satisfying the
condition “income ∈ {low, medium}.”
• The remaining four tuples of D would be assigned to partition D2.
Decision Tree: Attribute Selection Measures- Gini Index
• The Gini index value computed based on the partitioning {low, medium} is 0.443
(checked in the sketch below).
• The Gini index for the subsets {low, high} and {medium} is 0.458.
• For the subsets {medium, high} and {low} it is 0.450.
• The best binary split for attribute income is on {low, medium} (or {high})
because it minimizes the Gini index.
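A short check of these Gini values follows. The per-class counts for the income partitions (D1 = income ∈ {low, medium}: 7 yes / 3 no; D2 = {high}: 2 yes / 2 no) are assumed from the AllElectronics training table and are not stated in these slides.

```python
def gini(counts):
    """Gini impurity of a partition, from its class counts: 1 - sum(pi^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    """Weighted Gini index of D for a binary split into D1 and D2."""
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

print(f"Gini(D) = {gini([9, 5]):.3f}")                                   # ~ 0.459
print(f"Gini for income in {{low, medium}}: {gini_split([7, 3], [2, 2]):.3f}")   # ~ 0.443
```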
Decision Tree: Attribute Selection Measures- Gini Index
• Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best
split for age with a Gini index of 0.375.
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429, respectively.
Decision Tree
The three measures, in general, return good results but
• Information gain:
biased towards multivalued attributes
• Gain ratio:
tends to prefer unbalanced splits in which one partition is much
smaller than the others
• Gini index:
biased toward multivalued attributes
has difficulty when the number of classes is large
tends to favor tests that result in equal-sized partitions and purity in
both partitions
Bayes Classification
• Bayesian classifiers are statistical classifiers.
• Bayesian classifier can predict class membership probabilities such as
the probability that a given tuple belongs to a particular class.
• Bayesian classification is based on Bayes’ theorem.
• Here we consider a simple Bayesian classifier known as the naïve Bayesian classifier.
• Naïve Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
• Bayes’ theorem is useful in that it provides a way of calculating the
posterior probability, P(H|X), from P(H), P(X|H), and P(X):
    P(H|X) = P(X|H) P(H) / P(X)
Bayes Classification
• For example, X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a
computer.
• Then P(H|X) reflects the probability that customer X will buy a
computer given that we know the customer’s age and income.
Naïve Bayesian Classification
• The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
• Let D be a training set of tuples and their associated class labels. As
usual, each tuple is represented by an n-dimensional attribute vector, X
= (x1, x2,…, xn), depicting n measurements made on the tuple from n
attributes, respectively, A1, A2,…, An.
• Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the
classifier will predict that X belongs to the class having the highest
posterior probability conditioned on X. That is, the naïve Bayesian
classifier predicts that tuple X belongs to the class Ci if and only if
    P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i
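A minimal sketch of such a classifier for categorical attributes is given below, using maximum-likelihood estimates of the probabilities (no Laplace correction). The tiny training set in the demo is hypothetical and only illustrates the argmax over classes.

```python
from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    """Count class priors and per-class attribute-value frequencies."""
    priors = Counter(labels)                       # counts for P(Ci)
    cond = defaultdict(Counter)                    # counts for P(xk | Ci)
    for x, c in zip(tuples, labels):
        for attr, value in x.items():
            cond[(c, attr)][value] += 1
    return priors, cond, len(labels)

def predict(x, priors, cond, n):
    """Return the class Ci maximizing P(Ci) * prod_k P(xk | Ci)."""
    best_class, best_score = None, -1.0
    for c, class_count in priors.items():
        score = class_count / n                    # P(Ci)
        for attr, value in x.items():              # attributes assumed independent given Ci
            score *= cond[(c, attr)][value] / class_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training tuples with categorical attributes.
X = [{"age": "youth", "student": "yes"}, {"age": "youth", "student": "no"},
     {"age": "senior", "student": "no"}, {"age": "senior", "student": "yes"}]
y = ["yes", "no", "no", "yes"]
model = train_naive_bayes(X, y)
print(predict({"age": "youth", "student": "yes"}, *model))   # -> yes
```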
Summary
• Data sets are made up of data objects. A data object represents an
entity. Data objects are described by attributes. Attributes can be
nominal, binary, ordinal, or numeric.
• The values of a nominal (or categorical) attribute are symbols or names
of things, where each value represents some kind of category, code, or
state.
• Binary attributes are nominal attributes with only two possible states
(such as 1 and 0 or true and false). If the two states are equally
important, the attribute is symmetric; otherwise it is asymmetric.
• An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between
successive values is not known.
References
• Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann.
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson India.