Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Prediction
Accuracy and error measures
Summary
Classification vs. Prediction
Classification
predicts categorical class labels (discrete or nominal)
constructs a model based on the training set and the values
(class labels) of a classifying attribute, and uses it to
classify new data
Prediction
models continuous-valued functions, i.e., predicts
unknown or missing values
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of each test sample is compared with the
model's classification result
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set, otherwise over-fitting
will occur
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
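A minimal sketch of the two steps, assuming scikit-learn is available; the toy arrays, encodings, and names below are illustrative, not from the slides:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset: features encoded as numbers, labels as 0/1,
# e.g., (rank, years) -> tenured yes/no.
X = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]
y = [0, 1, 1, 1, 0, 0]

# Step 1: model construction on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify new tuples whose class labels are unknown.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("prediction for (professor, 4 years):", model.predict([[2, 4]]))
```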
Process (1): Model Construction
Training data is fed to a classification algorithm, which produces the classifier (model).

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
Classifier
Testing
Data Unseen Data
(Jeff, Professor, 4)
NAME RANK YEARS TENURED
T om A ssistant P rof 2 no Tenured?
M erlisa A ssociate P rof 7 no
G eorge P rofessor 5 yes
Joseph A ssistant P rof 7 yes
January 27, 2015 Data Mining: Concepts and Techniques 5
Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of the training data are unknown
Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
Issues: Evaluating Classification Methods
Accuracy
classifier accuracy: how well the model predicts the class label
predictor accuracy: how close the predicted value is to the
actual value of the predicted attribute
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision
tree size or compactness of classification rules
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3.

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
Output: A Decision Tree for "buys_computer"

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  31..40 -> yes
  >40    -> credit_rating?
              excellent -> no
              fair      -> yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer
manner
At start, all the training examples are at the root
Attributes are categorical (if continuous-valued, they are
discretized in advance)
Examples are partitioned recursively based on selected attributes
Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
Conditions for stopping partitioning
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
There are no samples left
Algorithm for Decision Tree Induction

Algorithm: Generate_decision_tree. Generate a decision tree from the given training data.
Input: samples, the training samples, represented by discrete-valued attributes; attribute_list, the set of candidate attributes.
Output: A decision tree.
Method:
(1) create a node N;
(2) if samples are all of the same class C then
(3)     return N as a leaf node labelled with the class C;
(4) if attribute_list is empty then
(5)     return N as a leaf node labelled with the most common class in samples; // majority voting
(6) select test_attribute, the attribute in attribute_list with the highest information gain;
(7) label node N with test_attribute;
(8) for each known value ai of test_attribute: // partition the samples
(9)     grow a branch from node N for the condition test_attribute = ai;
(10)    let si be the set of samples in samples for which test_attribute = ai;
(11)    if si is empty then
(12)        attach a leaf labelled with the most common class in samples;
(13)    else attach the node returned by Generate_decision_tree(si, attribute_list - test_attribute).
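A compact Python rendering of the Generate_decision_tree pseudocode above; a sketch, not the book's code, assuming samples are (attribute-dict, label) pairs:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(samples, attr):
    labels = [c for _, c in samples]
    gain = entropy(labels)
    for value in {s[attr] for s, _ in samples}:
        subset = [c for s, c in samples if s[attr] == value]
        gain -= (len(subset) / len(samples)) * entropy(subset)
    return gain

def generate_decision_tree(samples, attribute_list):
    labels = [c for _, c in samples]
    if len(set(labels)) == 1:                        # steps (2)-(3)
        return labels[0]
    if not attribute_list:                           # steps (4)-(5): majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attribute_list, key=lambda a: info_gain(samples, a))  # step (6)
    node = {best: {}}                                # step (7)
    for value in {s[best] for s, _ in samples}:      # steps (8)-(10)
        s_i = [(s, c) for s, c in samples if s[best] == value]
        # steps (11)-(13); s_i is never empty here because branch values
        # are drawn from the samples themselves
        node[best][value] = generate_decision_tree(
            s_i, [a for a in attribute_list if a != best])
    return node
```

Run on the 14-tuple buys_computer table, this splits on age at the root, matching the tree shown earlier.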
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain.
Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$.
Expected information (entropy) needed to classify a tuple in D:
    $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Information needed (after using A to split D into v partitions) to classify D:
    $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
Information gained by branching on attribute A:
    $Gain(A) = Info(D) - Info_A(D)$
Attribute Selection: Information Gain

Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples), using the training dataset shown earlier.

$Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

age     pi   ni   I(pi, ni)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

$Gain(age) = Info(D) - Info_{age}(D) = 0.246$
$Gain(income) = 0.029$
$Gain(student) = 0.151$
$Gain(credit\_rating) = 0.048$

Age yields the highest information gain, so it becomes the splitting attribute. (A worked check follows.)
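Verifying the worked numbers on this slide; a small helper computes the entropy I(p, n) of a two-class node from its yes/no counts:

```python
from math import log2

def I(p, n):
    total = p + n
    return -sum(x / total * log2(x / total) for x in (p, n) if x)

info_D = I(9, 5)                                              # 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # 0.694
print(round(info_D, 3), round(info_age, 3), round(info_D - info_age, 3))
# -> 0.94 0.694 0.247
# (the slides round intermediates first: 0.940 - 0.694 = 0.246)
```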
Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index is $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in D.
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
    $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
    $gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$
The splits on {low, high} and {medium, high} give 0.458 and 0.450, so {low, medium} (and {high}) is the best binary split for income, since it minimizes the Gini index (see the check below).
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
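Checking the Gini numbers; gini() takes the class counts of a partition, and the counts for D1 (7 yes, 3 no) and D2 (2 yes, 2 no) are taken from the training table:

```python
def gini(*counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

gini_D = gini(9, 5)                                # 0.459
# income split {low, medium} vs {high}
gini_split = 10/14 * gini(7, 3) + 4/14 * gini(2, 2)
print(round(gini_D, 3), round(gini_split, 3))      # -> 0.459 0.443
```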
Overfitting and Tree Pruning

Overfitting: An induced tree may overfit the training data
    Too many branches, some may reflect anomalies due to noise or outliers
    The resulting decision trees are more complex than necessary
    Poor accuracy for unseen samples
Two approaches to avoid overfitting
    Prepruning: Halt tree construction early—do not split a node if this
    would result in the goodness measure falling below a threshold
        Difficult to choose an appropriate threshold
Post pruning
– Trim the nodes of the decision tree in a
bottom-up fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples
and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand
classification rules
can use SQL queries for accessing databases
comparable classification accuracy with other methods
Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ
    builds an index for each attribute; only the class list and the current attribute list reside in memory
    handles disk-resident data sets using disk-resident attribute lists and a memory-resident class list
    memory restrictions remain when the training set is too large
    performance degrades when the class list becomes too large
SPRINT
    constructs an attribute-list data structure
    removes all memory restrictions
    designed to be easily parallelized
Scalable Decision Tree Induction Methods in Data Mining Studies (cont.)

PUBLIC
    integrates tree splitting and tree pruning: stops growing the tree earlier
RainForest
    separates the scalability aspects from the criteria that determine the quality of the tree
    builds an AVC-list (attribute, value, class label)
    RainForest reports a speedup over SPRINT
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction,
i.e., predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree
and selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct — prior knowledge can be combined with observed
data
Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard
of optimal decision making against which other methods
can be measured
Bayes' Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), the probability that
the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing
the sample X, given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
    $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
Informally, this can be written as
    posterior = likelihood × prior / evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
Towards Naïve Bayesian Classifier

Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posterior probability, i.e., the maximal P(Ci|X)
This can be derived from Bayes' theorem:
    $P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
Since P(X) is constant for all classes, only
    $P(C_i|X) \propto P(X|C_i)\,P(C_i)$
needs to be maximized
Derivation of Naïve Bayes Classifier

A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
    $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
This greatly reduces the computation cost: only counts the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
    $g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
and P(xk|Ci) is
    $P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$
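A direct transcription of the Gaussian density used for continuous attributes; the class statistics in the call are illustrative (mu and sigma would be estimated per class from the training data):

```python
from math import exp, pi, sqrt

def g(x, mu, sigma):
    # Gaussian density g(x, mu, sigma) from the formula above
    return (1 / (sqrt(2 * pi) * sigma)) * exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# e.g., P(age = 35 | Ci) with placeholder class statistics mu=38, sigma=12
print(g(35, 38, 12))
```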
Naïve Bayesian Classifier: Training Dataset

Classes:
    C1: buys_computer = 'yes'
    C2: buys_computer = 'no'
Training data: the 14-tuple buys_computer table shown earlier.
Data sample to classify:
    X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14= 0.357
Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
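Reproducing the worked example; the conditional probabilities are the counts from the 14-tuple training table shown earlier:

```python
p_yes, p_no = 9/14, 5/14

# P(xk | Ci) for X = (age<=30, income=medium, student=yes, credit=fair)
likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)     # ~0.044
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)     # ~0.019
print(round(likelihood_yes * p_yes, 3))            # -> 0.028
print(round(likelihood_no * p_no, 3))              # -> 0.007  => classify as "yes"
```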
Example - 2

Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N

An unseen sample: X = <rain, hot, high, false>
Play-tennis example: estimating P(xi|C)

P(p) = 9/14, P(n) = 5/14

outlook:      P(sunny|p) = 2/9      P(sunny|n) = 3/5
              P(overcast|p) = 4/9   P(overcast|n) = 0
              P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9        P(hot|n) = 2/5
              P(mild|p) = 4/9       P(mild|n) = 2/5
              P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:     P(high|p) = 3/9       P(high|n) = 4/5
              P(normal|p) = 6/9     P(normal|n) = 2/5
windy:        P(true|p) = 3/9       P(true|n) = 3/5
              P(false|p) = 6/9      P(false|n) = 2/5

(Estimated from the play-tennis table on the previous slide.)
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Good results obtained in most of the cases
Disadvantages
Assumption: class conditional independence, therefore
loss of accuracy
Practically, dependencies exist among variables
E.g., hospital patients: profile (age, family history, etc.),
symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
Dependencies among these cannot be modeled by the Naïve
Bayesian Classifier
How to deal with these dependencies?
Bayesian Belief Networks
Bayesian Belief Networks

A Bayesian belief network allows conditional independencies among subsets of the variables
A graphical model of causal relationships
    Represents dependency among the variables
    Gives a specification of the joint probability distribution
Nodes: random variables; Links: dependency
Example: in the graph X -> Z <- Y -> P, X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
The graph has no loops or cycles (it is a DAG)
Bayesian Belief Network: An Example
Family The conditional probability table
Smoker
History (CPT) for variable LungCancer:
(FH, S) (FH, ~S) (~FH, S) (~FH, ~S)
LC 0.8 0.5 0.7 0.1
~LC 0.2 0.5 0.3 0.9
LungCancer Emphysema
CPT shows the conditional probability for each
possible combination of its parents. The CPT for
a variable Z specifies the conditional distribution
P(Z/Parents(Z)).
P(Lungcancer=“yes” | FamilyHistory = “yes” ,
PositiveXRay Dyspnea smoker=“yes”)=0.8
Bayesian Belief Networks Derivation of the probability of a particular
combination of values of X, from CPT:
n
P( x1 ,..., xn ) P( xi | Parents( X i ))
January 27, 2015 i 1 45
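A sketch of how a belief network evaluates a joint assignment: each node carries a CPT indexed by its parents' values. Only the LungCancer CPT is from the slide; the priors here are made-up placeholders:

```python
cpt_lung_cancer = {          # P(LC = yes | FH, S), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}
p_fh = 0.10                  # placeholder prior P(FamilyHistory = yes)
p_s = 0.30                   # placeholder prior P(Smoker = yes)

def joint(fh, s, lc):
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S): the product over parents."""
    p_lc = cpt_lung_cancer[(fh, s)]
    return ((p_fh if fh else 1 - p_fh)
            * (p_s if s else 1 - p_s)
            * (p_lc if lc else 1 - p_lc))

print(joint(True, True, True))   # 0.10 * 0.30 * 0.8 = 0.024
```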
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Prediction
Accuracy and error measures
Summary
What Is Prediction?
(Numerical) prediction is similar to classification
construct a model
use model to predict continuous or ordered value for a given input
Prediction is different from classification
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction: regression
model the relationship between one or more independent or
predictor variables and a dependent or response variable
Regression analysis
Linear and multiple regression
Non-linear regression
Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
Linear Regression

Linear regression: involves a response variable y and a single predictor variable x:
    y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line:
    $w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$
Multiple linear regression: involves more than one predictor variable
    Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
    Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
    Solvable by extension of the least squares method
    Many nonlinear functions can be transformed into the above
Regression - Example
Table shows a set of X Y
paired data where X is Years Salary (in $
Experience 1000s)
the number of years of 3 30
work experience of a 8 57
college graduate and y 9 64
is the corresponding 13 72
salary of the graduate. 3 36
6 43
Y = 23.6 + 3.5X 11 59
Predict the salary for a 21 90
graduate with 10 yrs of 1 20
experience. 16 83
Y = 58.6$
January 27, 2015 Data Mining: Concepts and Techniques 49
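Fitting the table above with the least-squares formulas from the previous slide; the data are exactly the ten (X, Y) pairs shown:

```python
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar
print(round(w0, 1), round(w1, 1))   # -> 23.2 3.5
# (the slide's 23.6 comes from rounding w1 to 3.5 before computing w0)
print(round(w0 + w1 * 10, 1))       # -> 58.6, salary in $1000s for 10 years
```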
Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
    y = w0 + w1 x + w2 x^2 + w3 x^3
is convertible to linear form with the new variables x2 = x^2, x3 = x^3:
    y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed into a linear model
Some models are intractably nonlinear (e.g., a sum of exponential terms)
    It is still possible to obtain least squares estimates through extensive calculation on more complex formulae
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Prediction
Accuracy and error measures
Summary
Evaluating the Accuracy of a Classifier or Predictor (I)

Holdout method
    Given data is randomly partitioned into two independent sets
        Training set (e.g., 2/3) for model construction
        Test set (e.g., 1/3) for accuracy estimation
    [Figure: the data is split into a training set, from which the classifier is derived, and a test set, on which its accuracy is estimated.]
Random sampling: a variation of holdout
    Repeat holdout k times; accuracy = avg. of the accuracies obtained
Evaluating the Accuracy of a Classifier or Predictor (II)

Cross-validation (k-fold, where k = 10 is most popular)
    Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size
    At the i-th iteration, use Di as the test set and the others as the training set
    Accuracy estimate = (overall number of correct classifications from the k iterations) / (total number of samples in the initial data)
Leave-one-out: k folds where k = # of tuples, for small-sized data
Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
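A minimal k-fold cross-validation loop, assuming scikit-learn; the iris data and decision tree are placeholders for any dataset and learner:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
correct, total = 0, 0
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    correct += (model.predict(X[test_idx]) == y[test_idx]).sum()
    total += len(test_idx)
# accuracy estimate = overall correct classifications / total samples
print("accuracy estimate:", correct / total)
```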
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, …, Mk,
with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of
classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation

Analogy: Diagnosis based on multiple doctors' majority vote
Training
    Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
    A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
    Each classifier Mi returns its class prediction
    The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
    Often significantly better than a single classifier derived from D
    For noisy data: not considerably worse, more robust
    Proven improved accuracy in prediction
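A bare-bones bagging sketch: bootstrap-sample D, train one tree per sample, and take a majority vote; scikit-learn and its iris data are assumed stand-ins:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
models = []
for i in range(25):                        # k = 25 bootstrap rounds
    idx = rng.integers(0, len(X), len(X))  # sample d tuples with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = [int(m.predict([x])[0]) for m in models]
    return Counter(votes).most_common(1)[0][0]   # class with the most votes

print(bagged_predict(X[0]), y[0])
```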
Boosting
Analogy: Consult several doctors, based on a combination of weighted
diagnoses—weight assigned based on the previous diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the
subsequent classifier, Mi+1, to pay more attention to the training
tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier, where
the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of
continuous values
Comparing with bagging: boosting tends to achieve greater accuracy,
but it also risks overfitting the model to misclassified data
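The reweighting scheme described above is what AdaBoost implements; a usage sketch with scikit-learn's AdaBoostClassifier, using iris data as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
# Each round reweights the training tuples toward those misclassified so far;
# the final vote weights each classifier by its accuracy.
clf = AdaBoostClassifier(n_estimators=50)
print(cross_val_score(clf, X, y, cv=10).mean())
```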
Classifier Accuracy Measures and Confusion Matrix

t_pos: true positives (e.g., "cancer" samples that were correctly classified as such)
t_neg: true negatives ("not_cancer" samples that were correctly classified as such)
f_pos: false positives ("not_cancer" samples that were incorrectly labeled as "cancer")
f_neg: false negatives ("cancer" samples that were incorrectly labeled as "not_cancer")
pos is the number of positive samples; neg is the number of negative samples

Actual \ Predicted   C1      C2
C1                   t_pos   f_neg
C2                   f_pos   t_neg
Classifier Accuracy Measures
classes buy_computer = yes buy_computer = no total recognition(%)
buy_computer = yes 6954 46 7000 99.34
buy_computer = no 412 2588 3000 86.27
total 7366 2634 10000 95.52
Accuracy of a classifier M, acc(M): percentage of test set tuples that are
correctly classified by the model M
Error rate (misclassification rate) of M = 1 – acc(M)
Given m classes, CMi,j, an entry in a confusion matrix, indicates #
of tuples in class i that are labeled by the classifier as class j
Alternative accuracy measures (e.g., for cancer diagnosis)
sensitivity = t-pos/pos /* true positive recognition rate */
specificity = t-neg/neg /* true negative recognition rate */
precision = t-pos/(t-pos + f-pos)
accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
This model can also be used for cost-benefit analysis
January 27, 2015 Data Mining: Concepts and Techniques 59
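Computing the alternative measures from the confusion matrix above:

```python
t_pos, f_neg = 6954, 46     # actual "yes" row
f_pos, t_neg = 412, 2588    # actual "no" row
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                 # true positive recognition rate
specificity = t_neg / neg                 # true negative recognition rate
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(round(sensitivity, 4), round(specificity, 4),
      round(precision, 4), round(accuracy, 4))
# -> 0.9934 0.8627 0.9441 0.9542
```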
Predictor Error Measures

Measure predictor accuracy: measure how far off the predicted value is from the actual known value
Loss function: measures the error between yi and the predicted value yi'
    Absolute error: $|y_i - y_i'|$
    Squared error: $(y_i - y_i')^2$
Test error (generalization error): the average loss over the test set
    Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|$
    Mean squared error: $\frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2$
    Relative absolute error: $\frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$
    Relative squared error: $\frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$
The mean squared error exaggerates the presence of outliers
Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
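The four loss measures above as straightforward functions; y and y_pred in the example call are illustrative values:

```python
def mae(y, y_pred):
    return sum(abs(a - b) for a, b in zip(y, y_pred)) / len(y)

def mse(y, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y, y_pred)) / len(y)

def rae(y, y_pred):   # relative absolute error
    y_bar = sum(y) / len(y)
    return (sum(abs(a - b) for a, b in zip(y, y_pred))
            / sum(abs(a - y_bar) for a in y))

def rse(y, y_pred):   # relative squared error
    y_bar = sum(y) / len(y)
    return (sum((a - b) ** 2 for a, b in zip(y, y_pred))
            / sum((a - y_bar) ** 2 for a in y))

y, y_pred = [3.0, 5.0, 8.0], [2.5, 5.5, 9.0]
print(mae(y, y_pred), mse(y, y_pred) ** 0.5)   # MAE and RMSE
```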