Evaluation
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Overfitting due to Noise
The decision boundary is distorted by a noise point
Overfitting due to Insufficient Examples
Lack of data points in the lower half of the diagram makes it difficult to correctly
predict the class labels of that region
– An insufficient number of training records in the region causes the decision tree
to predict the test examples using other training records that are irrelevant to
the classification task
Underfitting and Overfitting (Example)
Two-class problem:
+ class: 5200 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 200 noisy instances added
o class: 5200 instances
  • Generated from a uniform distribution
10% of the data used for training and 90% used for testing
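A minimal sketch of how a synthetic data set like this could be generated and split, assuming NumPy and scikit-learn; the spread of the Gaussian, the range of the uniform distribution, and the form of the noise are assumptions, not taken from the original figures.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# '+' class: 5000 Gaussian instances centered at (10, 10) plus 200 noisy instances
# (unit variance and the noise range are assumptions)
X_plus = np.vstack([
    rng.normal(loc=10.0, scale=1.0, size=(5000, 2)),
    rng.uniform(low=0.0, high=20.0, size=(200, 2)),
])

# 'o' class: 5200 instances from a uniform distribution (range assumed)
X_o = rng.uniform(low=0.0, high=20.0, size=(5200, 2))

X = np.vstack([X_plus, X_o])
y = np.array([1] * len(X_plus) + [0] * len(X_o))

# 10% of the data used for training, 90% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)
```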
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes vs. Decision Tree with 50 nodes
(Figures: each decision tree and its decision boundaries on the training data)
Which tree is better?
Model Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large
Model Overfitting
Using twice the number of data instances:
• If the training data is under-representative, test error increases while training
  error decreases as the number of nodes increases
• Increasing the size of the training data reduces the difference between training
  and test errors at a given number of nodes
Model Overfitting
(Figures: decision boundaries of two 50-node decision trees, one trained on the original data and one using twice the number of data instances)
Using twice the number of data instances
• If training data is under-representative, testing errors increase and training
errors decrease on increasing number of nodes
• Increasing the size of training data reduces the difference between training and
testing errors at a given number of nodes
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the tree will
perform on previously unseen records
• Need new ways for estimating errors
How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
– More restrictive conditions:
  • Stop if the number of instances is less than some user-specified threshold
  • Stop if the class distribution of the instances is independent of the available
    features (e.g., using the χ2 test)
  • Stop if expanding the current node does not improve impurity measures
    (e.g., Gini index or information gain)
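As a hedged sketch, scikit-learn exposes similar stopping conditions as hyperparameters of DecisionTreeClassifier; the thresholds below are assumptions, and X_train/y_train come from the earlier data-generation sketch.

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing the tree early via stopping conditions.
#   min_samples_split     ~ "stop if the number of instances is below a threshold"
#   min_impurity_decrease ~ "stop if expanding does not improve the impurity measure"
# (scikit-learn has no built-in chi-square stopping rule; thresholds below are assumptions)
pre_pruned = DecisionTreeClassifier(
    criterion="gini",
    min_samples_split=20,
    min_impurity_decrease=1e-3,
    random_state=0,
)
pre_pruned.fit(X_train, y_train)
print("number of nodes:", pre_pruned.tree_.node_count)
```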
Model Selection for Decision Trees
• Post-pruning
– Grow the decision tree in its entirety
– Subtree replacement
• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error improves after trimming, replace sub-tree
by a leaf node
• Class label of leaf node is determined from majority class of
instances in the sub-tree
– Subtree raising
• Replace subtree with most frequently used branch
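scikit-learn does not implement subtree replacement or subtree raising directly; as a rough stand-in, the sketch below grows a full tree and then applies minimal cost-complexity pruning, which also collapses subtrees bottom-up. X_train/X_test come from the earlier sketch; using the test split to pick the pruning level is a simplification, and a separate validation set would normally be used.

```python
from sklearn.tree import DecisionTreeClassifier

# Grow the tree to its entirety, then trim it; minimal cost-complexity pruning is used
# here as a stand-in for subtree replacement (larger ccp_alpha collapses more subtrees).
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Pick the pruning level with the best held-out accuracy
# (a separate validation set would normally be used instead of the test split).
best_tree, best_score = full_tree, full_tree.score(X_test, y_test)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_tree, best_score = pruned, score
```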
Examples of Post-pruning
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than how fast it can classify or build models, its scalability, etc.
• Confusion Matrix (rows = actual class, columns = predicted class):

                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes          a (TP)                b (FN)
    Actual Class=No           c (FP)                d (TN)

  a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes          a (TP)                b (FN)
    Actual Class=No           c (FP)                d (TN)

• Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
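A small sketch computing the four counts and accuracy from predictions with scikit-learn's confusion_matrix, reusing the pruned tree and test split from the earlier sketches; the label ordering is chosen so the matrix matches the layout above (class 1 taken as "Yes").

```python
from sklearn.metrics import confusion_matrix

y_pred = best_tree.predict(X_test)  # pruned tree from the previous sketch

# With labels=[1, 0] the rows/columns follow the slide layout:
# rows = actual (Yes, No), columns = predicted (Yes, No)
(a, b), (c, d) = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn, fp, tn = a, b, c, d

accuracy = (tp + tn) / (tp + tn + fp + fn)
print("accuracy:", accuracy)
```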
Limitation of Accuracy
• Consider a 2-class problem
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
– Accuracy is misleading because model does not detect any class 1
example
Performance metrics
                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes          a (TP)                b (FN)
    Actual Class=No           c (FP)                d (TN)
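From the same four counts, rate-based metrics can be derived; the sketch below computes the true positive and false positive rates used in the ROC discussion later, plus precision, recall, and F1 as assumed, commonly used additions (reusing a, b, c, d from the previous sketch).

```python
# Rate-based metrics from the counts a (TP), b (FN), c (FP), d (TN) above.
tpr = a / (a + b)          # true positive rate = sensitivity = recall
fpr = c / (c + d)          # false positive rate = 1 - specificity
precision = a / (a + c)    # assumed addition: fraction of predicted positives that are correct
recall = tpr
f1 = 2 * precision * recall / (precision + recall)  # assumed addition: harmonic mean
```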
Cost Matrix
                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes        C(Yes|Yes)            C(No|Yes)
    Actual Class=No         C(Yes|No)             C(No|No)

C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification
Cost matrix C(i|j):
                 Predicted +   Predicted -
    Actual +         -1            100
    Actual -          1              0

Model M1:
                 Predicted +   Predicted -
    Actual +        150            40
    Actual -         60           250
    Accuracy = 80%, Cost = 3910

Model M2:
                 Predicted +   Predicted -
    Actual +        250            45
    Actual -          5           200
    Accuracy = 90%, Cost = 4255
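A short sketch reproducing the cost computation above: the element-wise product of the cost matrix with each model's confusion counts, summed, gives 3910 for M1 and 4255 for M2.

```python
import numpy as np

# Cost matrix C(i|j): rows = actual class (+, -), columns = predicted class (+, -)
C = np.array([[-1, 100],
              [ 1,   0]])

# Confusion counts for the two models (same row/column convention)
M1 = np.array([[150,  40],
               [ 60, 250]])
M2 = np.array([[250,  45],
               [  5, 200]])

cost_m1 = (C * M1).sum()           # -1*150 + 100*40 + 1*60 + 0*250 = 3910
cost_m2 = (C * M2).sum()           # -1*250 + 100*45 + 1*5  + 0*200 = 4255
acc_m1 = np.trace(M1) / M1.sum()   # (150 + 250) / 500 = 0.80
acc_m2 = np.trace(M2) / M2.sum()   # (250 + 200) / 500 = 0.90
```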
Cost vs Accuracy
Count matrix (N = a + b + c + d):
                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes           a                     b
    Actual Class=No            c                     d

Cost matrix:
                       Predicted Class=Yes   Predicted Class=No
    Actual Class=Yes           p                     q
    Actual Class=No            q                     p

Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p

  Accuracy = (a + d) / N
  Cost     = p (a + d) + q (b + c)
           = p (a + d) + q (N - a - d)
           = q N - (q - p)(a + d)
           = N [q - (q - p) Accuracy]
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among competing models?
Model Evaluation
• Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
• Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining
one
– Leave-one-out: k=n
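The holdout and random-subsampling procedures above could be sketched with scikit-learn as follows (reusing X and y from the earlier synthetic-data sketch; the 70/30 split and 10 repetitions are assumptions).

```python
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Holdout: reserve k% for training and (100-k)% for testing (k = 70 assumed here)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
holdout_acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)

# Random subsampling: repeat the holdout several times and average the estimates
splits = ShuffleSplit(n_splits=10, train_size=0.7, random_state=0)
repeated_acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=splits)
print(holdout_acc, repeated_acc.mean())
```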
Cross-validation Example
• 3-fold cross-validation
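A minimal 3-fold cross-validation sketch, again assuming scikit-learn and the X, y arrays from earlier; each of the three folds is held out once for testing while the model is trained on the other two.

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 3-fold cross-validation: partition the data into 3 disjoint subsets,
# train on 2 of them and test on the remaining one, rotating the held-out fold.
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("fold accuracies:", scores, "mean:", scores.mean())
```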
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among competing models?
ROC curves
• ROC = Receiver Operating Characteristic
• Started in electronic signal detection theory (1940s-1950s)
• Has become very popular in machine learning applications to assess classifiers
(Figures: distributions of the test result for the two classes; a decision threshold splits the results into "negatives" and "positives", and moving the threshold trades off true positives, false positives, true negatives, and false negatives)
ROC curve
(Figure: ROC curve plotting the true positive rate (sensitivity) against the false positive rate (1 - specificity), each running from 0% to 100%)
A good test vs. a poor test:
(Figures: ROC curves for a good test and a poor test, true positive rate against false positive rate)
Best test vs. worst test:
(Figures: for the best test the two class distributions of the test result don't overlap at all; for the worst test the distributions overlap completely)
Area under ROC curve (AUC)
• Overall measure of test performance
• Comparisons between two tests can be based on the difference between their (estimated) AUCs
• For continuous data, the AUC is equivalent to the Mann-Whitney U-statistic
  (a nonparametric test of the difference in location between two populations)
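A sketch computing the ROC curve and its AUC from classifier scores with scikit-learn (reusing the pruned tree and test split from earlier; treating class 1 as the positive class is an assumption).

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Score = predicted probability of the positive class (class 1 assumed positive)
scores = best_tree.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)               # area under the ROC curve
print("AUC:", auc)
```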
(Figures: example ROC curves with AUC = 100%, 90%, 65%, and 50%)
RECEIVER OPERATING CHARACTERISTIC
(Figures)
Significance Tests
• t-test to compare the performance estimates of two models
• ANOVA to compare the performance estimates of two or more models
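A hedged sketch of both tests on per-fold accuracies using SciPy (the models compared and the 10-fold setup are assumptions; the resampled paired t-test is known to be optimistic, so this is illustrative only).

```python
from scipy import stats
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

cv = KFold(n_splits=10, shuffle=True, random_state=0)
acc_a = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=cv)
acc_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# Paired t-test on the per-fold accuracies of the two models
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)

# One-way ANOVA generalizes the comparison to two or more sets of performances
f_stat, p_anova = stats.f_oneway(acc_a, acc_b)
print(p_value, p_anova)
```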