M6: Data Mining
Model Overfitting
Source: Introduction to Data Mining, 2nd Edition
By Tan, Steinbach, Karpatne, Kumar
Classification Errors
Training errors: Errors committed on the training set
Test errors: Errors committed on the test set
Generalization errors: Expected error of a model over a random selection of records from the same distribution
[Figure: a learning algorithm induces a model from the training set (induction); the induced model is then applied to the test set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
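As a minimal sketch of this induction/deduction flow, the code below fits a decision tree to the training set above and applies it to the unseen records. scikit-learn and the manual category encodings are assumptions; the slides name no library.

```python
from sklearn.tree import DecisionTreeClassifier

# Encode the slide's attributes numerically (encoding is an assumption):
# Attrib1: No=0/Yes=1; Attrib2: Small=0/Medium=1/Large=2; Attrib3 in thousands.
a1 = {"No": 0, "Yes": 1}
a2 = {"Small": 0, "Medium": 1, "Large": 2}

train_rows = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),   ("No", "Small", 90, "Yes"),
]
X_train = [[a1[r[0]], a2[r[1]], r[2]] for r in train_rows]
y_train = [r[3] for r in train_rows]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction

print("training error:", 1 - model.score(X_train, y_train))

# Deduction: apply the induced model to unseen records (Tid 11 and 12)
print(model.predict([[a1["No"], a2["Small"], 55],
                     [a1["Yes"], a2["Medium"], 80]]))
```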
Example Data Set
Two-class problem:
+ : 5400 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 400 noisy instances added
o : 5400 instances
  • Generated from a uniform distribution

10% of the data is used for training and 90% for testing.
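A minimal sketch of generating such a dataset, assuming a unit-variance Gaussian and a [0, 20] × [0, 20] uniform region (the slide does not give these parameters):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# "+" class: 5000 Gaussian points around (10, 10) plus 400 noisy points
gauss = rng.normal(loc=10.0, scale=1.0, size=(5000, 2))   # scale is an assumption
noise = rng.uniform(0.0, 20.0, size=(400, 2))             # bounds are assumptions
X_pos = np.vstack([gauss, noise])

# "o" class: 5400 points drawn uniformly over the same region
X_neg = rng.uniform(0.0, 20.0, size=(5400, 2))

X = np.vstack([X_pos, X_neg])
y = np.array([1] * 5400 + [0] * 5400)

# 10% for training, 90% for testing, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0)
```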
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes
[Figure: the 4-node decision tree and its decision boundaries on the training data.]
Decision Tree with 50 nodes
[Figure: the 50-node decision tree and its decision boundaries on the training data.]
Which tree is better?
[Figure: decision boundaries of the 4-node tree and the 50-node tree, side by side.]
Model Underfitting and Overfitting
• As the model becomes more and more complex, test error can start increasing even though training error keeps decreasing.
Underfitting: the model is too simple, so both training and test errors are large.
Overfitting: the model is too complex, so training error is small but test error is large.
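To make this concrete, the sketch below (reusing X_tr/X_te from the data-generation sketch above) sweeps model complexity via scikit-learn's max_leaf_nodes, a stand-in for the slides' "number of nodes":

```python
from sklearn.tree import DecisionTreeClassifier

for n_leaves in (2, 4, 8, 16, 50, 100, 200):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"{n_leaves:4d} leaves  "
          f"train err = {1 - tree.score(X_tr, y_tr):.3f}  "
          f"test err = {1 - tree.score(X_te, y_te):.3f}")
# Small trees underfit (both errors high); very large trees overfit
# (training error keeps falling while test error rises).
```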
Model Overfitting – Impact of Training Data Size
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Model Overfitting – Impact of Training Data Size
[Figure: decision boundaries of two 50-node trees, trained on the original and on the doubled training set.]
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Reasons for Model Overfitting
Not enough training data
High model complexity
– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
Consider the task of predicting whether the stock market will rise or fall in each of the next 10 trading days.

Day 1   Up
Day 2   Down
Day 3   Down
Day 4   Up
Day 5   Down
Day 6   Down
Day 7   Up
Day 8   Up
Day 9   Up
Day 10  Down

Random guessing: P(correct) = 0.5

Make 10 random guesses in a row:
P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 56/1024 ≈ 0.0547
Effect of Multiple Comparison Procedure
Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the largest number of correct predictions

Probability that at least one analyst makes at least 8 correct predictions:
P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
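Both numbers are easy to verify in a few lines (Python's math.comb gives the binomial coefficients):

```python
from math import comb

# One analyst: P(at least 8 of 10 fair guesses correct)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
print(p_one)                      # 0.0546875  (~0.0547)

# Fifty independent analysts: P(at least one scores >= 8 correct)
print(1 - (1 - p_one) ** 50)      # ~0.9399
```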
Effect of Multiple Comparison Procedure
Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M′ if the improvement Δ(M, M′) > α

Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}. If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.
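A hedged sketch of this greedy loop, with the model represented as a set of components; `evaluate`, the candidate set, and `alpha` are hypothetical placeholders, not a specific algorithm from the book:

```python
def greedy_build(model, candidates, evaluate, alpha):
    """Add components gamma while Delta(M, M') = evaluate(M') - evaluate(M) > alpha."""
    improved = True
    while improved:
        improved = False
        base = evaluate(model)
        for gamma in candidates - model:
            candidate = model | {gamma}             # M' = M U {gamma}
            if evaluate(candidate) - base > alpha:  # keep M' only on real improvement
                model, improved = candidate, True
                break
    return model
```

With many candidates, some γ will clear the threshold by chance alone, which is exactly the multiple comparison effect described above.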
Effect of Multiple Comparison - Example
Use 100 additional noisy variables generated from a uniform distribution, along with X and Y, as attributes. Use 30% of the data for training and 70% for testing.

[Figure: decision boundaries using only X and Y as attributes vs. using X, Y, and the 100 noisy attributes.]
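The experiment is easy to reproduce on the synthetic data from earlier (X and y from the data-generation sketch); the noise range is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_noisy = np.hstack([X, rng.uniform(0.0, 20.0, size=(len(X), 100))])

for name, data in (("X, Y only", X), ("X, Y + 100 noisy", X_noisy)):
    Xa, Xb, ya, yb = train_test_split(data, y, train_size=0.30, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(Xa, ya)
    print(f"{name}: test error = {1 - tree.score(Xb, yb):.3f}")
```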
Notes on Overfitting
Overfitting results in decision trees that are more
complex than necessary
Training error does not provide a good estimate
of how well the tree will perform on previously
unseen records
Need ways of estimating generalization error
Model Selection
Performed during model building
Purpose is to ensure that the model is not overly
complex (to avoid overfitting)
Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
Model Selection:
Using Validation Set
Divide training data into two parts:
– Training set:
◆ use for model building
– Validation set:
◆ use for estimating generalization error
◆ Note: validation set is not the same as test set
Drawback:
– Less data available for training
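A minimal sketch of validation-set model selection, carving the validation set out of the earlier training split (the 75/25 proportion is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=0)

best_err, best_size = 1.0, None
for n_leaves in (4, 8, 16, 50, 100):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_fit, y_fit)
    val_err = 1 - tree.score(X_val, y_val)   # estimate of generalization error
    if val_err < best_err:
        best_err, best_size = val_err, n_leaves
print("selected model size:", best_size)
# The test set stays untouched until the final evaluation.
```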
Model Selection:
Incorporating Model Complexity
Rationale: Occam’s Razor
– Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model
– A complex model has a greater chance of being fitted
accidentally
– Therefore, one should include model complexity when
evaluating a model
Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees
Pessimistic error estimate of a decision tree T with k leaf nodes:

err_gen(T) = err(T) + Ω × k / N_train

– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α above), the relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records
Estimating the Complexity of Decision Trees: Example
[Figure: two decision trees built from the same 24 training records; TL has 7 leaf nodes and TR has 4 leaf nodes.]

e(TL) = 4/24, e(TR) = 6/24, Ω = 1

e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
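The estimate reduces to simple arithmetic; the helper below reproduces the slide's numbers:

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    # err_gen(T) = err(T) + Omega * k / N_train = (errors + Omega * k) / N_train
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(4, 7, 24))   # T_L: 11/24 = 0.458...
print(pessimistic_error(6, 4, 24))   # T_R: 10/24 = 0.417...
```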
Estimating the Complexity of Decision Trees
Resubstitution Estimate:
– Uses the training error as an estimate of the generalization error
– Referred to as the optimistic error estimate
e(TL) = 4/24
e(TR) = 6/24

[Figure: the same two decision trees, TL and TR, as on the previous slide.]
Minimum Description Length (MDL)
[Figure: person A has records X1, …, Xn with known class labels y; person B has the same records with unknown labels. A transmits a decision tree model (nodes A?, B?, C?) plus any corrections so that B can reconstruct the labels.]
Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
Cost(Data|Model) encodes the misclassification errors.
Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
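A hedged sketch of the MDL accounting; the specific encodings (identifying misclassified records by index, one attribute id per internal node, one bit per leaf label) are illustrative assumptions, not the book's exact scheme:

```python
from math import comb, log2

def mdl_cost(n_errors, n_records, n_internal, n_leaves, n_attributes):
    # Cost(Data|Model): bits to say which records the model misclassifies
    data_bits = log2(comb(n_records, n_errors)) if n_errors else 0.0
    # Cost(Model): each internal node encodes its splitting attribute,
    # each leaf encodes a binary class label (assumptions)
    model_bits = n_internal * log2(n_attributes) + n_leaves
    return data_bits + model_bits

# A bigger tree pays more model bits; a sloppier tree pays more data bits.
print(mdl_cost(n_errors=4, n_records=24, n_internal=6, n_leaves=7, n_attributes=3))
```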
Model Selection for Decision Trees
Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
◆ Stop if all instances belong to the same class
◆ Stop if all the attribute values are the same
– More restrictive conditions:
◆ Stop if number of instances is less than some user-specified
threshold
◆ Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
◆ Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
◆ Stop if estimated generalization error falls below certain threshold
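Several of these conditions map directly onto stopping hyperparameters of scikit-learn's tree learner; the particular thresholds below are assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,         # stop when a node has too few instances
    min_impurity_decrease=0.01,   # stop when a split barely improves impurity
    max_depth=10,                 # an additional cap on tree size
    random_state=0,
).fit(X_tr, y_tr)
print("leaves grown:", tree.get_n_leaves())
```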
Model Selection for Decision Trees
Post-pruning
– Grow the decision tree in its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a bottom-up
fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
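scikit-learn offers post-pruning as minimal cost-complexity pruning, a bottom-up scheme related to (though not identical to) the subtree replacement described above; the sketch reuses the X_fit/X_val split from the validation-set example:

```python
from sklearn.tree import DecisionTreeClassifier

full = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
alphas = full.cost_complexity_pruning_path(X_fit, y_fit).ccp_alphas

best_err, best_alpha = 1.0, 0.0
for alpha in alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    pruned.fit(X_fit, y_fit)
    err = 1 - pruned.score(X_val, y_val)   # validation error of the pruned tree
    if err < best_err:
        best_err, best_alpha = err, alpha
print("selected ccp_alpha:", best_alpha)
```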
Example of Post-Pruning
Before splitting (root node):
Class = Yes: 20, Class = No: 10, Error = 10/30
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting on A (children A1, A2, A3, A4):
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

Since 11/30 > 10.5/30, PRUNE the split!
Examples of Post-pruning
Decision Tree:

depth = 1 :
| breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 :
| | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 :
| | | | totalPages <= 6 : class 1
| | | | totalPages > 6 :
| | | | | breadth <= 1 : class 1
| | | | | breadth > 1 : class 0
| | width > 3 :
| | | MultiIP = 0:
| | | | ImagePages <= 0.1333 : class 1
| | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| | | | | breadth > 6 : class 1
| | | MultiIP = 1:
| | | | TotalTime <= 361 : class 0
| | | | TotalTime > 361 : class 1
depth > 1 :
| MultiAgent = 0:
| | depth > 2 : class 0
| | depth <= 2 :
| | | MultiIP = 1: class 0
| | | MultiIP = 0:
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
| | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1

Simplified Decision Tree (after subtree raising and subtree replacement):

depth = 1 :
| ImagePages <= 0.1333 : class 1
| ImagePages > 0.1333 :
| | breadth <= 6 : class 0
| | breadth > 6 : class 1
depth > 1 :
| MultiAgent = 0: class 0
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
Model Evaluation
Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
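A minimal k-fold cross-validation sketch using scikit-learn on the synthetic X, y from earlier:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_leaf_nodes=50, random_state=0)
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv)      # one accuracy per fold
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# Leave-one-out is the special case n_splits = len(X).
```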
Cross-validation Example
3-fold cross-validation
Variations on Cross-validation
Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
Use a nested cross-validation approach for model
selection and evaluation
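A sketch of stratified and nested cross-validation under the same assumptions; the inner loop selects the tree size, the outer loop evaluates the whole selection procedure on held-out folds:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # model selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Stratified folds preserve class proportions in every split.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 16, 50, 100]},
    cv=inner,
)
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested-CV accuracy: {scores.mean():.3f}")
```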