M6: Data Mining
Model Overfitting
Source: Introduction to Data Mining, 2nd Edition
By Tan, Steinbach, Karpatne, Kumar
Classification Errors
Training errors: Errors committed on the training set
Test errors: Errors committed on the test set
Generalization errors: Expected error of a model over a random selection of records from the same distribution
[Figure: a learning algorithm induces a model from the training set (induction); the induced model is then applied to the test set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
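As a minimal sketch of this induction/deduction flow, the code below fits a decision tree to the training set above and applies it to the unseen records. scikit-learn and the manual category encodings are assumptions; the slides name no library.

```python
from sklearn.tree import DecisionTreeClassifier

# Encode the slide's attributes numerically (encoding is an assumption):
# Attrib1: No=0/Yes=1; Attrib2: Small=0/Medium=1/Large=2; Attrib3 in thousands.
a1 = {"No": 0, "Yes": 1}
a2 = {"Small": 0, "Medium": 1, "Large": 2}

train_rows = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),   ("No", "Small", 90, "Yes"),
]
X_train = [[a1[r[0]], a2[r[1]], r[2]] for r in train_rows]
y_train = [r[3] for r in train_rows]

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # induction

print("training error:", 1 - model.score(X_train, y_train))

# Deduction: apply the induced model to unseen records (Tid 11 and 12)
print(model.predict([[a1["No"], a2["Small"], 55],
                     [a1["Yes"], a2["Medium"], 80]]))
```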
Example Data Set
Two-class problem:
+ : 5400 instances
  • 5000 instances generated from a Gaussian centered at (10,10)
  • 400 noisy instances added
o : 5400 instances
  • Generated from a uniform distribution

10% of the data is used for training and 90% for testing.
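A minimal sketch of generating such a dataset, assuming a unit-variance Gaussian and a [0, 20] × [0, 20] uniform region (the slide does not give these parameters):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# "+" class: 5000 Gaussian points around (10, 10) plus 400 noisy points
gauss = rng.normal(loc=10.0, scale=1.0, size=(5000, 2))   # scale is an assumption
noise = rng.uniform(0.0, 20.0, size=(400, 2))             # bounds are assumptions
X_pos = np.vstack([gauss, noise])

# "o" class: 5400 points drawn uniformly over the same region
X_neg = rng.uniform(0.0, 20.0, size=(5400, 2))

X = np.vstack([X_pos, X_neg])
y = np.array([1] * 5400 + [0] * 5400)

# 10% for training, 90% for testing, as on the slide
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.10, stratify=y, random_state=0)
```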
Increasing number of nodes in Decision Trees
Decision Tree with 4 nodes
[Figure: the 4-node decision tree and its decision boundaries on the training data.]
Decision Tree with 50 nodes
[Figure: the 50-node decision tree and its decision boundaries on the training data.]
Which tree is better?
[Figure: decision boundaries of the 4-node tree and the 50-node tree, side by side.]
Model Underfitting and Overfitting
• As the model becomes more and more complex, test error can start increasing even though training error keeps decreasing.
Underfitting: the model is too simple, so both training and test errors are large.
Overfitting: the model is too complex, so training error is small but test error is large.
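To make this concrete, the sketch below (reusing X_tr/X_te from the data-generation sketch above) sweeps model complexity via scikit-learn's max_leaf_nodes, a stand-in for the slides' "number of nodes":

```python
from sklearn.tree import DecisionTreeClassifier

for n_leaves in (2, 4, 8, 16, 50, 100, 200):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"{n_leaves:4d} leaves  "
          f"train err = {1 - tree.score(X_tr, y_tr):.3f}  "
          f"test err = {1 - tree.score(X_te, y_te):.3f}")
# Small trees underfit (both errors high); very large trees overfit
# (training error keeps falling while test error rises).
```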
Model Overfitting – Impact of Training Data Size
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Model Overfitting – Impact of Training Data Size
[Figure: decision boundaries of two 50-node trees, trained on the original and on the doubled training set.]
Using twice the number of data instances:
• Increasing the size of the training data reduces the difference between training and testing errors at a given model size.
Reasons for Model Overfitting
Not enough training data
High model complexity
– Multiple Comparison Procedure
Effect of Multiple Comparison Procedure
Consider the task of predicting whether the stock market will rise or fall in each of the next 10 trading days.

Day 1   Up
Day 2   Down
Day 3   Down
Day 4   Up
Day 5   Down
Day 6   Down
Day 7   Up
Day 8   Up
Day 9   Up
Day 10  Down

Random guessing: P(correct) = 0.5

Make 10 random guesses in a row:
P(# correct ≥ 8) = [C(10,8) + C(10,9) + C(10,10)] / 2^10 = 56/1024 ≈ 0.0547
Effect of Multiple Comparison Procedure
Approach:
– Get 50 analysts
– Each analyst makes 10 random guesses
– Choose the analyst that makes the largest number of correct predictions

Probability that at least one analyst makes at least 8 correct predictions:
P(# correct ≥ 8) = 1 − (1 − 0.0547)^50 = 0.9399
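Both numbers are easy to verify in a few lines (Python's math.comb gives the binomial coefficients):

```python
from math import comb

# One analyst: P(at least 8 of 10 fair guesses correct)
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10
print(p_one)                      # 0.0546875  (~0.0547)

# Fifty independent analysts: P(at least one scores >= 8 correct)
print(1 - (1 - p_one) ** 50)      # ~0.9399
```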
Effect of Multiple Comparison Procedure
Many algorithms employ the following greedy strategy:
– Initial model: M
– Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
– Keep M′ if the improvement Δ(M, M′) > α

Often, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}. If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.
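A hedged sketch of this greedy loop, with the model represented as a set of components; `evaluate`, the candidate set, and `alpha` are hypothetical placeholders, not a specific algorithm from the book:

```python
def greedy_build(model, candidates, evaluate, alpha):
    """Add components gamma while Delta(M, M') = evaluate(M') - evaluate(M) > alpha."""
    improved = True
    while improved:
        improved = False
        base = evaluate(model)
        for gamma in candidates - model:
            candidate = model | {gamma}             # M' = M U {gamma}
            if evaluate(candidate) - base > alpha:  # keep M' only on real improvement
                model, improved = candidate, True
                break
    return model
```

With many candidates, some γ will clear the threshold by chance alone, which is exactly the multiple comparison effect described above.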
Effect of Multiple Comparison - Example
Use 100 additional noisy variables generated from a uniform distribution, along with X and Y, as attributes. Use 30% of the data for training and 70% for testing.

[Figure: decision boundaries using only X and Y as attributes vs. using X, Y, and the 100 noisy attributes.]
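The experiment is easy to reproduce on the synthetic data from earlier (X and y from the data-generation sketch); the noise range is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X_noisy = np.hstack([X, rng.uniform(0.0, 20.0, size=(len(X), 100))])

for name, data in (("X, Y only", X), ("X, Y + 100 noisy", X_noisy)):
    Xa, Xb, ya, yb = train_test_split(data, y, train_size=0.30, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(Xa, ya)
    print(f"{name}: test error = {1 - tree.score(Xb, yb):.3f}")
```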
Notes on Overfitting
Overfitting results in decision trees that are more
complex than necessary
Training error does not provide a good estimate
of how well the tree will perform on previously
unseen records
Need ways of estimating generalization error
Model Selection
Performed during model building
Purpose is to ensure that the model is not overly
complex (to avoid overfitting)
Need to estimate generalization error
– Using Validation Set
– Incorporating Model Complexity
Model Selection:
Using Validation Set
Divide training data into two parts:
– Training set:
◆ use for model building
– Validation set:
◆ use for estimating generalization error
◆ Note: validation set is not the same as test set
Drawback:
– Less data available for training
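A minimal sketch of validation-set model selection, carving the validation set out of the earlier training split (the 75/25 proportion is an assumption):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_fit, X_val, y_fit, y_val = train_test_split(
    X_tr, y_tr, test_size=0.25, random_state=0)

best_err, best_size = 1.0, None
for n_leaves in (4, 8, 16, 50, 100):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_fit, y_fit)
    val_err = 1 - tree.score(X_val, y_val)   # estimate of generalization error
    if val_err < best_err:
        best_err, best_size = val_err, n_leaves
print("selected model size:", best_size)
# The test set stays untouched until the final evaluation.
```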
Model Selection:
Incorporating Model Complexity
Rationale: Occam’s Razor
– Given two models of similar generalization errors,
one should prefer the simpler model over the more
complex model
– A complex model has a greater chance of being fitted
accidentally
– Therefore, one should include model complexity when
evaluating a model
Gen. Error(Model) = Train. Error(Model, Train. Data) + α × Complexity(Model)
Estimating the Complexity of Decision Trees
Pessimistic error estimate of a decision tree T with k leaf nodes:

err_gen(T) = err(T) + Ω × k / N_train

– err(T): error rate on all training records
– Ω: trade-off hyper-parameter (similar to α above), the relative cost of adding a leaf node
– k: number of leaf nodes
– N_train: total number of training records
Estimating the Complexity of Decision Trees: Example
[Figure: two decision trees built from the same 24 training records; TL has 7 leaf nodes and TR has 4 leaf nodes.]

e(TL) = 4/24, e(TR) = 6/24, Ω = 1

e_gen(TL) = 4/24 + 1 × 7/24 = 11/24 = 0.458
e_gen(TR) = 6/24 + 1 × 4/24 = 10/24 = 0.417
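The estimate reduces to simple arithmetic; the helper below reproduces the slide's numbers:

```python
def pessimistic_error(n_errors, n_leaves, n_train, omega=1.0):
    # err_gen(T) = err(T) + Omega * k / N_train = (errors + Omega * k) / N_train
    return (n_errors + omega * n_leaves) / n_train

print(pessimistic_error(4, 7, 24))   # T_L: 11/24 = 0.458...
print(pessimistic_error(6, 4, 24))   # T_R: 10/24 = 0.417...
```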
Estimating the Complexity of Decision Trees
Resubstitution Estimate:
– Uses the training error as an estimate of the generalization error
– Referred to as the optimistic error estimate
e(TL) = 4/24
e(TR) = 6/24

[Figure: the same two decision trees, TL and TR, as on the previous slide.]
Minimum Description Length (MDL)
[Figure: person A has records X1, …, Xn with known class labels y; person B has the same records with unknown labels. A transmits a decision tree model (nodes A?, B?, C?) plus any corrections so that B can reconstruct the labels.]
Cost(Model, Data) = Cost(Data|Model) + α × Cost(Model)
– Cost is the number of bits needed for encoding.
– Search for the least costly model.
Cost(Data|Model) encodes the misclassification errors.
Cost(Model) uses node encoding (number of children)
plus splitting condition encoding.
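A hedged sketch of the MDL accounting; the specific encodings (identifying misclassified records by index, one attribute id per internal node, one bit per leaf label) are illustrative assumptions, not the book's exact scheme:

```python
from math import comb, log2

def mdl_cost(n_errors, n_records, n_internal, n_leaves, n_attributes):
    # Cost(Data|Model): bits to say which records the model misclassifies
    data_bits = log2(comb(n_records, n_errors)) if n_errors else 0.0
    # Cost(Model): each internal node encodes its splitting attribute,
    # each leaf encodes a binary class label (assumptions)
    model_bits = n_internal * log2(n_attributes) + n_leaves
    return data_bits + model_bits

# A bigger tree pays more model bits; a sloppier tree pays more data bits.
print(mdl_cost(n_errors=4, n_records=24, n_internal=6, n_leaves=7, n_attributes=3))
```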
Model Selection for Decision Trees
Pre-Pruning (Early Stopping Rule)
– Stop the algorithm before it becomes a fully-grown tree
– Typical stopping conditions for a node:
◆ Stop if all instances belong to the same class
◆ Stop if all the attribute values are the same
– More restrictive conditions:
◆ Stop if number of instances is less than some user-specified
threshold
◆ Stop if the class distribution of instances is independent of the available features (e.g., using the χ² test)
◆ Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).
◆ Stop if estimated generalization error falls below certain threshold
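Several of these conditions map directly onto stopping hyperparameters of scikit-learn's tree learner; the particular thresholds below are assumptions:

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=20,         # stop when a node has too few instances
    min_impurity_decrease=0.01,   # stop when a split barely improves impurity
    max_depth=10,                 # an additional cap on tree size
    random_state=0,
).fit(X_tr, y_tr)
print("leaves grown:", tree.get_n_leaves())
```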
Model Selection for Decision Trees
Post-pruning
– Grow the decision tree in its entirety
– Subtree replacement
◆ Trim the nodes of the decision tree in a bottom-up
fashion
◆ If generalization error improves after trimming,
replace sub-tree by a leaf node
◆ Class label of leaf node is determined from
majority class of instances in the sub-tree
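scikit-learn offers post-pruning as minimal cost-complexity pruning, a bottom-up scheme related to (though not identical to) the subtree replacement described above; the sketch reuses the X_fit/X_val split from the validation-set example:

```python
from sklearn.tree import DecisionTreeClassifier

full = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
alphas = full.cost_complexity_pruning_path(X_fit, y_fit).ccp_alphas

best_err, best_alpha = 1.0, 0.0
for alpha in alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    pruned.fit(X_fit, y_fit)
    err = 1 - pruned.score(X_val, y_val)   # validation error of the pruned tree
    if err < best_err:
        best_err, best_alpha = err, alpha
print("selected ccp_alpha:", best_alpha)
```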
Example of Post-Pruning
Before splitting (root node):
Class = Yes: 20, Class = No: 10, Error = 10/30
Training error (before splitting) = 10/30
Pessimistic error = (10 + 0.5)/30 = 10.5/30

After splitting on A (children A1, A2, A3, A4):
A1: Class = Yes 8, Class = No 4
A2: Class = Yes 3, Class = No 4
A3: Class = Yes 4, Class = No 1
A4: Class = Yes 5, Class = No 1
Training error (after splitting) = 9/30
Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30

Since 11/30 > 10.5/30, PRUNE the split!
Examples of Post-pruning
Decision Tree:

depth = 1 :
| breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 :
| | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 :
| | | | totalPages <= 6 : class 1
| | | | totalPages > 6 :
| | | | | breadth <= 1 : class 1
| | | | | breadth > 1 : class 0
| | width > 3 :
| | | MultiIP = 0:
| | | | ImagePages <= 0.1333 : class 1
| | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| | | | | breadth > 6 : class 1
| | | MultiIP = 1:
| | | | TotalTime <= 361 : class 0
| | | | TotalTime > 361 : class 1
depth > 1 :
| MultiAgent = 0:
| | depth > 2 : class 0
| | depth <= 2 :
| | | MultiIP = 1: class 0
| | | MultiIP = 0:
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
| | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1

Simplified Decision Tree (after subtree raising and subtree replacement):

depth = 1 :
| ImagePages <= 0.1333 : class 1
| ImagePages > 0.1333 :
| | breadth <= 6 : class 0
| | breadth > 6 : class 1
depth > 1 :
| MultiAgent = 0: class 0
| MultiAgent = 1:
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1
Model Evaluation
Purpose:
– To estimate performance of classifier on previously
unseen data (test set)
Holdout
– Reserve k% for training and (100-k)% for testing
– Random subsampling: repeated holdout
Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
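A minimal k-fold cross-validation sketch using scikit-learn on the synthetic X, y from earlier:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_leaf_nodes=50, random_state=0)
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(tree, X, y, cv=cv)      # one accuracy per fold
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
# Leave-one-out is the special case n_splits = len(X).
```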
Cross-validation Example
3-fold cross-validation
Variations on Cross-validation
Repeated cross-validation
– Perform cross-validation a number of times
– Gives an estimate of the variance of the
generalization error
Stratified cross-validation
– Guarantee the same percentage of class
labels in training and test
– Important when classes are imbalanced and
the sample is small
Use a nested cross-validation approach for model
selection and evaluation
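A sketch of stratified and nested cross-validation under the same assumptions; the inner loop selects the tree size, the outer loop evaluates the whole selection procedure on held-out folds:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # model selection
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation

# Stratified folds preserve class proportions in every split.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_leaf_nodes": [4, 16, 50, 100]},
    cv=inner,
)
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested-CV accuracy: {scores.mean():.3f}")
```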