ĐẠI HỌC CÔNG NGHỆ, ĐHQGHN
VNU-University of Engineering and Technology
INT3405 - Machine Learning
Lecture 7: Model Optimization
Hanoi, 10/2024
Outline
● True Error versus Empirical Error
● Overfitting, Underfitting
● Bias-Variance Tradeoff
● Model Optimization
○ Feature Selection
○ Regularization
○ Model Ensemble
Recap: Support Vector Machines (SVM)
[Figure: linear SVM maximum-margin separator with the support vectors highlighted]
Recap: Soft Margin SVM
● Start from the standard linear SVM:
○ Introduce slack variables
○ Relax the constraints
○ Penalize the relaxation
Primal Problem:
C is a regularization parameter; the soft-margin SVM trades off between maximizing the margin and minimizing the misclassification error.
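For reference, the primal problem referred to above is the standard soft-margin formulation (notation assumed here: training pairs (xᵢ, yᵢ) with yᵢ ∈ {−1, +1}, slack variables ξᵢ):

$$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i=1,\dots,n$$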
True Error versus Empirical Error
• True Error/Risk: target performance measure
• Classification: probability of misclassification
• Regression: mean squared error
• Performance on a random test/unseen point (X,Y)
• Empirical Error/Risk: performance on training data
• Classification: proportion of misclassified examples
• Regression: average squared error
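In symbols (standard definitions; the notation is assumed here: loss ℓ, predictor f, data distribution P over (X, Y), training set {(xᵢ, yᵢ)}, i = 1..n):

$$R(f) = \mathbb{E}_{(X,Y)\sim P}\big[\ell(f(X), Y)\big] \qquad\qquad \hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i), y_i\big)$$

with ℓ the 0–1 loss for classification and the squared loss for regression.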
Overfitting
Example: Linear regression (housing prices)
[Figure: three regression fits of price vs. size, with increasing model complexity]
Overfitting: if we have too many features (an overly complicated predictor), the learned hypothesis may fit the training set very well but fail to generalize to new examples (e.g., fail to predict prices for houses it has not seen).
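A minimal sketch of this effect (not from the slides; synthetic "price vs. size" data and degrees chosen purely for illustration): the training error keeps shrinking as the polynomial degree grows, while the test error typically does not.

```python
# Illustrative sketch: synthetic "price vs. size" data, polynomials of
# increasing degree fitted by least squares; compare train vs. test MSE.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 3, 30))            # house size (arbitrary units)
y = 2 * x + rng.normal(0, 0.5, 30)            # price = linear trend + noise
x_test = np.sort(rng.uniform(0, 3, 30))
y_test = 2 * x_test + rng.normal(0, 0.5, 30)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                          # fit on training data only
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:>2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```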
Overfitting versus Underfitting
Example: Logistic regression
[Figure: three decision boundaries of increasing complexity, labelled "Underfitting" (left) and "Overfitting" (right); g = sigmoid function]
Model Complexity
[Figure: error vs. model complexity — empirical (training) error decreases as complexity grows, while true error is U-shaped; underfitting on the left, the best model at the minimum of the true error, overfitting on the right]
Empirical error (training error) is no longer a good indicator of true error
Examples of Model Complexity
● Examples of Model Spaces with increasing complexity:
○ Regression with polynomials of order k=0,1,2,…
Higher degree => higher complexity
○ Decision Trees with depth k or with k leaves
Higher depth / more leaves => higher complexity
○ KNN classifiers with varying neighbourhood sizes k = 1, 2, 3, …
Smaller neighbourhood (smaller k) => higher complexity
Risk Analysis (1)
• True Error/Risk vs Empirical Error/Risk
• Optimal Predictor
• Empirical Risk Minimization (ERM) over a model class
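In the usual notation (the slide's own symbols did not survive extraction, so this is the standard formulation):

$$f^{*} = \arg\min_{f} R(f) \quad \text{(optimal / Bayes predictor)} \qquad\qquad \hat{f}_n = \arg\min_{f\in\mathcal{F}} \hat{R}_n(f) \quad \text{(ERM over the model class } \mathcal{F}\text{)}$$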
Risk Analysis (2)
• Excess Risk = Estimation error + Approximation error
• Estimation error: due to the randomness of the training data (finite sample size)
• Approximation error: due to the restriction to the model class
[Figure: excess risk split into estimation error and approximation error]
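Written out with the notation above, and with $f_{\mathcal{F}} = \arg\min_{f\in\mathcal{F}} R(f)$ the best predictor in the class:

$$\underbrace{R(\hat{f}_n) - R(f^{*})}_{\text{excess risk}} \;=\; \underbrace{R(\hat{f}_n) - R(f_{\mathcal{F}})}_{\text{estimation error}} \;+\; \underbrace{R(f_{\mathcal{F}}) - R(f^{*})}_{\text{approximation error}}$$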
Bias-Variance Trade-off
• Regression: note that even the optimal predictor does not have zero error (there is irreducible noise)
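The standard squared-error decomposition (assumptions: Y = f*(X) + ε with E[ε] = 0, Var(ε) = σ², and the expectation taken over training sets D):

$$\mathbb{E}\big[(Y - \hat{f}_D(x))^2\big] \;=\; \underbrace{\sigma^2}_{\text{noise}} \;+\; \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f^{*}(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}$$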
Bias
Variance
Bias and Variance: Intuition
[Figure: 2×2 illustration combining low/high variance (columns) with high bias (underfitting) and low bias (overfitting) (rows)]
Bias-Variance Trade-off
• High bias, low variance: poor approximation but robust/stable
• Low bias, high variance: good approximation but unstable
[Figure: fits obtained from 3 independent training datasets]
Model Optimization
[Figure: true vs. empirical error as a function of model complexity (as above); the best model lies between underfitting and overfitting]
Learning Curve
[Figure: two learning curves of error vs. training set size — high bias: training and test/CV error converge to a high error; high variance: a large gap remains between the low training error and the test/CV error]
How to Address Overfitting
● Reduce number of features
○ Feature selection
○ Model selection algorithms
● Regularization
○ Incorporate model complexity for optimization, penalize
complex models using prior knowledge
○ Keep all the features, but reduce magnitude/values of model
parameters
○ Works well when we have a lot of features, each of which
contributes a bit to the prediction
Feature Selection
Idea: find the subset of features that yields the best model => reduces the model complexity
Feature Selection Methods
Supervised Feature Selection
Feature Selection - Filter Methods
Idea: compute an importance score for each feature => choose the most important ones
Feature Selection - Information Gain
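The slide's formula did not survive extraction; the usual information-gain score (reconstructed here as an assumption) measures how much a feature Xⱼ reduces the entropy of the label Y:

$$IG(Y, X_j) = H(Y) - H(Y \mid X_j), \qquad H(Y) = -\sum_{y} p(y)\log p(y)$$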
Feature Selection - Chi-square
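Likewise reconstructing the standard chi-square score (an assumption): it tests independence between a (discrete) feature and the class label using a contingency table of observed counts O and expected counts E under independence:

$$\chi^2 = \sum_{\text{cells}} \frac{(O - E)^2}{E}$$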
Feature Selection - Wrapper Methods
Idea: gradually add (or remove) features, using the learner's own performance to decide which to keep
Feature Selection - Regularization
• Regularized learning framework: the training objective includes a term for the cost / complexity of the model (generic form below)
• Penalize complex models using prior knowledge.
• Two Examples
• Regularized Linear Regression (ridge regression)
• Regularized Logistic Regression
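A generic form of the framework referenced above (notation assumed):

$$\min_{f\in\mathcal{F}}\ \hat{R}_n(f) + \lambda\,\Omega(f), \qquad \lambda \ge 0$$

where Ω(f) measures the cost/complexity of the model.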
Regularized Linear Regression
• Linear Regression
• Regularized Linear Regression
• Choice of regularizer:
• "Ridge Regression": squared L2 norm of the weights
• "Lasso" (Least Absolute Shrinkage and Selection Operator): L1 norm of the weights
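The two objectives in a standard form (notation assumed: weights w, examples (xᵢ, yᵢ)):

$$\text{Ridge:}\ \min_{w}\ \sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \lambda\lVert w\rVert_2^2 \qquad\qquad \text{Lasso:}\ \min_{w}\ \sum_{i=1}^{n}\big(y_i - w^\top x_i\big)^2 + \lambda\lVert w\rVert_1$$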
Regularized Logistic Regression
• L2-regularized Logistic Regression
• L1-regularized Logistic Regression ("Sparse Logistic Regression")
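In a standard form (assumption: labels yᵢ ∈ {−1, +1}, logistic loss):

$$\min_{w}\ \sum_{i=1}^{n}\log\big(1 + e^{-y_i\, w^\top x_i}\big) + \lambda\lVert w\rVert_2^2 \ \ \text{(L2)} \qquad\qquad \min_{w}\ \sum_{i=1}^{n}\log\big(1 + e^{-y_i\, w^\top x_i}\big) + \lambda\lVert w\rVert_1 \ \ \text{(L1, sparse)}$$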
Model Ensemble
• Basic idea: instead of learning one model, learn several and combine them
• Typically improves the accuracy, often by a lot
Why does It Work?
Suppose there are 25 base classifiers
• Each classifier has error rate, ε = 0.35
• Assume classifiers are independent
• Probability that the (majority-vote) ensemble classifier makes a wrong prediction, i.e., that at least 13 of the 25 classifiers misclassify:
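Under the independence assumption this is a binomial tail probability; the ≈ 0.06 value is the standard result quoted for this example:

$$P(\text{ensemble errs}) = \sum_{i=13}^{25}\binom{25}{i}\,\varepsilon^{i}(1-\varepsilon)^{25-i} \approx 0.06 \ \ll\ \varepsilon = 0.35$$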
Bagging Classifiers
• In general, sampling from P(h|D) is difficult
• P(h|D) is difficult to compute
• P(h|D) is impossible to compute for non-probabilistic classifiers such as SVM
• Bagging classifiers:
• Approximate sampling from P(h|D) by sampling training examples
Bootstrap Sampling
• Bagging = Bootstrap aggregating
• Bootstrap sampling: given a set D containing n training examples
• Create Di by drawing n examples at random with replacement from D
• Each Di is expected to leave out about 37% of the examples in D
Bagging
• Sampling with replacement
• Build a classifier on each bootstrap sample
• In a single draw, each example has probability (1 – 1/n) of not being picked
• So each example has probability (1 – 1/n)^n of never appearing in a given bootstrap sample
• This value tends to 1/e ≈ 0.37 for large n
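A quick sanity-check sketch (illustrative only; the sample size and library use are my own choices, not from the slides):

```python
# Empirically check that a bootstrap sample of size n leaves out roughly 37%
# of the original examples (for reasonably large n).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
left_out_fractions = []
for _ in range(100):                        # 100 bootstrap samples
    idx = rng.integers(0, n, size=n)        # draw n indices with replacement
    left_out = n - len(np.unique(idx))      # examples never drawn
    left_out_fractions.append(left_out / n)
print(f"average fraction left out: {np.mean(left_out_fractions):.3f}")  # ~0.368
```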
Bagging Algorithm
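The slide's algorithm listing did not survive extraction; below is a minimal bagging sketch under assumed choices (decision-tree base learners, majority vote over non-negative integer class labels), not necessarily the slide's exact version.

```python
# Minimal bagging sketch (assumptions: numpy arrays X, y with non-negative
# integer class labels; decision-tree base learners; majority vote).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=25, seed=0):
    """Train n_estimators trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the base classifiers' predictions."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```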
Bagging ~ Bayesian Average
Bagging can be seen as approximating the Bayesian average over hypotheses: instead of sampling models from P(h|D), we train models on bootstrap resamples of D and average their predictions.
Inefficiency with Bagging
• Inefficient bootstrap sampling:
• Every example has equal chance to be
sampled
• No distinction between “easy”
examples and “difficult” examples
• Inefficient model combination:
• A constant weight for each classifier
• No distinction between accurate
classifiers and inaccurate classifiers
Improve the Efficiency of Bagging
• Better sampling strategy
• Focus on the examples that are difficult to classify
• Better combination strategy
• Accurate models should be assigned larger weights
Intuition
Boosting: Example
• Instances that are wrongly classified will have their weights increased
• Instances that are correctly classified will have their weights decreased
• Example 4 is hard to classify
• Its weight is increased, therefore it is more likely to be
chosen again in subsequent rounds
AdaBoost
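The algorithm listing did not survive extraction; for reference, the standard binary AdaBoost update (labels y ∈ {−1, +1}) is: at round t, train hₜ on the weighted data, compute its weighted error εₜ = Σᵢ wᵢ·1[hₜ(xᵢ) ≠ yᵢ], and then

$$\alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad w_i \leftarrow \frac{w_i\,e^{-\alpha_t\, y_i\, h_t(x_i)}}{Z_t}, \qquad H(x) = \operatorname{sign}\Big(\sum_{t}\alpha_t\, h_t(x)\Big)$$

where Zₜ normalizes the weights to sum to 1.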
AdaBoost Example
Stacking
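The slide's diagram did not survive extraction. Stacking trains a meta-learner on the (ideally out-of-fold) predictions of several base learners; a minimal sketch under assumed choices (scikit-learn base learners, logistic-regression meta-learner), not necessarily the slide's own setup:

```python
# Minimal stacking sketch: base learners feed their predictions to a
# logistic-regression meta-learner, using out-of-fold predictions (cv=5).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # meta-learner trained on base predictions
    cv=5,                                   # out-of-fold predictions avoid leakage
)
# Usage: stack.fit(X_train, y_train); y_pred = stack.predict(X_test)
```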
Summary
● True Error versus Empirical Error
● Overfitting, Underfitting
● Bias-Variance Tradeoff
● Model Optimization
○ Feature Selection
○ Regularization
○ Model Ensemble
Thank you