Regularization
Master 2 course unit, AOS1
Fall 2020
S. Rousseau
General definition
What is regularization? From Goodfellow et al. 2016:
Regularization is any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.
• Adding a penalty term to a loss function
• Data augmentation
• Early stopping
• ...
Today’s course focuses on the first point.
What is learning?
Learning consists in minimizing a training objective
    f̂ = argmin_{f ∈ H} L̂(f)
• H is the set of admissible classification/regression functions
• L̂ is an empirical loss function (computed on a training set)
• f̂ is the learnt solution
Most of the time the set H is parametrized by a parameter θ ∈ Θ
    θ̂ = argmin_{θ ∈ Θ} L̂(θ)
Application to linear regression
• The set of admissible solutions is the set of linear functions
    H = {linear functions}
• Linear functions on ℝᵖ are parametrized by β ∈ ℝᵖ
    H = {x ↦ ⟨x, β⟩, x ∈ ℝᵖ}
• Training objective is the residual sum of squares (RSS)
  Empirical loss function based on the square loss: the prediction is ⟨xᵢ, β⟩, the observation is yᵢ
    L̂(β) = RSS(β) = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)²
Application to linear regression (2)
• The learning algorithm is then
    β̂^ols = argmin_{β ∈ ℝᵖ} ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)²    (ordinary least squares)
• Compact matrix formulation: X = (x₁ᵀ, …, xₙᵀ)ᵀ (one sample per row) and y = (y₁, …, yₙ)ᵀ
    β̂^ols = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖²
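As a minimal sketch (not from the slides), the compact matrix formulation can be solved directly with NumPy; the data and variable names below are synthetic and purely illustrative:

    import numpy as np

    # Synthetic data: n samples, p features (illustrative values).
    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))               # design matrix, one sample per row
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Least-squares solution of min_beta ||y - X beta||^2.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_ols)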
Regularization
• Unregularized objective
    θ̂ = argmin_{θ ∈ Θ} L̂(θ)    (1)
• Regularized objective
    θ̂_λ^reg = argmin_{θ ∈ Θ} L̂(θ) + λ·R(θ)
• L̂(θ) is the training objective
• λ is the tuning parameter
• R(θ) is the regularization term
Regularization (2)
Regularized objective
    θ̂_λ^reg = argmin_{θ ∈ Θ} L̂(θ) + λ·R(θ)
• R(θ) penalizes some θ’s
• λ > 0 is the strength of the penalty
• λ = 0: no penalty, the unregularized solution
• λ → +∞: the solution tends to argmin_{θ ∈ Θ} R(θ)
• Some tradeoff has to be found between the two extreme cases
Ridge regularization
The simplest regularization we can think of:
• We choose R(θ) = ‖θ‖² = ∑_{i=1}^p θᵢ²
• Penalizes large parameters: prevents the βᵢ's from exploding
• Ridge regularization is then
    θ̂_λ^ridge = argmin_{θ ∈ Θ} L̂(θ) + λ‖θ‖²    (ridge regularization)
Also known as L2-regularization, Tikhonov regularization, or weight decay (in neural networks)
Application to ridge regression
• The previous linear regression learning algorithm was
    β̂^ols = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖²
• Adding the ridge regularizer term yields
    β̂_λ^ridge = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖²    (ridge regression)
              = argmin_{β ∈ ℝᵖ} ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + λ ∑_{i=1}^p βᵢ²
Solution to ridge regression
    β̂_λ^ridge = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖²    (ridge regression)
• Define the penalized residual sum of squares (PRSS) as
    PRSS(β) = ‖y − Xβ‖² + λ‖β‖²
• PRSS is (strictly) convex w.r.t. β: unique solution
• By differentiating w.r.t. β we get
    ∇_β PRSS = −2Xᵀ(y − Xβ) + 2λβ    (2)
Solution to ridge regression
• Setting the derivative to zero we finally get
    β̂_λ^ridge = (XᵀX + λI_p)⁻¹ Xᵀy    (3)
• Fitted values are then
    ŷ^ridge = Xβ̂_λ^ridge = X(XᵀX + λI_p)⁻¹ Xᵀy
• For λ = 0 we have the OLS solution
    β̂_{λ=0}^ridge = β̂^ols
    ŷ^ols = X(XᵀX)⁻¹ Xᵀy
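A minimal NumPy sketch of the closed-form solution (3), again on synthetic, illustrative data; a linear solve is used instead of explicitly inverting XᵀX + λI_p:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    lam = 1.0                                  # regularization strength lambda
    # Ridge closed form: beta = (X^T X + lam * I_p)^{-1} X^T y.
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    y_hat_ridge = X @ beta_ridge               # fitted values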
Caveats
• The intercept (if present) should not be included in the regularization term. Two possible strategies:
  • center the design matrix X so that there is no intercept
  • or remove the intercept from the penalized parameters (define β⋆ as β without β₀)
• The features should be on the same scale; unlike linear regression, ridge regression predictions are sensitive to feature rescaling
Properties of ridge regression
• Unlike linear regression, there is always a solution (when λ > 0)
• XᵀX is positive semi-definite
• XᵀX + λI_p is then positive definite when λ > 0, hence invertible
• It improves the conditioning of the problem
• Like linear regression but unlike Lasso, it admits a closed-form solution
    β̂_λ^ridge = (XᵀX + λI_p)⁻¹ Xᵀy
• Invariant to rotation: if Y = XUᵀ is a rotation of the samples, then β̂_Y = Uβ̂_X
• Unlike linear regression, both the βᵢ estimates and the predictions are biased
• The βᵢ estimates are drawn toward zero w.r.t. the OLS solution
• Might have lower variance than OLS
β̂_λ^ridge is biased
Suppose that the design matrix X is fixed (conditioning on it)
Linear (OLS) case (using E[y] = Xβ):

    E[β̂^ols] = E[(XᵀX)⁻¹ Xᵀy]
             = (XᵀX)⁻¹ Xᵀ E[y]
             = (XᵀX)⁻¹ XᵀX β
             = β

    Unbiased!

Ridge case:

    E[β̂_λ^ridge] = E[(XᵀX + λI_p)⁻¹ Xᵀy]
                 = (XᵀX + λI_p)⁻¹ Xᵀ E[y]
                 = (XᵀX + λI_p)⁻¹ XᵀX β
                 = (I_p + λ(XᵀX)⁻¹)⁻¹ β ≠ β

    Biased!
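A small Monte Carlo sketch (not from the slides) illustrating these computations: with X fixed, the OLS estimates average out to β while the ridge estimates are pulled toward zero. The data-generating choices are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 100, 3, 10.0
    X = rng.normal(size=(n, p))                # fixed design matrix
    beta = np.array([1.0, -2.0, 0.5])          # true parameter

    ols_estimates, ridge_estimates = [], []
    for _ in range(2000):
        y = X @ beta + rng.normal(size=n)      # new noise realization, same X
        ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
        ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

    # The OLS average is close to beta; the ridge average is pulled toward zero.
    print("mean OLS  :", np.mean(ols_estimates, axis=0))
    print("mean ridge:", np.mean(ridge_estimates, axis=0))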
Data augmentation interpretation
• Let’s rewrite the PRSS
    PRSS(β) = ‖y − Xβ‖² + λ‖β‖²
            = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + λ ∑_{i=1}^p βᵢ²
            = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + ∑_{i=1}^p (0 − ⟨√λ eᵢ, β⟩)²
• Same as adding p extra samples in addition to the n xᵢ's
• The additional samples and observations are (√λ eᵢ, 0) for i = 1, …, p
• Same as adding, on each axis, a sample whose observation is zero
Data augmentation interpretation (cont’d)
• Switching back to matrix form we define
the augmented design matrix X_λ (the matrix X with the p rows of √λ I_p stacked below it) and
the augmented observation vector y′ = (y₁, …, yₙ, 0, …, 0)ᵀ
• And the PRSS can now be written
    PRSS(β) = ‖y′ − X_λ β‖²
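A small sketch (not from the slides) checking this interpretation numerically: ordinary least squares on the augmented pair (X_λ, y′) recovers the ridge solution. The data are synthetic and the names illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 50, 3, 5.0
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Augmented data: p extra rows sqrt(lam) * e_i with zero observations.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])

    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)   # OLS on augmented data
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(beta_aug, beta_ridge))                   # True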
Geometric interpretation of β̂^ridge
• Suppose the 2-dimensional case (p = 2); β̂^ols is the OLS solution
• The ellipses are level lines of ‖y − Xβ‖²
• A solution for some λ is at the intersection of the L2 ball and a level line
• Whatever the shape of the ellipses, the ridge solution is systematically drawn toward zero
[Figure: level lines of ‖y − Xβ‖² in the (β₁, β₂) plane; β̂^ridge lies on the L2 ball, closer to the origin than β̂^ols.]
Geometric interpretation of ŷ^ols
Let us first look at the OLS case
• Let X = USVᵀ be the SVD of X
• The matrix U gathers the (unit) principal components u₁, …, uₖ
• Ordinary least squares orthogonally projects y onto the space spanned by the
columns of X :
    ŷ^ols = X(XᵀX)⁻¹ Xᵀy = UUᵀy = ∑_{i=1}^p (uᵢᵀy) uᵢ
[Figure: y is orthogonally projected onto Span(X), giving ŷ^ols = Xβ̂^ols.]
Geometric interpretation of ŷ^ridge (cont’d)
• Ridge regression is doing the same thing plus an additional “shrinking” step
    ŷ^ridge = X(XᵀX + λI_p)⁻¹ Xᵀy = U S(S² + λI_p)⁻¹S Uᵀy = ∑_{i=1}^p σᵢ²/(σᵢ² + λ) (uᵢᵀy) uᵢ
• The coordinates are now shrunk toward zero since σᵢ²/(σᵢ² + λ) < 1
• Remember that the uᵢ are here the (unit) principal components of X
• The smaller the variance of a principal component, the more it is shrunk
• A smooth version of PCA followed by linear regression
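A short sketch (not from the slides) verifying the SVD expression against the direct ridge formula on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 50, 3, 2.0
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(s) Vt
    shrink = s**2 / (s**2 + lam)                       # shrinkage factor per component

    y_hat_svd = U @ (shrink * (U.T @ y))               # shrink each coordinate u_i^T y
    y_hat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(y_hat_svd, y_hat_direct))        # True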
Geometric interpretation of ŷ^ridge (cont’d)
• The coordinate of ŷ^ols along u₁ is shrunk by σ₁²/(σ₁² + λ)
• The coordinate of ŷ^ols along u₂ is shrunk by σ₂²/(σ₂² + λ)
[Figure: in Span(X), ŷ^ridge is obtained from ŷ^ols by shrinking its coordinates along u₁ and u₂.]
Geometric interpretation of β̂_λ^ridge
What is the link between the unknown parameter β and its ridge estimate β̂_λ^ridge?
• Let X = USVᵀ be the SVD of X
• Recall that V gathers the principal directions (new basis of representation)
• Let's look at β̂_λ^ridge in the basis V; we can show that
    Vᵀβ̂_λ^ridge = (I_p + λS⁻²)⁻¹ Vᵀβ̂^ols
  In the basis defined by V, the OLS estimate is shrunk
• If we look at the i-th coordinate
    (Vᵀβ̂_λ^ridge)ᵢ = σᵢ²/(λ + σᵢ²) · (Vᵀβ̂^ols)ᵢ
Geometric interpretation of β̂_λ^ridge
    (Vᵀβ̂_λ^ridge)ᵢ = σᵢ²/(λ + σᵢ²) · (Vᵀβ̂^ols)ᵢ
• The coordinate of β̂_λ^ridge along v₁ is shrunk by σ₁²/(σ₁² + λ)
• The coordinate of β̂_λ^ridge along v₂ is shrunk by σ₂²/(σ₂² + λ)
[Figure: the (β₁, β₂) plane with the principal directions v₁ and v₂, along which the coordinates are shrunk.]
Gradient descent interpretation
• General gradient descent update with step η
    θ^{k+1} = θ^k − η ∇L(θ^k)
• Gradient descent update for OLS (L = RSS)
    β^{k+1} = β^k − η ∇RSS(β^k)    (4)
• Gradient descent update for ridge regression
    β^{k+1} = β^k − η ∇PRSS(β^k)
• Which writes
    β^{k+1} = (1 − 2ηλ)β^k − η ∇RSS(β^k)    (5)
The update uses a shrunk version of β^k
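A minimal sketch (not from the slides) of update (5) on synthetic data, showing that the iterates converge to the ridge closed-form solution; the step size η is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam, eta = 50, 3, 1.0, 1e-3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    beta = np.zeros(p)
    for _ in range(5000):
        grad_rss = -2 * X.T @ (y - X @ beta)           # gradient of the RSS
        beta = (1 - 2 * eta * lam) * beta - eta * grad_rss

    # The iterates converge to the ridge closed-form solution.
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(beta, beta_ridge))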
Effect on the βi ’s: the regularization path
The regularization path is the plot of the βᵢ's against the regularization parameter λ.
Here for some centered dataset with 10 covariates:
• The βᵢ's tend to the linear regression coefficients as λ goes to 0
• When regularization is too strong, we fit the constant function
• The βᵢ's are shrunk as λ increases
• The βᵢ's might not all be non-increasing, but ‖β‖ is
• The βᵢ's are never exactly zero
[Figure: ridge regularization path, the βᵢ's against λ on a log scale (λ from 10⁻² to 10²).]
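A hedged sketch (not from the slides) of how such a path can be computed with scikit-learn by refitting Ridge over a grid of alpha values (scikit-learn's name for λ); data and plotting details are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(scale=100, size=p) + rng.normal(size=n)

    # Refit ridge regression for a grid of alpha values (alpha plays the role of lambda).
    alphas = np.logspace(-2, 2, 50)
    coefs = [Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_ for a in alphas]

    plt.semilogx(alphas, coefs)
    plt.xlabel("lambda")
    plt.ylabel("beta_i")
    plt.show()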
Effect on the fitted line
In the case of simple linear regression (p = 1)
• Regularizing is like adding the point (√λ, 0)
• As λ goes to infinity, we fit a constant function
[Figure: fitted lines for the truth, OLS, and ridge with λ = 1, 10, 50, and λ → +∞.]
Effect on fitted curve
Adding polynomial features (degree 15): Xᵢ, Xᵢ², …, Xᵢ¹⁵
[Figure: fitted curves with polynomial features for linear regression (λ = 0) and ridge with λ = 2 and λ = 10, together with the ground truth and the training points.]
Effect on the βᵢ estimates: bias–variance tradeoff
The slope estimate β̂₁ is biased but its variance is smaller
[Figure: distributions of the fitted slopes for OLS and ridge (λ = 1000), compared with the truth.]
Effect on predictions: bias–variance tradeoff
Predictions at 0 (for example) are biased but less spread out
[Figure: distributions of the predictions at 0 for OLS and ridge (λ = 1000), compared with the truth.]
Lasso regularization
Another famous regularization:
• We choose R(θ) = ‖θ‖₁ = ∑_{i=1}^p |θᵢ|
• Penalizes large parameters: prevents the βᵢ's from exploding
• Lasso regularization is then
    θ̂_λ^lasso = argmin_{θ ∈ Θ} L̂(θ) + λ‖θ‖₁    (Lasso regularization)
• Lasso linear regression
    β̂_λ^lasso = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖₁    (Lasso regression)
Sparsity promoting property
Compare the coefficients βi ’s with ridge and lasso regularization
[Figure: regularization paths of the βᵢ's for ridge (left) and lasso (right), against λ on a log scale (λ from 10⁻³ to 10³).]
Effect on the βi ’s: the regularization path
• The βᵢ's tend to the linear regression coefficients as λ goes to 0
• The βᵢ's are shrunk as λ increases
• All the βᵢ's are shrunk to exactly zero at some point: sparsity-promoting effect
• When regularization is too strong, we fit the constant function
[Figure: lasso regularization path, the βᵢ's against λ on a log scale (λ from 10⁻² to 10²).]
Explaining the sparsity property
• β̂^ols is the ordinary least squares solution
[Figure: level lines of the RSS in the (β₁, β₂) plane, with the L2 ball and β̂^ridge (left) and the L1 ball and β̂^lasso (right).]
Ridge: the solution can be anywhere on the L2 ball. Lasso: the solution lies on an edge of the L1 ball (sparse solution).
Geometric interpretation of β̂_λ^lasso
For simplicity, suppose that n = p and X is the identity matrix
• The OLS solution is β̂^ols = y
• Lasso regression reads
    β̂_λ^lasso = argmin_{β ∈ ℝᵖ} ‖y − β‖² + λ‖β‖₁
• The problem separates coordinate-wise
    (β̂_λ^lasso)ᵢ = argmin_{βᵢ ∈ ℝ} (yᵢ − βᵢ)² + λ|βᵢ|
    (β̂_λ^lasso)ᵢ = s_{λ/2}(yᵢ) = max(yᵢ − λ/2, 0) if yᵢ > 0, min(yᵢ + λ/2, 0) if yᵢ < 0    (soft thresholding)
[Figure: the soft-thresholding function s_{λ/2}(yᵢ): zero on [−λ/2, λ/2], shifted identity outside.]
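A minimal sketch (not from the slides) of the soft-thresholding operator, written in the equivalent form sign(y)·max(|y| − λ/2, 0):

    import numpy as np

    def soft_threshold(y, t):
        """Soft thresholding s_t(y) = sign(y) * max(|y| - t, 0)."""
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    lam = 1.0
    y = np.array([-3.0, -0.2, 0.0, 0.4, 2.5])
    print(soft_threshold(y, lam / 2))   # small entries are set exactly to zero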
Gradient descent interpretation
• Using a subgradient to differentiate ‖·‖₁:
    ∇_β ‖β‖₁ = sign(β)
• Gradient descent update for lasso regression
    β^{k+1} = β^k − η ∇RSS(β^k) − ηλ sign(β^k)
• Shrinks the βᵢ's regardless of their magnitude
Lasso properties
• No closed-form solution
• Convex problem
• Feature selection ability
• Biased predictions and parameter estimates
• Might be unstable in the presence of highly correlated variables
Elastic net
Why elastic-net?
• Ridge regression is not selecting variables (all the βᵢ's are nonzero)
• Lasso regression does select variables, but in an unstable way
• Small changes in X might lead to an entirely different set of selected predictors
[Figure: the L1 ball in the (β₁, β₂) plane with β̂^ols and β̂^lasso.]
Elastic net
Mixing the two strategies
    θ̂_{λ,α}^elastic = argmin_{θ ∈ Θ} L̂(θ) + λ(α‖θ‖₁ + (1 − α)‖θ‖₂²)    (Elastic net regularization)
• λ is the regularizing parameter
• α controls the balance between L1 and L2 regularizing terms
[Figure: (a) the L2 ball: no selection; (b) the L1 ball: unstable selection; (c) the elastic net ball: stable selection.]
Elastic net
Explaining the stability of elastic net regularization
• Corners are still sharp: the elastic net is still encouraging sparsity
• The elastic net ball is also round (4 portions of (big) circles): stable when some variables are strongly correlated
[Figure: the elastic net ball in the (β₁, β₂) plane with β̂^ols and β̂^elnet.]
Ridge/Lasso/Elastic net regression in Python and Scikit–Learn
• Import the linear model estimators
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
• Instantiate (no parameter), fit and predict
    lr = LinearRegression()
    lr.fit(X, y)
    res = lr.predict(new_X)
• Instantiate with the tuning parameter alpha, fit and predict
    lr = Ridge(alpha=1.0)
    lr.fit(X, y)
    res = lr.predict(new_X)
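For completeness, a sketch (not on the original slide) of the Lasso and ElasticNet estimators used the same way; the synthetic data only make the snippet self-contained, and l1_ratio plays a role analogous to α above (scikit-learn's exact penalty parametrization differs slightly from the slide's formula):

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    # Synthetic data, only to make the snippet self-contained.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
    new_X = rng.normal(size=(5, 10))

    # Lasso: alpha plays the role of lambda.
    lasso = Lasso(alpha=1.0).fit(X, y)
    res_lasso = lasso.predict(new_X)

    # Elastic net: alpha is the overall strength, l1_ratio balances the L1 and L2 terms.
    enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
    res_enet = enet.predict(new_X)
    print(lasso.coef_)
    print(enet.coef_)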
References
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer, 2001.
Ian Goodfellow et al. Deep Learning. Vol. 1. Cambridge: MIT Press, 2016.