Lecture05 Regularization

The document provides an overview of regularization in machine learning, specifically focusing on its definition, applications, and techniques such as ridge regression. It explains how regularization modifies learning algorithms to reduce generalization error and includes mathematical formulations for linear regression and ridge regression. The document also discusses properties, interpretations, and the effects of regularization on model parameters and predictions.


Regularization

UE de Master 2, AOS1
Fall 2020

S. Rousseau

1
General definition

What is regularization? From Goodfellow et al. 2016:


Regularization is any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.

• Adding a penalty term to the loss function
• Data augmentation
• Early stopping
• ...

Today's course focuses on the first point.

2
What is learning?

Learning consists in minimizing a training objective

$$\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{L}(f)$$

• H is the set of admissible classification/regression functions
• $\hat{L}$ is an empirical loss function (computed on a training set)
• $\hat{f}$ is the learnt solution

Most of the time the set H is parametrized by a parameter θ ∈ Θ

$$\hat{\theta} = \arg\min_{\theta \in \Theta} \hat{L}(\theta)$$

3
Application to linear regression

• Set of admissible solutions is linear functions

H = {linear functions}

• Linear functions on Rp are parametrized by β ∈ Rp

$$\mathcal{H} = \{\, x \mapsto \langle x, \beta \rangle,\; x \in \mathbb{R}^p \,\}$$

• Training objective is the residual sum of squares (RSS)


Empirical loss function based on the square loss: the prediction is $\langle x_i, \beta \rangle$, the observation is $y_i$

$$\hat{L}(\beta) = \mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - \langle x_i, \beta \rangle)^2$$

4
Application to linear regression (2)

• The learning algorithm is then


$$\hat{\beta}^{\mathrm{ols}} = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \langle x_i, \beta \rangle)^2 \quad \text{(ordinary least squares)}$$

• Compact matrix formulation with $X = \begin{pmatrix} x_1^T \\ \vdots \\ x_n^T \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$:

$$\hat{\beta}^{\mathrm{ols}} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert^2$$
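To make this concrete, here is a minimal NumPy sketch (not from the slides) of minimizing the RSS on synthetic data; the dataset and variable names are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))          # design matrix
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

# arg min_beta ||y - X beta||^2, solved by least squares
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)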

5
Regularization

• Unregularized objective
$$\hat{\theta} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) \tag{1}$$

• Regularized objective
$$\hat{\theta}^{\mathrm{reg}}_{\lambda} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) + \lambda \cdot R(\theta)$$

• $\hat{L}(\theta)$: training objective
• $\lambda$: tuning parameter
• $R(\theta)$: regularization term

6
Regularization (2)

Regularized objective
$$\hat{\theta}^{\mathrm{reg}}_{\lambda} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) + \lambda \cdot R(\theta)$$

• R(θ) penalizes some θ's
• λ > 0 is the strength of the penalty
• λ = 0: no penalty: the unregularized solution
• λ → +∞: the solution tends to $\arg\min_{\theta \in \Theta} R(\theta)$
• Some tradeoff has to be found between the two extreme cases

7
Ridge regularization

The simplest regularization we can think of:

• We choose $R(\theta) = \lVert \theta \rVert^2 = \sum_{i=1}^{p} \theta_i^2$
• Penalizes large parameters: prevents the βi's from exploding
• Ridge regularization is then
$$\hat{\theta}^{\mathrm{ridge}}_{\lambda} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) + \lambda \lVert \theta \rVert^2 \quad \text{(ridge regularization)}$$

Also known as $L_2$-regularization, Tikhonov regularization or weight decay (neural networks)

8
Application to ridge regression

• Previous linear regression learning algorithm was

$$\hat{\beta}^{\mathrm{ols}} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert^2$$

• Adding the ridge regularizer term yields

$$\begin{aligned}
\hat{\beta}^{\mathrm{ridge}}_{\lambda} &= \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 \quad \text{(ridge regression)} \\
&= \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \langle x_i, \beta \rangle)^2 + \lambda \sum_{i=1}^{p} \beta_i^2
\end{aligned}$$

9
Solution to ridge regression

$$\hat{\beta}^{\mathrm{ridge}}_{\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 \quad \text{(ridge regression)}$$

• Define the penalized residual sum of squares (PRSS) as

$$\mathrm{PRSS}(\beta) = \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2$$

• PRSS is (strictly) convex w.r.t. β: unique solution


• By differentiating w.r.t. β we get

$$\nabla_{\beta} \mathrm{PRSS} = -2X^T(y - X\beta) + 2\lambda\beta \tag{2}$$

10
Solution to ridge regression

• Setting the derivative to zero we finally get
$$\hat{\beta}^{\mathrm{ridge}}_{\lambda} = \left( X^T X + \lambda I_p \right)^{-1} X^T y \tag{3}$$

• Fitted values are then
$$\hat{y}^{\mathrm{ridge}} = X \hat{\beta}^{\mathrm{ridge}}_{\lambda} = X \left( X^T X + \lambda I_p \right)^{-1} X^T y$$

• For λ = 0 we recover the OLS solution
$$\hat{\beta}^{\mathrm{ridge}}_{\lambda=0} = \hat{\beta}^{\mathrm{ols}}, \qquad \hat{y}^{\mathrm{ols}} = X \left( X^T X \right)^{-1} X^T y$$
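A minimal NumPy sketch (not from the slides) of the closed-form solution (3); it reuses the X, y from the earlier OLS sketch and solves the linear system rather than inverting explicitly:

import numpy as np

def ridge_closed_form(X, y, lam):
    """Compute (X^T X + lam I_p)^{-1} X^T y without forming the inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# beta_ridge = ridge_closed_form(X, y, lam=1.0)
# y_fitted = X @ beta_ridge            # fitted values
# ridge_closed_form(X, y, lam=0.0)     # recovers OLS (when X^T X is invertible)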

11
Caveats

• The intercept (if present) should not be penalized. Two possible strategies:
  • center the design matrix X so there is no intercept
  • or remove the intercept from the penalty (penalize $\beta^{\star}$, i.e. β without $\beta_0$)
• The features should be on the same scale; unlike linear regression, ridge regression predictions are sensitive to feature rescaling
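A hedged scikit-learn sketch (not from the slides) addressing both caveats: features are standardized before the penalty is applied, and Ridge fits the intercept without penalizing it; parameter values are illustrative:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Standardize features so the penalty treats all coefficients comparably;
# fit_intercept=True (the default) keeps the intercept out of the penalty.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0, fit_intercept=True))
# model.fit(X, y)
# predictions = model.predict(new_X)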

12
Properties of ridge regression

• Unlike linear regression, there is always a solution (when λ > 0)


• $X^T X$ is positive semi-definite
• $X^T X + \lambda I_p$ is then positive definite when λ > 0, hence invertible
• It improves the conditioning of the problem
• Like linear regression but unlike the Lasso, it admits a closed-form solution
$$\hat{\beta}^{\mathrm{ridge}}_{\lambda} = \left( X^T X + \lambda I_p \right)^{-1} X^T y$$
• Invariant to rotation: if $Y = XU^T$ is a rotation of the samples then $\hat{\beta}_Y = U \hat{\beta}_X$
• Unlike linear regression, both the $\beta_i$ estimates and the predictions are biased
• The $\beta_i$ estimates are drawn toward zero w.r.t. the OLS solution
• They might have lower variance than OLS
13
$\hat{\beta}^{\mathrm{ridge}}_{\lambda}$ is biased

Suppose that the design matrix X is fixed (conditioning on it)

Linear case:
$$\begin{aligned}
E\left[\hat{\beta}^{\mathrm{ols}}\right] &= E\left[ \left(X^T X\right)^{-1} X^T y \right] \\
&= \left(X^T X\right)^{-1} X^T E[y] \\
&= \left(X^T X\right)^{-1} X^T X \beta \\
&= \beta
\end{aligned}$$
Unbiased!

Ridge case:
$$\begin{aligned}
E\left[\hat{\beta}^{\mathrm{ridge}}_{\lambda}\right] &= E\left[ \left(X^T X + \lambda I_p\right)^{-1} X^T y \right] \\
&= \left(X^T X + \lambda I_p\right)^{-1} X^T X \beta \\
&= \left( I_p + \lambda \left(X^T X\right)^{-1} \right)^{-1} \beta \neq \beta
\end{aligned}$$
Biased!
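As a sanity check (not from the slides), a small Monte Carlo sketch with a fixed design matrix: averaged over noise draws, the OLS estimates recover β while the ridge estimates are pulled toward zero; all values are made up:

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 50, 3, 10.0
X = rng.normal(size=(n, p))          # fixed design
beta = np.array([2.0, -1.0, 0.5])

ols_estimates, ridge_estimates = [], []
for _ in range(2000):
    y = X @ beta + rng.normal(size=n)
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

print(np.mean(ols_estimates, axis=0))    # close to beta: unbiased
print(np.mean(ridge_estimates, axis=0))  # shrunk toward zero: biased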
14
Data augmentation interpretation

• Let’s rewrite the PRSS

$$\begin{aligned}
\mathrm{PRSS}(\beta) &= \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2 \\
&= \sum_{i=1}^{n} (y_i - \langle x_i, \beta \rangle)^2 + \lambda \sum_{i=1}^{p} \beta_i^2 \\
&= \sum_{i=1}^{n} (y_i - \langle x_i, \beta \rangle)^2 + \sum_{i=1}^{p} \left( 0 - \langle \sqrt{\lambda}\, e_i, \beta \rangle \right)^2
\end{aligned}$$

• Same as adding p extra samples in addition to the n $x_i$'s
• The additional samples and observations are $(\sqrt{\lambda}\, e_i,\ 0)$ for i = 1, ..., p
• Same as adding, on each axis, a sample whose observation is zero

15
Data augmentation interpretation (cont’d)

• Switching back to matrix form we define
$$X_{\lambda} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_p \end{pmatrix}
\qquad \text{and} \qquad
y_0 = \begin{pmatrix} y \\ 0 \end{pmatrix}$$
• And the PRSS can now be written
$$\mathrm{PRSS}(\beta) = \lVert y_0 - X_{\lambda}\, \beta \rVert^2$$
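A quick NumPy check (not from the slides) that ordinary least squares on the augmented data $(X_{\lambda}, y_0)$ reproduces the ridge estimate; the data are synthetic:

import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 40, 4, 5.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Augmented design: X stacked on sqrt(lam) * I_p, observations padded with p zeros
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(beta_aug, beta_ridge))  # True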

16
Geometric interpretation of $\hat{\beta}^{\mathrm{ridge}}$

• Suppose the 2-dimensional case (p = 2), $\hat{\beta}^{\mathrm{ols}}$ is the OLS solution
• Ellipses are level lines of $\lVert y - X\beta \rVert^2$ (the RSS)
• A solution for some λ is at the intersection of the $L_2$ ball and a level line
• Whatever the shape of the ellipses, the ridge solution is systematically drawn toward zero

[Figure: level lines of $\lVert y - X\beta \rVert^2$ in the $(\beta_1, \beta_2)$ plane, with $\hat{\beta}^{\mathrm{ols}}$ and $\hat{\beta}^{\mathrm{ridge}}$; the ridge solution lies closer to the origin]

17
Geometric interpretation of $\hat{y}^{\mathrm{ols}}$

First see the OLS case


• Let $X = USV^T$ be the SVD of X
• The matrix U gathers the (unit) principal components $u_1, \ldots, u_k$
• Ordinary least squares orthogonally projects y onto the space spanned by the columns of X:
$$\hat{y}^{\mathrm{ols}} = X \left( X^T X \right)^{-1} X^T y = U U^T y = \sum_{i=1}^{p} \left( u_i^T y \right) u_i$$

[Figure: y projected orthogonally onto Span(X), giving $\hat{y}^{\mathrm{ols}} = X \hat{\beta}^{\mathrm{ols}}$]
18
Geometric interpretation of $\hat{y}^{\mathrm{ridge}}$ (cont'd)

• Ridge regression is doing the same thing plus an additional "shrinking" step
$$\hat{y}^{\mathrm{ridge}} = X \left( X^T X + \lambda I_p \right)^{-1} X^T y = U S \left( S^2 + \lambda I_p \right)^{-1} S\, U^T y = \sum_{i=1}^{p} \frac{\sigma_i^2}{\sigma_i^2 + \lambda} \left( u_i^T y \right) u_i$$

• Coordinates are now shrunk towards zero since $\frac{\sigma_i^2}{\sigma_i^2 + \lambda} < 1$
• Remember that the $u_i$ are here the (unit) principal components of X
• The smaller the variance of a principal component, the more it is shrunk
• A smooth version of PCA followed by linear regression
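A NumPy sketch (not from the slides) checking the shrinkage formula against the closed-form fitted values on synthetic data:

import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 3, 4.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
shrink = s**2 / (s**2 + lam)                  # shrinkage factor per component
y_hat_svd = U @ (shrink * (U.T @ y))
y_hat_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.allclose(y_hat_svd, y_hat_ridge))    # True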

19
Geometric interpretation of $\hat{y}^{\mathrm{ridge}}$ (cont'd)

• The coordinate of $\hat{y}^{\mathrm{ols}}$ along $u_1$ is shrunk by $\frac{\sigma_1^2}{\sigma_1^2 + \lambda}$
• The coordinate of $\hat{y}^{\mathrm{ols}}$ along $u_2$ is shrunk by $\frac{\sigma_2^2}{\sigma_2^2 + \lambda}$

[Figure: in Span(X), $\hat{y}^{\mathrm{ridge}}$ is obtained by shrinking the coordinates of $\hat{y}^{\mathrm{ols}}$ along $u_1$ and $u_2$]

20
Geometric interpretation of $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$

What is the link between the unknown parameter β and its ridge estimate $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$?

• Let $X = USV^T$ be the SVD of X
• Recall that V gathers the principal directions (new basis of representation)
• Looking at $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$ in the basis V, we can show that
$$V^T \hat{\beta}^{\mathrm{ridge}}_{\lambda} = \left( I_p + \lambda S^{-2} \right)^{-1} V^T \hat{\beta}^{\mathrm{ols}}$$

In the basis defined by V, β is shrunk

• If we look at the i-th coordinate
$$\left( V^T \hat{\beta}^{\mathrm{ridge}}_{\lambda} \right)_i = \frac{\sigma_i^2}{\lambda + \sigma_i^2} \left( V^T \hat{\beta}^{\mathrm{ols}} \right)_i$$

21
Geometric interpretation of $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$

$$\left( V^T \hat{\beta}^{\mathrm{ridge}}_{\lambda} \right)_i = \frac{\sigma_i^2}{\lambda + \sigma_i^2} \left( V^T \hat{\beta}^{\mathrm{ols}} \right)_i$$

• The coordinate of $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$ along $v_1$ is shrunk by $\frac{\sigma_1^2}{\sigma_1^2 + \lambda}$
• The coordinate of $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$ along $v_2$ is shrunk by $\frac{\sigma_2^2}{\sigma_2^2 + \lambda}$

[Figure: in the $(\beta_1, \beta_2)$ plane, the principal directions $v_1$, $v_2$ and the shrinkage of $\hat{\beta}^{\mathrm{ridge}}_{\lambda}$ along each direction]

22
Gradient descent interpretation

• General gradient descent update with step η
$$\theta^{k+1} = \theta^{k} - \eta \nabla L\left( \theta^{k} \right)$$
• Gradient descent update for OLS (L = RSS)
$$\beta^{k+1} = \beta^{k} - \eta \nabla \mathrm{RSS}\left( \beta^{k} \right) \tag{4}$$
• Gradient descent update for ridge regression
$$\beta^{k+1} = \beta^{k} - \eta \nabla \mathrm{PRSS}\left( \beta^{k} \right)$$
• Which can be written
$$\beta^{k+1} = (1 - 2\eta\lambda)\beta^{k} - \eta \nabla \mathrm{RSS}\left( \beta^{k} \right) \tag{5}$$

The update uses a shrunk version of $\beta^{k}$
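A minimal NumPy sketch (not from the slides) of update (5): gradient descent on the RSS with the extra (1 − 2ηλ) shrinkage factor; step size and iteration count are arbitrary choices:

import numpy as np

def ridge_gradient_descent(X, y, lam, eta=1e-4, n_iter=20000):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad_rss = -2.0 * X.T @ (y - X @ beta)     # gradient of the RSS
        beta = (1.0 - 2.0 * eta * lam) * beta - eta * grad_rss
    return beta

# For a small enough step size this approaches the closed-form solution:
# np.allclose(ridge_gradient_descent(X, y, lam=1.0),
#             np.linalg.solve(X.T @ X + 1.0 * np.eye(X.shape[1]), X.T @ y),
#             atol=1e-4)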


23
Effect on the βi ’s: the regularization path

The regularization path is the plot of the βi's against the regularization parameter λ.


Here for some centered dataset with 10 covariates:

• The βi's tend to the linear regression coefficients as λ goes to 0
• The βi's are shrunk as λ increases
• All the βi's might not be non-increasing, but ‖β‖ is
• The βi's are never exactly zero
• When regularization is too strong, we fit the constant function

[Figure: ridge regularization path, βi's against λ (log scale)]
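A hedged scikit-learn sketch (not from the slides) of computing such a regularization path on a synthetic dataset; the data and λ grid are illustrative:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)
lambdas = np.logspace(-2, 2, 50)
coefs = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])

plt.semilogx(lambdas, coefs)      # one curve per coefficient beta_i
plt.xlabel("lambda (alpha in scikit-learn)")
plt.ylabel("beta_i")
plt.show()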

24
Effect on the fitted line

In the case of simple linear regression (p = 1):

• Regularizing is like adding the point $(\sqrt{\lambda}, 0)$
• As λ goes to infinity, we fit a constant function

[Figure: fitted lines for the truth, OLS, and ridge with λ = 1, 10, 50 and λ → +∞]

25
Effect on fitted curve

Adding polynomial features (degree 15): $X_i, X_i^2, \ldots, X_i^{15}$

[Figure: degree-15 polynomial fits of the training points for the ground truth, linear regression (λ = 0), and ridge with λ = 2 and λ = 10]
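A hedged scikit-learn sketch (not from the slides) of this setting: degree-15 polynomial features with plain linear regression versus ridge; the data-generating function and parameters are made up:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.normal(size=30)

# Same degree-15 features, with and without the ridge penalty
ols_poly = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                         StandardScaler(), LinearRegression())
ridge_poly = make_pipeline(PolynomialFeatures(degree=15, include_bias=False),
                           StandardScaler(), Ridge(alpha=2.0))
ols_poly.fit(x, y)       # wiggly, overfits the training points
ridge_poly.fit(x, y)     # smoother fitted curve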

26
Effect on the βi ’s estimates: bias–variance tradeoff

The slope estimate β1 is biased but its variance is smaller

[Figure: estimated slopes over repeated samples for OLS and ridge (λ = 1000) compared to the truth]

27
Effect on predictions: bias–variance tradeoff

Predictions at x = 0 (for example) are biased but less spread out

[Figure: predictions at x = 0 over repeated samples for OLS and ridge (λ = 1000) compared to the truth]

28
Lasso regularization

Another famous regularization:

• We choose $R(\theta) = \lVert \theta \rVert_1 = \sum_{i=1}^{p} |\theta_i|$
• Penalizes large parameters: prevents the βi's from exploding
• Lasso regularization is then
$$\hat{\theta}^{\mathrm{lasso}}_{\lambda} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) + \lambda \lVert \theta \rVert_1 \quad \text{(Lasso regularization)}$$
• Lasso linear regression
$$\hat{\beta}^{\mathrm{lasso}}_{\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1 \quad \text{(Lasso regression)}$$

29
Sparsity promoting property

Compare the coefficients βi obtained with ridge and lasso regularization

[Figure: coefficients βi against λ (log scale) under ridge and lasso regularization]

30
Effect on the βi ’s: the regularization path

• The βi's tend to the linear regression coefficients as λ goes to 0
• The βi's are shrunk as λ increases
• All the βi's are shrunk to exactly zero at some point: sparsity promoting effect
• When regularization is too strong, we fit the constant function

[Figure: lasso regularization path, βi's against λ (log scale)]

31
Explaining the sparsity property

• $\hat{\beta}^{\mathrm{ols}}$ is the ordinary least squares solution

[Figure: level lines of the RSS with the $L_2$ ball (ridge) and the $L_1$ ball (lasso) in the $(\beta_1, \beta_2)$ plane]

The ridge solution can be anywhere on the $L_2$ ball; the lasso solution lies on an edge of the $L_1$ ball (sparse solution)

32
Geometric interpretation of $\hat{\beta}^{\mathrm{lasso}}_{\lambda}$

For simplicity, suppose that n = p and X is the identity matrix


• The OLS solution is $\hat{\beta}^{\mathrm{ols}} = y$
• Lasso regularization reads $\hat{\beta}^{\mathrm{lasso}}_{\lambda} = \arg\min_{\beta \in \mathbb{R}^p} \lVert y - \beta \rVert^2 + \lambda \lVert \beta \rVert_1$
• The problem separates over coordinates:
$$\left( \hat{\beta}^{\mathrm{lasso}}_{\lambda} \right)_i = \arg\min_{\beta_i \in \mathbb{R}} (y_i - \beta_i)^2 + \lambda |\beta_i|
= \begin{cases} \max(y_i - \lambda/2,\ 0) & \text{if } y_i > 0 \\ \min(y_i + \lambda/2,\ 0) & \text{if } y_i < 0 \end{cases} \quad \text{(soft thresholding)}$$

[Figure: the soft thresholding function $s_{\lambda/2}(y_i)$, equal to zero on $[-\lambda/2, \lambda/2]$]
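A small NumPy sketch (not from the slides) of the soft thresholding operator above:

import numpy as np

def soft_threshold(y, threshold):
    """Solve argmin_b (y - b)^2 + 2*threshold*|b| coordinate-wise."""
    return np.sign(y) * np.maximum(np.abs(y) - threshold, 0.0)

y = np.array([-3.0, -0.4, 0.0, 0.7, 2.5])
lam = 2.0
print(soft_threshold(y, lam / 2))   # [-2. -0.  0.  0.  1.5]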

33
Gradient descent interpretation

• Using the subgradient to differentiate $\lVert \cdot \rVert_1$
$$\nabla_{\beta} \lVert \beta \rVert_1 = \mathrm{sign}(\beta)$$
• (Sub)gradient descent update for lasso regression
$$\beta^{k+1} = \beta^{k} - \eta \nabla \mathrm{RSS}\left( \beta^{k} \right) - \eta\lambda\, \mathrm{sign}\left( \beta^{k} \right)$$
• The βi's are shrunk by a constant amount, regardless of their magnitude
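A NumPy sketch (not from the slides) of this subgradient update; the decaying step size is an extra assumption for stability, not something stated on the slide. In practice the iterates rarely land exactly on zero, which is one reason coordinate descent or proximal (soft thresholding based) methods are usually preferred for the lasso.

import numpy as np

def lasso_subgradient_descent(X, y, lam, eta=1e-3, n_iter=20000):
    beta = np.zeros(X.shape[1])
    for k in range(n_iter):
        step = eta / np.sqrt(k + 1)                 # decaying step size
        grad_rss = -2.0 * X.T @ (y - X @ beta)
        beta = beta - step * grad_rss - step * lam * np.sign(beta)
    return beta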

34
Lasso properties

• No closed-form solution
• Convex problem
• Feature selection ability
• Biased predictions and parameter estimates
• Might be unstable in the presence of highly correlated variables

35
Elastic net

Why elastic net?
• Ridge regression does not select variables (all the βi's are nonzero)
• Lasso regression does, but in an unstable way
• Small changes in X might lead to an entirely different set of selected predictors

[Figure: level lines of the RSS with the $L_1$ ball, $\hat{\beta}^{\mathrm{ols}}$ and $\hat{\beta}^{\mathrm{lasso}}$ in the $(\beta_1, \beta_2)$ plane]

36
Elastic net

Mixing the two strategies:

$$\hat{\theta}^{\mathrm{elastic}}_{\lambda,\alpha} = \arg\min_{\theta \in \Theta} \hat{L}(\theta) + \lambda \left( \alpha \lVert \theta \rVert_1 + (1 - \alpha) \lVert \theta \rVert_2^2 \right) \quad \text{(Elastic net regularization)}$$

• λ is the regularization parameter
• α controls the balance between the $L_1$ and $L_2$ regularization terms

[Figure: unit balls in the $(\beta_1, \beta_2)$ plane: (a) $L_2$: no selection, (b) $L_1$: unstable selection, (c) Elastic net: stable selection]
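A hedged scikit-learn sketch (not from the slides): ElasticNet exposes the same idea with its own parameterization, alpha for the overall strength and l1_ratio for the L1/L2 balance; the exact scaling of its objective differs from the slide's (λ, α), so the values below are only illustrative:

from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=1.0, l1_ratio=0.5)
# enet.fit(X, y)
# enet.coef_    # sparse but more stable selection than the lasso alone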

37
Elastic net

Explaining the stability of elastic net regularization

• Corners are still sharp: the elastic net still encourages sparsity
• The elastic net ball is also rounded (4 portions of large circles): stable when some variables are strongly correlated

[Figure: the elastic net ball in the $(\beta_1, \beta_2)$ plane with $\hat{\beta}^{\mathrm{ols}}$ and $\hat{\beta}^{\mathrm{elnet}}$]

38
Ridge/Lasso/Elastic net regression in Python and scikit-learn

• Import the linear models
from sklearn.linear_model import LinearRegression, Ridge, Lasso

• Instantiate (no parameter), fit and predict
lr = LinearRegression()
lr.fit(X, y)
res = lr.predict(new_X)

• Instantiate with the tuning parameter alpha
lr = Ridge(alpha=1.0)
lr.fit(X, y)
res = lr.predict(new_X)

39
References

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, New York, 2001.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Vol. 1. MIT Press, Cambridge, MA, 2016.

40
