Regularization
Master 2 course unit, AOS1
Fall 2020
S. Rousseau
General definition
What is regularization? From Goodfellow et al. 2016:
Regularization is any modification we make to a learning algorithm that is
intended to reduce its generalization error but not its training error.
• Adding a penalty term to a loss function
• Data augmentation
• Early stopping
• ...
Today’s course focuses on the first point.
What is learning?
Learning consists in minimizing a training objective
    f̂ = argmin_{f ∈ H} L̂(f)
• H is the set of admissible classification/regression functions
• L̂ is an empirical loss function (computed on a training set)
• f̂ is the learnt solution
Most of the time the set H is parametrized by a parameter θ ∈ Θ
    θ̂ = argmin_{θ ∈ Θ} L̂(θ)
Application to linear regression
• The set of admissible solutions is the set of linear functions
    H = {linear functions}
• Linear functions on ℝᵖ are parametrized by β ∈ ℝᵖ
    H = {x ↦ ⟨x, β⟩, x ∈ ℝᵖ}
• Training objective is the residual sum of squares (RSS)
  Empirical loss function based on the square loss: the prediction is ⟨xᵢ, β⟩, the observation is yᵢ
    L̂(β) = RSS(β) = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)²
Application to linear regression (2)
• The learning algorithm is then
    β̂^ols = argmin_{β ∈ ℝᵖ} ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)²    (ordinary least squares)
• Compact matrix formulation: X = (x₁ᵀ, …, xₙᵀ)ᵀ (one sample per row) and y = (y₁, …, yₙ)ᵀ
    β̂^ols = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖²
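As a minimal sketch (not from the slides), the compact matrix formulation can be solved directly with NumPy; the data and variable names below are synthetic and purely illustrative:

    import numpy as np

    # Synthetic data: n samples, p features (illustrative values).
    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))               # design matrix, one sample per row
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + 0.1 * rng.normal(size=n)

    # Least-squares solution of min_beta ||y - X beta||^2.
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_ols)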
Regularization
• Unregularized objective
    θ̂ = argmin_{θ ∈ Θ} L̂(θ)    (1)
• Regularized objective
    θ̂_λ^reg = argmin_{θ ∈ Θ} L̂(θ) + λ·R(θ)
• L̂(θ) is the training objective
• λ is the tuning parameter
• R(θ) is the regularization term
Regularization (2)
Regularized objective
    θ̂_λ^reg = argmin_{θ ∈ Θ} L̂(θ) + λ·R(θ)
• R(θ) penalizes some θ’s
• λ > 0 is the strength of the penalty
• λ = 0: no penalty, the unregularized solution
• λ → +∞: the solution tends to argmin_{θ ∈ Θ} R(θ)
• Some tradeoff has to be found between the two extreme cases
Ridge regularization
The simplest regularization we can think of:
• We choose R(θ) = ‖θ‖² = ∑_{i=1}^p θᵢ²
• Penalizes large parameters: prevents the βᵢ's from exploding
• Ridge regularization is then
    θ̂_λ^ridge = argmin_{θ ∈ Θ} L̂(θ) + λ‖θ‖²    (ridge regularization)
Also known as L2-regularization, Tikhonov regularization, or weight decay (in neural networks)
Application to ridge regression
• The previous linear regression learning algorithm was
    β̂^ols = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖²
• Adding the ridge regularizer term yields
    β̂_λ^ridge = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖²    (ridge regression)
              = argmin_{β ∈ ℝᵖ} ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + λ ∑_{i=1}^p βᵢ²
Solution to ridge regression
    β̂_λ^ridge = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖²    (ridge regression)
• Define the penalized residual sum of squares (PRSS) as
    PRSS(β) = ‖y − Xβ‖² + λ‖β‖²
• PRSS is (strictly) convex w.r.t. β: unique solution
• By differentiating w.r.t. β we get
    ∇_β PRSS = −2Xᵀ(y − Xβ) + 2λβ    (2)
Solution to ridge regression
• Setting the derivative to zero we finally get
    β̂_λ^ridge = (XᵀX + λI_p)⁻¹ Xᵀy    (3)
• Fitted values are then
    ŷ^ridge = Xβ̂_λ^ridge = X(XᵀX + λI_p)⁻¹ Xᵀy
• For λ = 0 we have the OLS solution
    β̂_{λ=0}^ridge = β̂^ols
    ŷ^ols = X(XᵀX)⁻¹ Xᵀy
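A minimal NumPy sketch of the closed-form solution (3), again on synthetic, illustrative data; a linear solve is used instead of explicitly inverting XᵀX + λI_p:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 50, 3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    lam = 1.0                                  # regularization strength lambda
    # Ridge closed form: beta = (X^T X + lam * I_p)^{-1} X^T y.
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    y_hat_ridge = X @ beta_ridge               # fitted values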
Caveats
• The intercept (if present) should not be included in the regularization term. Two possible strategies:
  • center the design matrix X so that there is no intercept
  • or remove the intercept from the penalized parameters (define β⋆ as β without β₀)
• The features should be on the same scale; unlike linear regression, ridge regression predictions are sensitive to feature rescaling
Properties of ridge regression
• Unlike linear regression, there is always a solution (when λ > 0)
• XᵀX is positive semi-definite
• XᵀX + λI_p is then positive definite when λ > 0, hence invertible
• It improves the conditioning of the problem
• Like linear regression but unlike Lasso, it admits a closed-form solution
    β̂_λ^ridge = (XᵀX + λI_p)⁻¹ Xᵀy
• Invariant to rotation: if Y = XUᵀ is a rotation of the samples, then β̂_Y = Uβ̂_X
• Unlike linear regression, both the βᵢ estimates and the predictions are biased
• The βᵢ estimates are drawn toward zero w.r.t. the OLS solution
• Might have lower variance than OLS
β̂_λ^ridge is biased
Suppose that the design matrix X is fixed (conditioning on it)
Linear (OLS) case (using E[y] = Xβ):

    E[β̂^ols] = E[(XᵀX)⁻¹ Xᵀy]
             = (XᵀX)⁻¹ Xᵀ E[y]
             = (XᵀX)⁻¹ XᵀX β
             = β

    Unbiased!

Ridge case:

    E[β̂_λ^ridge] = E[(XᵀX + λI_p)⁻¹ Xᵀy]
                 = (XᵀX + λI_p)⁻¹ Xᵀ E[y]
                 = (XᵀX + λI_p)⁻¹ XᵀX β
                 = (I_p + λ(XᵀX)⁻¹)⁻¹ β ≠ β

    Biased!
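A small Monte Carlo sketch (not from the slides) illustrating these computations: with X fixed, the OLS estimates average out to β while the ridge estimates are pulled toward zero. The data-generating choices are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 100, 3, 10.0
    X = rng.normal(size=(n, p))                # fixed design matrix
    beta = np.array([1.0, -2.0, 0.5])          # true parameter

    ols_estimates, ridge_estimates = [], []
    for _ in range(2000):
        y = X @ beta + rng.normal(size=n)      # new noise realization, same X
        ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
        ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

    # The OLS average is close to beta; the ridge average is pulled toward zero.
    print("mean OLS  :", np.mean(ols_estimates, axis=0))
    print("mean ridge:", np.mean(ridge_estimates, axis=0))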
Data augmentation interpretation
• Let’s rewrite the PRSS
    PRSS(β) = ‖y − Xβ‖² + λ‖β‖²
            = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + λ ∑_{i=1}^p βᵢ²
            = ∑_{i=1}^n (yᵢ − ⟨xᵢ, β⟩)² + ∑_{i=1}^p (0 − ⟨√λ eᵢ, β⟩)²
• Same as adding p extra samples in addition to the n xᵢ's
• The additional samples and observations are (√λ eᵢ, 0) for i = 1, …, p
• Same as adding, on each axis, a sample whose observation is zero
Data augmentation interpretation (cont’d)
• Switching back to matrix form we define
the augmented design matrix X_λ (the matrix X with the p rows of √λ I_p stacked below it) and
the augmented observation vector y′ = (y₁, …, yₙ, 0, …, 0)ᵀ
• And the PRSS can now be written
    PRSS(β) = ‖y′ − X_λ β‖²
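A small sketch (not from the slides) checking this interpretation numerically: ordinary least squares on the augmented pair (X_λ, y′) recovers the ridge solution. The data are synthetic and the names illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 50, 3, 5.0
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Augmented data: p extra rows sqrt(lam) * e_i with zero observations.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros(p)])

    beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)   # OLS on augmented data
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(beta_aug, beta_ridge))                   # True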
Geometric interpretation of β̂^ridge
• Suppose the 2-dimensional case (p = 2); β̂^ols is the OLS solution
• The ellipses are level lines of ‖y − Xβ‖²
• A solution for some λ is at the intersection of the L2 ball and a level line
• Whatever the shape of the ellipses, the ridge solution is systematically drawn toward zero
[Figure: level lines of ‖y − Xβ‖² in the (β₁, β₂) plane; β̂^ridge lies on the L2 ball, closer to the origin than β̂^ols.]
Geometric interpretation of ŷ^ols
Let us first look at the OLS case
• Let X = USVᵀ be the SVD of X
• The matrix U gathers the (unit) principal components u₁, …, uₖ
• Ordinary least squares orthogonally projects y onto the space spanned by the
columns of X :
    ŷ^ols = X(XᵀX)⁻¹ Xᵀy = UUᵀy = ∑_{i=1}^p (uᵢᵀy) uᵢ
[Figure: y is orthogonally projected onto Span(X), giving ŷ^ols = Xβ̂^ols.]
Geometric interpretation of ŷ^ridge (cont’d)
• Ridge regression is doing the same thing plus an additional “shrinking” step
    ŷ^ridge = X(XᵀX + λI_p)⁻¹ Xᵀy = U S(S² + λI_p)⁻¹S Uᵀy = ∑_{i=1}^p σᵢ²/(σᵢ² + λ) (uᵢᵀy) uᵢ
• The coordinates are now shrunk toward zero since σᵢ²/(σᵢ² + λ) < 1
• Remember that the uᵢ are here the (unit) principal components of X
• The smaller the variance of a principal component, the more it is shrunk
• A smooth version of PCA followed by linear regression
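A short sketch (not from the slides) verifying the SVD expression against the direct ridge formula on synthetic data:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam = 50, 3, 2.0
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(s) Vt
    shrink = s**2 / (s**2 + lam)                       # shrinkage factor per component

    y_hat_svd = U @ (shrink * (U.T @ y))               # shrink each coordinate u_i^T y
    y_hat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(y_hat_svd, y_hat_direct))        # True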
Geometric interpretation of ŷ^ridge (cont’d)
• The coordinate of ŷ^ols along u₁ is shrunk by σ₁²/(σ₁² + λ)
• The coordinate of ŷ^ols along u₂ is shrunk by σ₂²/(σ₂² + λ)
[Figure: in Span(X), ŷ^ridge is obtained from ŷ^ols by shrinking its coordinates along u₁ and u₂.]
Geometric interpretation of β̂_λ^ridge
What is the link between the unknown parameter β and its ridge estimate β̂_λ^ridge?
• Let X = USVᵀ be the SVD of X
• Recall that V gathers the principal directions (new basis of representation)
• Let's look at β̂_λ^ridge in the basis V; we can show that
    Vᵀβ̂_λ^ridge = (I_p + λS⁻²)⁻¹ Vᵀβ̂^ols
  In the basis defined by V, the OLS estimate is shrunk
• If we look at the i-th coordinate
    (Vᵀβ̂_λ^ridge)ᵢ = σᵢ²/(λ + σᵢ²) · (Vᵀβ̂^ols)ᵢ
Geometric interpretation of β̂_λ^ridge
    (Vᵀβ̂_λ^ridge)ᵢ = σᵢ²/(λ + σᵢ²) · (Vᵀβ̂^ols)ᵢ
• The coordinate of β̂_λ^ridge along v₁ is shrunk by σ₁²/(σ₁² + λ)
• The coordinate of β̂_λ^ridge along v₂ is shrunk by σ₂²/(σ₂² + λ)
[Figure: the (β₁, β₂) plane with the principal directions v₁ and v₂, along which the coordinates are shrunk.]
Gradient descent interpretation
• General gradient descent update with step η
    θ^{k+1} = θ^k − η ∇L(θ^k)
• Gradient descent update for OLS (L = RSS)
    β^{k+1} = β^k − η ∇RSS(β^k)    (4)
• Gradient descent update for ridge regression
    β^{k+1} = β^k − η ∇PRSS(β^k)
• Which writes
    β^{k+1} = (1 − 2ηλ)β^k − η ∇RSS(β^k)    (5)
The update uses a shrunk version of β^k
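A minimal sketch (not from the slides) of update (5) on synthetic data, showing that the iterates converge to the ridge closed-form solution; the step size η is an illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam, eta = 50, 3, 1.0, 1e-3
    X = rng.normal(size=(n, p))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

    beta = np.zeros(p)
    for _ in range(5000):
        grad_rss = -2 * X.T @ (y - X @ beta)           # gradient of the RSS
        beta = (1 - 2 * eta * lam) * beta - eta * grad_rss

    # The iterates converge to the ridge closed-form solution.
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    print(np.allclose(beta, beta_ridge))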
Effect on the βi ’s: the regularization path
The regularization path is the plot of the βᵢ's against the regularization parameter λ.
Here for some centered dataset with 10 covariates:
• The βᵢ's tend to the linear regression coefficients as λ goes to 0
• When regularization is too strong, we fit the constant function
• The βᵢ's are shrunk as λ increases
• The βᵢ's might not all be non-increasing, but ‖β‖ is
• The βᵢ's are never exactly zero
[Figure: ridge regularization path, the βᵢ's against λ on a log scale (λ from 10⁻² to 10²).]
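A hedged sketch (not from the slides) of how such a path can be computed with scikit-learn by refitting Ridge over a grid of alpha values (scikit-learn's name for λ); data and plotting details are illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n, p = 100, 10
    X = rng.normal(size=(n, p))
    y = X @ rng.normal(scale=100, size=p) + rng.normal(size=n)

    # Refit ridge regression for a grid of alpha values (alpha plays the role of lambda).
    alphas = np.logspace(-2, 2, 50)
    coefs = [Ridge(alpha=a, fit_intercept=False).fit(X, y).coef_ for a in alphas]

    plt.semilogx(alphas, coefs)
    plt.xlabel("lambda")
    plt.ylabel("beta_i")
    plt.show()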
Effect on the fitted line
In the case of simple linear regression (p = 1)
• Regularizing is like adding the point (√λ, 0)
• As λ goes to infinity, we fit a constant function
[Figure: fitted lines for the truth, OLS, and ridge with λ = 1, 10, 50, and λ → +∞.]
Effect on fitted curve
Adding polynomial features (degree 15): Xᵢ, Xᵢ², …, Xᵢ¹⁵
[Figure: fitted curves with polynomial features for linear regression (λ = 0) and ridge with λ = 2 and λ = 10, together with the ground truth and the training points.]
Effect on the βᵢ estimates: bias–variance tradeoff
The slope estimate β̂₁ is biased but its variance is smaller
[Figure: distributions of the fitted slopes for OLS and ridge (λ = 1000), compared with the truth.]
Effect on predictions: bias–variance tradeoff
Predictions at 0 (for example) are biased but less spread out
[Figure: distributions of the predictions at 0 for OLS and ridge (λ = 1000), compared with the truth.]
Lasso regularization
Another famous regularization:
• We choose R(θ) = ‖θ‖₁ = ∑_{i=1}^p |θᵢ|
• Penalizes large parameters: prevents the βᵢ's from exploding
• Lasso regularization is then
    θ̂_λ^lasso = argmin_{θ ∈ Θ} L̂(θ) + λ‖θ‖₁    (Lasso regularization)
• Lasso linear regression
    β̂_λ^lasso = argmin_{β ∈ ℝᵖ} ‖y − Xβ‖² + λ‖β‖₁    (Lasso regression)
Sparsity promoting property
Compare the coefficients βi ’s with ridge and lasso regularization
[Figure: regularization paths of the βᵢ's for ridge (left) and lasso (right), against λ on a log scale (λ from 10⁻³ to 10³).]
Effect on the βi ’s: the regularization path
• The βᵢ's tend to the linear regression coefficients as λ goes to 0
• The βᵢ's are shrunk as λ increases
• All the βᵢ's are shrunk to exactly zero at some point: sparsity-promoting effect
• When regularization is too strong, we fit the constant function
[Figure: lasso regularization path, the βᵢ's against λ on a log scale (λ from 10⁻² to 10²).]
Explaining the sparsity property
• β̂^ols is the ordinary least squares solution
[Figure: level lines of the RSS in the (β₁, β₂) plane, with the L2 ball and β̂^ridge (left) and the L1 ball and β̂^lasso (right).]
Ridge: the solution can be anywhere on the L2 ball. Lasso: the solution lies on an edge of the L1 ball (sparse solution).
Geometric interpretation of β̂_λ^lasso
For simplicity, suppose that n = p and X is the identity matrix
• The OLS solution is β̂^ols = y
• Lasso regression reads
    β̂_λ^lasso = argmin_{β ∈ ℝᵖ} ‖y − β‖² + λ‖β‖₁
• The problem separates coordinate-wise
    (β̂_λ^lasso)ᵢ = argmin_{βᵢ ∈ ℝ} (yᵢ − βᵢ)² + λ|βᵢ|
    (β̂_λ^lasso)ᵢ = s_{λ/2}(yᵢ) = max(yᵢ − λ/2, 0) if yᵢ > 0, min(yᵢ + λ/2, 0) if yᵢ < 0    (soft thresholding)
[Figure: the soft-thresholding function s_{λ/2}(yᵢ): zero on [−λ/2, λ/2], shifted identity outside.]
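A minimal sketch (not from the slides) of the soft-thresholding operator, written in the equivalent form sign(y)·max(|y| − λ/2, 0):

    import numpy as np

    def soft_threshold(y, t):
        """Soft thresholding s_t(y) = sign(y) * max(|y| - t, 0)."""
        return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

    lam = 1.0
    y = np.array([-3.0, -0.2, 0.0, 0.4, 2.5])
    print(soft_threshold(y, lam / 2))   # small entries are set exactly to zero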
Gradient descent interpretation
• Using a subgradient to differentiate ‖·‖₁:
    ∇_β ‖β‖₁ = sign(β)
• Gradient descent update for lasso regression
    β^{k+1} = β^k − η ∇RSS(β^k) − ηλ sign(β^k)
• Shrinks the βᵢ's regardless of their magnitude
Lasso properties
• No closed-form solution
• Convex problem
• Feature selection ability
• Biased predictions and parameter estimates
• Might be unstable in the presence of highly correlated variables
Elastic net
Why elastic-net?
• Ridge regression is not selecting variables (all the βᵢ's are nonzero)
• Lasso regression does select variables, but in an unstable way
• Small changes in X might lead to an entirely different set of selected predictors
[Figure: the L1 ball in the (β₁, β₂) plane with β̂^ols and β̂^lasso.]
Elastic net
Mixing the two strategies
    θ̂_{λ,α}^elastic = argmin_{θ ∈ Θ} L̂(θ) + λ(α‖θ‖₁ + (1 − α)‖θ‖₂²)    (Elastic net regularization)
• λ is the regularizing parameter
• α controls the balance between L1 and L2 regularizing terms
[Figure: (a) the L2 ball: no selection; (b) the L1 ball: unstable selection; (c) the elastic net ball: stable selection.]
Elastic net
Explaining the stability of elastic net regularization
• Corners are still sharp: the elastic net is still encouraging sparsity
• The elastic net ball is also round (4 portions of (big) circles): stable when some variables are strongly correlated
[Figure: the elastic net ball in the (β₁, β₂) plane with β̂^ols and β̂^elnet.]
Ridge/Lasso/Elastic net regression in Python and Scikit–Learn
• Import the linear model estimators
    from sklearn.linear_model import LinearRegression, Ridge, Lasso
• Instantiate (no parameter), fit and predict
    lr = LinearRegression()
    lr.fit(X, y)
    res = lr.predict(new_X)
• Instantiate with the tuning parameter alpha, fit and predict
    lr = Ridge(alpha=1.0)
    lr.fit(X, y)
    res = lr.predict(new_X)
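For completeness, a sketch (not on the original slide) of the Lasso and ElasticNet estimators used the same way; the synthetic data only make the snippet self-contained, and l1_ratio plays a role analogous to α above (scikit-learn's exact penalty parametrization differs slightly from the slide's formula):

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    # Synthetic data, only to make the snippet self-contained.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100)
    new_X = rng.normal(size=(5, 10))

    # Lasso: alpha plays the role of lambda.
    lasso = Lasso(alpha=1.0).fit(X, y)
    res_lasso = lasso.predict(new_X)

    # Elastic net: alpha is the overall strength, l1_ratio balances the L1 and L2 terms.
    enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
    res_enet = enet.predict(new_X)
    print(lasso.coef_)
    print(enet.coef_)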
References
Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer, 2001.
Ian Goodfellow et al. Deep Learning. Vol. 1. Cambridge: MIT Press, 2016.