Chapter 7: Regularization for Deep Learning
Deep Learning Textbook Study Group, SF
Safak Ozkan
April 15, 2017
Chapter 7: Regularization for Deep Learning
L2 Parameter Regularization
L1 Parameter Regularization
Norm Penalties and Constrained Optimization
Regularization and Under-Constrained Problems
Dataset Augmentation
Noise Robustness
Injecting Noise at Output Targets
Early Stopping
Semi-Supervised Learning
Multi-Task Learning
Parameter Tying and Parameter Sharing
Bagging and Other Ensemble Methods
Dropout
Adversarial Training
Tangent Distance, Manifold Tangent Classifier
Definition
Regularization is any modification we make to a
learning algorithm that is intended to reduce its
test error but NOT its training error.
E_train : training error
E_test : test error (a.k.a. generalization error)
L2 Regularization
(a.k.a. Weight decay, Tikhonov regularization, Ridge regression)
Regularization increases bias and reduces variance.
Regularized cost function (α is the regularization parameter; the first term is the unregularized cost, the second the regularization term):

\tilde{J}(w; X, y) = J(w; X, y) + \frac{\alpha}{2} w^\top w

Gradient descent update rule (the additional term \epsilon \alpha w decays the weights on every step):

w \leftarrow w - \epsilon \left( \nabla_w J(w; X, y) + \alpha w \right) = (1 - \epsilon \alpha)\, w - \epsilon \nabla_w J(w; X, y)
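A minimal numpy sketch of this update rule; grad_J and the quadratic example cost are placeholders, not from the slides:

import numpy as np

def sgd_weight_decay_step(w, grad_J, lr=0.1, alpha=0.01):
    """One gradient-descent step on the L2-regularized cost.

    Equivalent to w <- (1 - lr*alpha) * w - lr * grad_J(w):
    the weights are decayed, then moved along the unregularized gradient.
    """
    return w - lr * (grad_J(w) + alpha * w)

# Example: quadratic cost J(w) = 0.5 * ||w - 3||^2, so grad_J(w) = w - 3.
w = np.zeros(2)
for _ in range(200):
    w = sgd_weight_decay_step(w, lambda w: w - 3.0)
print(w)  # converges slightly below 3.0, pulled toward the origin by alpha*w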
L2 Regularization
Lagrangian Constrained Optimization
With Lagrange multiplier α, minimizing the regularized cost

\tilde{J}(\theta; X, y) = J(\theta; X, y) + \alpha \, \Omega(\theta)

is equivalent to optimizing J(\theta; X, y) such that \Omega(\theta) \le k.
L2 Regularization
Lagrangian Constrained Optimization
We typically don't set the constraint size k explicitly; we set α instead.
[Figure: the unregularized solution w*, the regularized solution w̃, and the constraint region; a large α corresponds to a small constraint region.]
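The constraint Ω(w) ≤ k can also be enforced explicitly by re-projecting the weights onto the norm ball after each gradient step; a minimal numpy sketch, where the cost and the value of k are illustrative:

import numpy as np

def project_to_l2_ball(w, k):
    """Project w onto the L2 ball {w : ||w||_2 <= k}."""
    norm = np.linalg.norm(w)
    return w if norm <= k else w * (k / norm)

# Projected gradient descent: unconstrained step, then re-projection.
w = np.zeros(2)
for _ in range(200):
    w = w - 0.1 * (w - 3.0)          # gradient step on J(w) = 0.5*||w - 3||^2
    w = project_to_l2_ball(w, k=2.0)  # enforce the constraint explicitly
print(w, np.linalg.norm(w))           # the norm stays at (or below) k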
L2 Regularization
2nd-degree Taylor approximation of J around the minimum w* of the unregularized problem:

\hat{J}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*)

At w = w^*, \nabla_w J(w^*) = 0. Adding the L2 penalty and solving gives

\tilde{w} = (H + \alpha I)^{-1} H \, w^*

Analysis through eigenvector decomposition H = Q \Lambda Q^\top: the component of w^* along the i-th eigendirection is rescaled by \frac{\lambda_i}{\lambda_i + \alpha}, so small eigen-directions are affected (shrunk) more than larger eigen-directions.
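A small numpy check of this rescaling; the Hessian and w* values are arbitrary examples:

import numpy as np

alpha = 1.0
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # example positive-definite Hessian
w_star = np.array([1.0, 1.0])        # illustrative unregularized optimum

# Regularized solution: (H + alpha*I)^{-1} H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Same solution in the eigenbasis: each component scaled by lam_i / (lam_i + alpha)
lam, Q = np.linalg.eigh(H)
scaled = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star
print(np.allclose(w_tilde, scaled))  # True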
L2 Regularization
Normal Equations for Linear Regression
Unregularized normal equations: w = (X^\top X)^{-1} X^\top y
With L2 regularization: w = (X^\top X + \alpha I)^{-1} X^\top y

Assume the input features are centered. Then X^\top X is proportional to the covariance of the input features, and X^\top y to the covariance of the input features with the target values.

L2 regularization adds α to the diagonal of X^\top X, causing the learning algorithm to perceive the input as having increased variance; components of w for features whose covariance with the target is low relative to this added variance shrink more than the other components.
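A minimal numpy sketch of the two closed-form solutions; the random data is purely illustrative:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # centered input features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
alpha = 1.0

# Unregularized normal equations: w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# L2-regularized (ridge): w = (X^T X + alpha*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))  # the ridge norm is smaller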
L1 Regularization
(a.k.a. LASSO)
Regularization term: \Omega(w) = \|w\|_1 = \sum_i |w_i|, so \tilde{J}(w; X, y) = J(w; X, y) + \alpha \|w\|_1

2nd-degree Taylor approximation of J around w*, assuming a diagonal Hessian H = \mathrm{diag}(H_{1,1}, \dots, H_{n,n}), gives the per-component solution

w_i = \mathrm{sign}(w_i^*) \max\left\{ |w_i^*| - \frac{\alpha}{H_{i,i}}, \, 0 \right\}

(Induces sparsity: any component with |w_i^*| \le \alpha / H_{i,i} is set exactly to zero.)
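Under this diagonal-Hessian approximation, the L1 solution is a soft-thresholding of w*; a small numpy sketch with illustrative values:

import numpy as np

def l1_solution(w_star, H_diag, alpha):
    """Per-component L1-regularized solution under a diagonal Hessian:
    w_i = sign(w*_i) * max(|w*_i| - alpha / H_ii, 0).
    Components with |w*_i| <= alpha / H_ii become exactly zero (sparsity)."""
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.05, -0.3, 2.0])
H_diag = np.array([1.0, 1.0, 1.0])
print(l1_solution(w_star, H_diag, alpha=0.1))  # [ 0.  -0.2  1.9]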
Under-Constrained Problems
E.g. Logistic Regression
Linearly non-separable data: well-behaved problem.
Linearly separable data: under-determined problem (the weight vector w will continue to increase in magnitude under a gradient-descent algorithm; weight decay stops this growth and makes the problem well-posed).
Data Augmentation
The best way to improve the generalization of a model is
to train it on more data.
Data Augmentation works particularly well for
Object Recognition tasks.
Injecting noise into the input works well for
Speech Recognition.
[Figure: an original input image and example augmentations: affine distortion, additive noise, elastic deformation, horizontal flip, random translation, hue shift.]
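A minimal sketch of such an augmentation pipeline, assuming torchvision is available; the specific transform parameters are illustrative:

from torchvision import transforms

# Random augmentations applied to each training image on the fly.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # horizontal flip
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),  # affine distortion / random translation
    transforms.ColorJitter(hue=0.1),                            # hue shift
    transforms.ToTensor(),
])
# e.g. datasets.CIFAR10(root="data", train=True, transform=augment)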
Noise Robustness
For some models, adding noise with infinitesimally small variance to the input is
equivalent to imposing a norm penalty on the weights.
Noise on weights: a stochastic implementation of
Bayesian inference (uncertainty in the weights is
represented by a probability distribution).
For each input example, apply noise \epsilon_W \sim \mathcal{N}(0, \eta I) to the weights.

Modified cost function: \tilde{J}_W = \mathbb{E}_{p(x, y, \epsilon_W)}\big[ (\hat{y}_{\epsilon_W}(x) - y)^2 \big]

For small \eta, this is equivalent to adding the regularization term \eta \, \mathbb{E}_{p(x, y)}\big[ \|\nabla_W \hat{y}(x)\|^2 \big], which pushes the model toward regions where small perturbations of the weights have little effect on the output.
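A minimal numpy sketch of the idea: perturb the weights with Gaussian noise before computing each example's loss (the linear model is purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
eta = 1e-3  # variance of the weight noise

def noisy_weight_loss(W, x, y):
    """Squared error with noise eps_W ~ N(0, eta*I) applied to the weights."""
    eps_W = rng.normal(scale=np.sqrt(eta), size=W.shape)
    y_hat = (W + eps_W) @ x          # prediction with perturbed weights
    return 0.5 * (y_hat - y) ** 2

W = rng.normal(size=5)
x = rng.normal(size=5)
print(noisy_weight_loss(W, x, y=1.0))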
Early Stopping
For a quadratic cost and a small learning rate, early stopping acts like L2 regularization with

\alpha \approx \frac{1}{\tau \epsilon}

where α is the regularization parameter, τ the number of parameter update steps, and ε the learning rate.
Early Stopping
Early stopping: terminate training while validation-set
performance is still better, i.e., before the validation loss starts to rise.
[Figure 7.3 from the textbook: learning curves showing the negative log-likelihood loss over time (epochs); the training-set loss keeps decreasing while the validation-set loss begins to rise again.]
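A minimal sketch of the usual patience-based early-stopping loop; train_epoch and validation_loss are placeholders for the model-specific steps:

def train_with_early_stopping(train_epoch, validation_loss,
                              max_epochs=250, patience=10):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()                 # one pass over the training set
        loss = validation_loss()      # current validation-set loss
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
            # in practice, also save a copy of the parameters here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch, best_loss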