Training Neural Networks:
Optimization
Intro to Deep Learning, Fall 2020
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
The training formulation
[Figure: training pairs plotted as output (y) vs. Input (X)]
• Given input-output pairs at a number of locations, estimate the entire function
Gradient descent
• Start with an initial function
• Adjust its value at all points to make the outputs closer to the required value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the training points
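For concreteness, here is a minimal NumPy sketch (not from the slides) of this batch view: every update uses the gradient of the loss over all training points at once. The least-squares loss, the toy data, and the learning rate are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(grad_loss, w0, lr=0.1, steps=100):
    """Full-batch gradient descent: every update uses the gradient of the
    total loss over ALL training points."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad_loss(w)   # one parameter update per full pass over the data
    return w

# Toy example (illustrative): least-squares fit of y ~ X @ w on the whole training set
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.5])
mse_grad = lambda w: 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
print(batch_gradient_descent(mse_grad, w0=[0.0, 0.0]))
```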
Effect of number of samples
• Problem with conventional gradient descent: we try to
simultaneously adjust the function at all training points
– We must process all training points before making a single
adjustment
– “Batch” update
Alternative: Incremental update
• Alternative: adjust the function at one training point at a time
– Keep adjustments small
– Eventually, when we have processed all the training points, we will have adjusted the entire function
• With greater overall adjustment than we would if we made a single “Batch” update
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– For all t = 1 … T
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
Incremental Updates
• The iterations can make multiple passes over the data
• A single pass through the entire training data is called an “epoch”
– An epoch over a training set with T samples results in T updates of parameters
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do: ← over multiple epochs
– For all t = 1 … T ← one epoch
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ ← one update
• Until the error has converged
Caveats: order of presentation
• If we loop through the samples in the same order, we may get cyclic behavior
• We must go through them randomly to get more convergent behavior
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For all t = 1 … T
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
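A minimal sketch of this per-sample loop for a simple linear least-squares model (an illustrative assumption; the slides describe the general layered-network case, and the data and learning rate below are made up for the example):

```python
import numpy as np

def sgd(X, d, lr=0.05, epochs=20, seed=0):
    """Per-sample stochastic gradient descent for a linear model:
    shuffle the data each epoch, then update once per (X_t, d_t) pair."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(X)):   # randomly permute the order each pass
            err = X[t] @ w - d[t]           # error on this single training instance
            w = w - lr * err * X[t]         # one update per instance
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([1.0, 2.0, 2.4, 3.1])
print(sgd(X, d))
```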
Story so far
• In any gradient descent optimization problem,
presenting training instances incrementally
can be more effective than presenting them
all at once
– Provided the training instances are presented in random order
– “Stochastic Gradient Descent”
• This also holds for training neural networks
Batch vs SGD
[Figure: convergence of Batch gradient descent vs. SGD]
• Batch gradient descent operates over T training instances to get a single update
• SGD gets T updates for the same computation
Caveats: learning rate
[Figure: optimal overall fit vs. individual instances — output (y) vs. Input (X)]
• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
– Correcting the function for individual instances will lead to never-ending, non-convergent updates
– We must shrink the learning rate with iterations to prevent this
• Correction for individual instances with the eventual minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) ← randomize input order
– For all t = 1 … T
• j = j + 1
• For every layer k: ← learning rate η_j reduces with j
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η_j ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
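The slides only require that η_j shrink as the update count j grows; one common concrete choice (an illustrative assumption, not prescribed here) is inverse-time decay:

```python
def decayed_lr(eta0, j, decay=1e-3):
    """Inverse-time decay: eta_j shrinks as the update count j grows, so late
    corrections for individual instances become negligible."""
    return eta0 / (1.0 + decay * j)

# e.g. inside the SGD loop above:  w = w - decayed_lr(0.1, j) * gradient
```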
SGD example
• A simpler problem: K-means
• Note: SGD converges slower
• Also note the rather large variation between runs
– Let’s try to understand these results…
Explaining the variance
• The blue curve is the function being approximated
• The red curve is the approximation by the model at a given W
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the W that minimizes the average divergence
Explaining the variance
• The sample estimate approximates the shaded area by the average length of the error lines at the sample points
• Variance: The spread between the different (sample-dependent) estimates is the variance
Explaining the variance
• The sample estimate approximates the shaded area with the average length of the lines
• This average length will change with the position of the samples
Explaining the variance
• Having more samples makes the estimate more
robust to changes in the position of samples
– The variance of the estimate is smaller
Explaining the variance
With only one sample
• Having very few samples makes the estimate swing wildly with the sample position
– Since our estimator learns the W to minimize this estimate, the learned W too can swing wildly
SGD vs batch
• SGD uses the gradient from only one sample at a time, and consequently has high variance
• But it also provides significantly quicker updates than batch gradient descent
• Is there a happy medium?
Alternative: Mini-batch update
• Alternative: adjust the function at a small, randomly chosen subset of points
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the training data
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T ← b is the mini-batch size
• j = j + 1
• For every layer k:
– ΔW_k = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t’}, d_{t’})ᵀ
• Update ← η_j is a shrinking step size
– For every layer k:
W_k = W_k − η_j ΔW_k
• Until the error has converged
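As an illustrative sketch of the same loop (again using a linear least-squares stand-in for the layered network; the batch size, learning-rate schedule, and data are assumptions):

```python
import numpy as np

def minibatch_sgd(X, d, b=2, eta0=0.1, epochs=50, seed=0):
    """Mini-batch SGD for a linear model: accumulate the gradient over b
    samples, then make one update with a shrinking step size eta_j."""
    rng = np.random.default_rng(seed)
    w, j = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        order = rng.permutation(len(X))              # randomly permute the data
        for start in range(0, len(X), b):
            j += 1
            idx = order[start:start + b]             # one mini-batch of (up to) b samples
            err = X[idx] @ w - d[idx]
            grad = X[idx].T @ err                    # gradient accumulated over the batch
            w = w - eta0 / (1.0 + 0.01 * j) * grad   # shrinking step size eta_j
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([1.0, 2.0, 2.4, 3.1])
print(minibatch_sgd(X, d))
```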
Mini Batches
[Figure: a mini-batch of training pairs (X_i, d_i)]
• Mini-batch updates compute and minimize a batch loss
• The expected value of the batch loss is also the expected divergence
SGD example
• Mini-batch performs comparably to batch
training on this simple problem
– But converges orders of magnitude faster
Training and minibatches
• In practice, training is usually performed using mini-batches
– The mini-batch size is a hyperparameter to be optimized
• Convergence depends on the learning rate
– Simple technique: fix the learning rate until the error plateaus, then reduce it by a fixed factor (e.g. 10), as sketched below
– Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
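A minimal sketch of the simple “reduce on plateau” rule mentioned above (the function name, patience, and tolerance are illustrative assumptions, not the lecture’s prescribed recipe):

```python
def reduce_on_plateau(lr, val_errors, factor=10.0, patience=3, tol=1e-4):
    """Cut the learning rate by `factor` once the validation error has stopped
    improving for `patience` consecutive epochs (a simple plateau test)."""
    if len(val_errors) > patience and min(val_errors[-patience:]) > min(val_errors[:-patience]) - tol:
        return lr / factor   # a fuller version would also reset its patience after cutting
    return lr

# e.g. after every epoch:  lr = reduce_on_plateau(lr, val_errors)
```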
Momentum and incremental updates
• The momentum method (where i indexes the update, and Loss is the SGD-instance or minibatch loss):
ΔW^(i) = β ΔW^(i−1) − η ∇_W Loss(W^(i−1))ᵀ
W^(i) = W^(i−1) + ΔW^(i)
• Incremental SGD and mini-batch gradients tend to have high variance
• Momentum smooths out the variations
– Smoother and faster convergence
Momentum: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– ∇_{W_k} Loss = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t’}, d_{t’})
• Update
– For every layer k:
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
W_k = W_k + ΔW_k
• Until the loss has converged
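A minimal sketch of one momentum update (illustrative only: `batch_gradient` is a placeholder for the mini-batch gradient ∇_W Loss, and β = 0.9 is a common but assumed choice):

```python
import numpy as np

def momentum_step(w, delta_w, grad, lr=0.01, beta=0.9):
    """One momentum update: the step is a decaying running sum of past
    (negative) gradients, so oscillating components largely cancel out."""
    delta_w = beta * delta_w - lr * grad   # smooth the step with the previous step
    w = w + delta_w                        # apply the accumulated step
    return w, delta_w

# e.g. once per mini-batch:  w, delta_w = momentum_step(w, delta_w, batch_gradient(w))
```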
Nesterov’s Accelerated Gradient
• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the
gradients
Nesterov’s Accelerated Gradient
• Nesterov’s method (where i indexes the update, and Loss is the SGD-instance or minibatch loss):
ΔW^(i) = β ΔW^(i−1) − η ∇_W Loss(W^(i−1) + β ΔW^(i−1))ᵀ
W^(i) = W^(i−1) + ΔW^(i)
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– W_k = W_k + β ΔW_k ← first extend the previous step
– ∇_{W_k} Loss = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t’}, d_{t’})
• Update
– For every layer k:
W_k = W_k − η_j (∇_{W_k} Loss)ᵀ
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
• Until the loss has converged
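A minimal sketch of one Nesterov update (illustrative only: `grad_fn` stands in for evaluating the mini-batch gradient at a given weight setting, and the lr/β values are assumptions):

```python
import numpy as np

def nesterov_step(w, delta_w, grad_fn, lr=0.01, beta=0.9):
    """One Nesterov update: extend the previous step, take the gradient at
    that look-ahead point, then add the two to get the final step."""
    w_lookahead = w + beta * delta_w        # first extend the previous step
    grad = grad_fn(w_lookahead)             # gradient at the resultant position
    w = w_lookahead - lr * grad             # final position for this iteration
    delta_w = beta * delta_w - lr * grad    # remember the combined step
    return w, delta_w

# e.g. once per mini-batch:  w, delta_w = nesterov_step(w, delta_w, batch_gradient)
```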
Still higher-order methods
• Momentum and Nesterov’s method improve convergence by normalizing the mean of the derivatives
• More recent methods take this one step further by also
considering their variance
– RMS Prop
– Adagrad
– AdaDelta
– ADAM: very popular in practice
– …
• All roughly equivalent in performance
Smoothing the trajectory
Step  X component  Y component
1         1            +2.5
2         1            −3
3         2            +2.5
4         1            −2
5         1.5           1.5
• Observation: Steps in “oscillatory” directions show large total movement
– In the example, total motion in the vertical direction is much greater than in the horizontal direction
– Can happen even when momentum or Nesterov’s method are used
• Improvement: Dampen step size in directions with high motion
– Second order term
Normalizing steps by second moment
• Modify the usual gradient-based update:
– Scale updates in every component in inverse proportion to the total movement of that component in the recent past
• According to their variation (not just their average)
• This will change the relative update sizes for the individual components
– In the above example it would scale down the Y component
– And scale up the X component (in comparison)
• We will see two popular methods that embody this principle…
RMS Prop
• Notation:
– Updates are made per parameter
– The derivative of the loss w.r.t. any individual parameter w is shown as ∂_w D
• D is the batch or minibatch loss, or the individual divergence, for batch/minibatch/SGD respectively
– The squared derivative is ∂²_w D = (∂_w D)²
• The short-hand notation represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared derivative. We will show this as E[∂²_w D]
• Modified update rule: We want to
– scale down updates with large mean squared derivatives
– scale up updates with small mean squared derivatives
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale update of the parameter by the inverse of the root mean
squared derivative
RMS Prop (updates are for each weight of each layer)
• Do:
– Randomly shuffle the inputs to change their order
– Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_0 = 0
– For all t = 1 : B : T (incrementing in blocks of B inputs)
• For all weights in all layers initialize (∂_w D)_k = 0
• For b = 0 : B − 1
– Compute
» Output Y(X_{t+b})
» Gradient dDiv(Y(X_{t+b}), d_{t+b}) / dw
» (∂_w D)_k += dDiv(Y(X_{t+b}), d_{t+b}) / dw
• Update: for all w ∈ {w_ij^(k)} ∀ i, j, k:
E[∂²_w D]_k = γ E[∂²_w D]_{k−1} + (1 − γ) (∂_w D)²_k
w_{k+1} = w_k − (η / √(E[∂²_w D]_k + ε)) (∂_w D)_k
(Typical value: γ ≈ 0.9)
• k = k + 1
• Until the loss has converged
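A minimal per-parameter sketch of one RMSProp update (illustrative only; the lr/γ/ε values are common defaults assumed here, and `batch_gradient` is a placeholder for the accumulated mini-batch gradient):

```python
import numpy as np

def rmsprop_step(w, grad, ms, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: keep a per-parameter running mean of squared
    derivatives and scale the step by its inverse root."""
    ms = gamma * ms + (1.0 - gamma) * grad ** 2   # running E[(dD/dw)^2], elementwise
    w = w - lr * grad / np.sqrt(ms + eps)         # normalize the step per parameter
    return w, ms

# e.g. once per mini-batch, with ms initialized to zeros:
# w, ms = rmsprop_step(w, batch_gradient(w), ms)
```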
ADAM: RMSprop with momentum
• RMS prop only considers a second-moment-normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
– It considers both the first and second moments
• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative
m_k = δ m_{k−1} + (1 − δ) (∂_w D)_k
v_k = γ v_{k−1} + (1 − γ) (∂²_w D)_k
m̂_k = m_k / (1 − δ^k),   v̂_k = v_k / (1 − γ^k)
w_{k+1} = w_k − (η / (√(v̂_k) + ε)) m̂_k
– The divisions by (1 − δ^k) and (1 − γ^k) correct for the zero initialization, so that the bias toward zero does not dominate m̂ and v̂ in early iterations
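A minimal per-parameter sketch of one ADAM update (illustrative only; the defaults below are the commonly used values, and `batch_gradient` is a placeholder for the mini-batch gradient):

```python
import numpy as np

def adam_step(w, grad, m, v, k, lr=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    """One ADAM update: running mean (m) and mean square (v) of the derivatives,
    bias-corrected, then an RMS-normalized momentum step."""
    m = delta * m + (1.0 - delta) * grad            # first-moment estimate
    v = gamma * v + (1.0 - gamma) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - delta ** k)                  # bias corrections: compensate for the
    v_hat = v / (1.0 - gamma ** k)                  # zero initialization of m and v
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# e.g. on the k-th update (k starting at 1), with m and v initialized to zeros:
# w, m, v = adam_step(w, batch_gradient(w), m, v, k)
```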
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But they come with other hyperparameters to be optimized
– Typical parameter values (the commonly used defaults): RMSProp: γ ≈ 0.9; ADAM: δ ≈ 0.9, γ ≈ 0.999, ε ≈ 10⁻⁸
Visualizing the optimizers: Beale’s Function
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Long Valley
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far
• Gradient descent can be sped up by incremental
updates
– Convergence is guaranteed under most conditions
• Learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each
observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more
efficient than SGD
• Convergence can be improved using smoothed updates
– RMSprop and more advanced techniques