Training Neural Networks:
Optimization
Intro to Deep Learning, Fall 2020
Recap
• Neural networks are universal approximators
• We must train them to approximate any
function
• Networks are trained to minimize total “error”
on a training set
– We do so through empirical risk minimization
• We use variants of gradient descent to do so
– Gradients are computed through backpropagation
The training formulation
[Figure: training pairs plotted as output (y) vs. Input (X)]
• Given input-output pairs at a number of locations, estimate the entire function
Gradient descent
• Start with an initial function
• Adjust its value at all points to make the outputs closer to the required value
– Gradient descent adjusts parameters to adjust the function value at all points
– Repeat this iteratively until we get arbitrarily close to the target function at the training points
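For concreteness, here is a minimal NumPy sketch (not from the slides) of this batch view: every update uses the gradient of the loss over all training points at once. The least-squares loss, the toy data, and the learning rate are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(grad_loss, w0, lr=0.1, steps=100):
    """Full-batch gradient descent: every update uses the gradient of the
    total loss over ALL training points."""
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad_loss(w)   # one parameter update per full pass over the data
    return w

# Toy example (illustrative): least-squares fit of y ~ X @ w on the whole training set
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0, 2.5])
mse_grad = lambda w: 2.0 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
print(batch_gradient_descent(mse_grad, w0=[0.0, 0.0]))
```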
Effect of number of samples
• Problem with conventional gradient descent: we try to
simultaneously adjust the function at all training points
– We must process all training points before making a single
adjustment
– “Batch” update
Alternative: Incremental update
• Alternative: adjust the function at one training point at a time
– Keep adjustments small
– Eventually, when we have processed all the training points, we will have adjusted the entire function
• With greater overall adjustment than we would if we made a single “Batch” update
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– For all t = 1 … T
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
Incremental Updates
• The iterations can make multiple passes over the data
• A single pass through the entire training data is called an “epoch”
– An epoch over a training set with T samples results in T updates of parameters
Incremental Update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do: ← over multiple epochs
– For all t = 1 … T ← one epoch
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ ← one update
• Until the error has converged
Caveats: order of presentation
• If we loop through the samples in the same order, we may get cyclic behavior
• We must go through them randomly to get more convergent behavior
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For all t = 1 … T
• For every layer k:
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
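A minimal sketch of this per-sample loop for a simple linear least-squares model (an illustrative assumption; the slides describe the general layered-network case, and the data and learning rate below are made up for the example):

```python
import numpy as np

def sgd(X, d, lr=0.05, epochs=20, seed=0):
    """Per-sample stochastic gradient descent for a linear model:
    shuffle the data each epoch, then update once per (X_t, d_t) pair."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(X)):   # randomly permute the order each pass
            err = X[t] @ w - d[t]           # error on this single training instance
            w = w - lr * err * X[t]         # one update per instance
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([1.0, 2.0, 2.4, 3.1])
print(sgd(X, d))
```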
Story so far
• In any gradient descent optimization problem,
presenting training instances incrementally
can be more effective than presenting them
all at once
– Provided the training instances are presented in random order
– “Stochastic Gradient Descent”
• This also holds for training neural networks
Batch vs SGD
[Figure: convergence of Batch gradient descent vs. SGD]
• Batch gradient descent operates over T training instances to get a single update
• SGD gets T updates for the same computation
Caveats: learning rate
[Figure: optimal overall fit vs. individual instances — output (y) vs. Input (X)]
• Except in the case of a perfect fit, even an optimal overall fit will look incorrect to individual instances
– Correcting the function for individual instances will lead to never-ending, non-convergent updates
– We must shrink the learning rate with iterations to prevent this
• Correction for individual instances with the eventual minuscule learning rates will not modify the function
Incremental Update: Stochastic Gradient Descent
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T) ← randomize input order
– For all t = 1 … T
• j = j + 1
• For every layer k: ← learning rate η_j reduces with j
– Compute ∇_{W_k} Div(Y_t, d_t)
– Update W_k ← W_k − η_j ∇_{W_k} Div(Y_t, d_t)ᵀ
• Until the error has converged
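The slides only require that η_j shrink as the update count j grows; one common concrete choice (an illustrative assumption, not prescribed here) is inverse-time decay:

```python
def decayed_lr(eta0, j, decay=1e-3):
    """Inverse-time decay: eta_j shrinks as the update count j grows, so late
    corrections for individual instances become negligible."""
    return eta0 / (1.0 + decay * j)

# e.g. inside the SGD loop above:  w = w - decayed_lr(0.1, j) * gradient
```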
SGD example
• A simpler problem: K-means
• Note: SGD converges slower
• Also note the rather large variation between runs
– Let’s try to understand these results…
Explaining the variance
• The blue curve is the function being approximated
• The red curve is the approximation by the model at a given W
• The heights of the shaded regions represent the point-by-point error
– The divergence is a function of the error
– We want to find the W that minimizes the average divergence
Explaining the variance
• The sample estimate approximates the shaded area by the average length of the error lines at the sample points
• Variance: The spread between the different (sample-dependent) estimates is the variance
Explaining the variance
• The sample estimate approximates the shaded area with the average length of the lines
• This average length will change with the position of the samples
Explaining the variance
• Having more samples makes the estimate more
robust to changes in the position of samples
– The variance of the estimate is smaller
Explaining the variance
With only one sample
• Having very few samples makes the estimate swing wildly with the sample position
– Since our estimator learns the W to minimize this estimate, the learned W too can swing wildly
SGD vs batch
• SGD uses the gradient from only one sample at a time, and consequently has high variance
• But it also provides significantly quicker updates than batch gradient descent
• Is there a happy medium?
Alternative: Mini-batch update
• Alternative: adjust the function at a small, randomly chosen subset of points
– Keep adjustments small
– If the subsets cover the training set, we will have adjusted the entire function
• As before, vary the subsets randomly in different passes through the training data
Incremental Update: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T ← b is the mini-batch size
• j = j + 1
• For every layer k:
– ΔW_k = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ΔW_k = ΔW_k + ∇_{W_k} Div(Y_{t’}, d_{t’})ᵀ
• Update ← η_j is a shrinking step size
– For every layer k:
W_k = W_k − η_j ΔW_k
• Until the error has converged
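As an illustrative sketch of the same loop (again using a linear least-squares stand-in for the layered network; the batch size, learning-rate schedule, and data are assumptions):

```python
import numpy as np

def minibatch_sgd(X, d, b=2, eta0=0.1, epochs=50, seed=0):
    """Mini-batch SGD for a linear model: accumulate the gradient over b
    samples, then make one update with a shrinking step size eta_j."""
    rng = np.random.default_rng(seed)
    w, j = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        order = rng.permutation(len(X))              # randomly permute the data
        for start in range(0, len(X), b):
            j += 1
            idx = order[start:start + b]             # one mini-batch of (up to) b samples
            err = X[idx] @ w - d[idx]
            grad = X[idx].T @ err                    # gradient accumulated over the batch
            w = w - eta0 / (1.0 + 0.01 * j) * grad   # shrinking step size eta_j
    return w

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([1.0, 2.0, 2.4, 3.1])
print(minibatch_sgd(X, d))
```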
Mini Batches
[Figure: a mini-batch of training pairs (X_i, d_i)]
• Mini-batch updates compute and minimize a batch loss
• The expected value of the batch loss is also the expected divergence
SGD example
• Mini-batch performs comparably to batch
training on this simple problem
– But converges orders of magnitude faster
Training and minibatches
• In practice, training is usually performed using mini-batches
– The mini-batch size is a hyperparameter to be optimized
• Convergence depends on the learning rate
– Simple technique: fix the learning rate until the error plateaus, then reduce it by a fixed factor (e.g. 10), as sketched below
– Advanced methods: adaptive updates, where the learning rate is itself determined as part of the estimation
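A minimal sketch of the simple “reduce on plateau” rule mentioned above (the function name, patience, and tolerance are illustrative assumptions, not the lecture’s prescribed recipe):

```python
def reduce_on_plateau(lr, val_errors, factor=10.0, patience=3, tol=1e-4):
    """Cut the learning rate by `factor` once the validation error has stopped
    improving for `patience` consecutive epochs (a simple plateau test)."""
    if len(val_errors) > patience and min(val_errors[-patience:]) > min(val_errors[:-patience]) - tol:
        return lr / factor   # a fuller version would also reset its patience after cutting
    return lr

# e.g. after every epoch:  lr = reduce_on_plateau(lr, val_errors)
```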
Momentum and incremental updates
• The momentum method (where i indexes the update, and Loss is the SGD-instance or minibatch loss):
ΔW^(i) = β ΔW^(i−1) − η ∇_W Loss(W^(i−1))ᵀ
W^(i) = W^(i−1) + ΔW^(i)
• Incremental SGD and mini-batch gradients tend to have high variance
• Momentum smooths out the variations
– Smoother and faster convergence
Momentum: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– ∇_{W_k} Loss = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t’}, d_{t’})
• Update
– For every layer k:
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
W_k = W_k + ΔW_k
• Until the loss has converged
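A minimal sketch of one momentum update (illustrative only: `batch_gradient` is a placeholder for the mini-batch gradient ∇_W Loss, and β = 0.9 is a common but assumed choice):

```python
import numpy as np

def momentum_step(w, delta_w, grad, lr=0.01, beta=0.9):
    """One momentum update: the step is a decaying running sum of past
    (negative) gradients, so oscillating components largely cancel out."""
    delta_w = beta * delta_w - lr * grad   # smooth the step with the previous step
    w = w + delta_w                        # apply the accumulated step
    return w, delta_w

# e.g. once per mini-batch:  w, delta_w = momentum_step(w, delta_w, batch_gradient(w))
```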
Nesterov’s Accelerated Gradient
• At any iteration, to compute the current step:
– First extend the previous step
– Then compute the gradient at the resultant position
– Add the two to obtain the final step
• This also applies directly to incremental update methods
– The accelerated gradient smooths out the variance in the
gradients
Nesterov’s Accelerated Gradient
• Nesterov’s method (where i indexes the update, and Loss is the SGD-instance or minibatch loss):
ΔW^(i) = β ΔW^(i−1) − η ∇_W Loss(W^(i−1) + β ΔW^(i−1))ᵀ
W^(i) = W^(i−1) + ΔW^(i)
Nesterov: Mini-batch update
• Given (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
• Initialize all weights W_1, W_2, …, W_K; j = 0, ΔW_k = 0
• Do:
– Randomly permute (X_1, d_1), (X_2, d_2), …, (X_T, d_T)
– For t = 1 : b : T
• j = j + 1
• For every layer k:
– W_k = W_k + β ΔW_k ← first extend the previous step
– ∇_{W_k} Loss = 0
• For t’ = t : t+b−1
– For every layer k:
» Compute ∇_{W_k} Div(Y_{t’}, d_{t’})
» ∇_{W_k} Loss += ∇_{W_k} Div(Y_{t’}, d_{t’})
• Update
– For every layer k:
W_k = W_k − η_j (∇_{W_k} Loss)ᵀ
ΔW_k = β ΔW_k − η_j (∇_{W_k} Loss)ᵀ
• Until the loss has converged
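A minimal sketch of one Nesterov update (illustrative only: `grad_fn` stands in for evaluating the mini-batch gradient at a given weight setting, and the lr/β values are assumptions):

```python
import numpy as np

def nesterov_step(w, delta_w, grad_fn, lr=0.01, beta=0.9):
    """One Nesterov update: extend the previous step, take the gradient at
    that look-ahead point, then add the two to get the final step."""
    w_lookahead = w + beta * delta_w        # first extend the previous step
    grad = grad_fn(w_lookahead)             # gradient at the resultant position
    w = w_lookahead - lr * grad             # final position for this iteration
    delta_w = beta * delta_w - lr * grad    # remember the combined step
    return w, delta_w

# e.g. once per mini-batch:  w, delta_w = nesterov_step(w, delta_w, batch_gradient)
```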
Still higher-order methods
• Momentum and Nesterov’s method improve convergence by normalizing the mean of the derivatives
• More recent methods take this one step further by also
considering their variance
– RMS Prop
– Adagrad
– AdaDelta
– ADAM: very popular in practice
– …
• All roughly equivalent in performance
Smoothing the trajectory
Step  X component  Y component
1         1            +2.5
2         1            −3
3         2            +2.5
4         1            −2
5         1.5           1.5
• Observation: Steps in “oscillatory” directions show large total movement
– In the example, total motion in the vertical direction is much greater than in the horizontal direction
– Can happen even when momentum or Nesterov’s method are used
• Improvement: Dampen step size in directions with high motion
– Second order term
Normalizing steps by second moment
• Modify the usual gradient-based update:
– Scale updates in every component in inverse proportion to the total movement of that component in the recent past
• According to their variation (not just their average)
• This will change the relative update sizes for the individual components
– In the above example it would scale down the Y component
– And scale up the X component (in comparison)
• We will see two popular methods that embody this principle…
RMS Prop
• Notation:
– Updates are made per parameter
– The derivative of the loss w.r.t. any individual parameter w is shown as ∂_w D
• D is the batch or minibatch loss, or the individual divergence, for batch/minibatch/SGD respectively
– The squared derivative is ∂²_w D = (∂_w D)²
• The short-hand notation represents the squared derivative, not the second derivative
– The mean squared derivative is a running estimate of the average squared derivative. We will show this as E[∂²_w D]
• Modified update rule: We want to
– scale down updates with large mean squared derivatives
– scale up updates with small mean squared derivatives
RMS Prop
• This is a variant on the basic mini-batch SGD algorithm
• Procedure:
– Maintain a running estimate of the mean squared value of
derivatives for each parameter
– Scale update of the parameter by the inverse of the root mean
squared derivative
RMS Prop (updates are for each weight of each layer)
• Do:
– Randomly shuffle the inputs to change their order
– Initialize: k = 1; for all weights w in all layers, E[∂²_w D]_0 = 0
– For all t = 1 : B : T (incrementing in blocks of B inputs)
• For all weights in all layers initialize (∂_w D)_k = 0
• For b = 0 : B − 1
– Compute
» Output Y(X_{t+b})
» Gradient dDiv(Y(X_{t+b}), d_{t+b}) / dw
» (∂_w D)_k += dDiv(Y(X_{t+b}), d_{t+b}) / dw
• Update: for all w ∈ {w_ij^(k)} ∀ i, j, k:
E[∂²_w D]_k = γ E[∂²_w D]_{k−1} + (1 − γ) (∂_w D)²_k
w_{k+1} = w_k − (η / √(E[∂²_w D]_k + ε)) (∂_w D)_k
(Typical value: γ ≈ 0.9)
• k = k + 1
• Until the loss has converged
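A minimal per-parameter sketch of one RMSProp update (illustrative only; the lr/γ/ε values are common defaults assumed here, and `batch_gradient` is a placeholder for the accumulated mini-batch gradient):

```python
import numpy as np

def rmsprop_step(w, grad, ms, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: keep a per-parameter running mean of squared
    derivatives and scale the step by its inverse root."""
    ms = gamma * ms + (1.0 - gamma) * grad ** 2   # running E[(dD/dw)^2], elementwise
    w = w - lr * grad / np.sqrt(ms + eps)         # normalize the step per parameter
    return w, ms

# e.g. once per mini-batch, with ms initialized to zeros:
# w, ms = rmsprop_step(w, batch_gradient(w), ms)
```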
ADAM: RMSprop with momentum
• RMS prop only considers a second-moment-normalized version of the current gradient
• ADAM utilizes a smoothed version of the momentum-augmented gradient
– It considers both the first and second moments
• Procedure:
– Maintain a running estimate of the mean derivative for each parameter
– Maintain a running estimate of the mean squared value of derivatives for each parameter
– Scale the update of the parameter by the inverse of the root mean squared derivative
m_k = δ m_{k−1} + (1 − δ) (∂_w D)_k
v_k = γ v_{k−1} + (1 − γ) (∂²_w D)_k
m̂_k = m_k / (1 − δ^k),   v̂_k = v_k / (1 − γ^k)
w_{k+1} = w_k − (η / (√(v̂_k) + ε)) m̂_k
– The divisions by (1 − δ^k) and (1 − γ^k) correct for the zero initialization, so that the bias toward zero does not dominate m̂ and v̂ in early iterations
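A minimal per-parameter sketch of one ADAM update (illustrative only; the defaults below are the commonly used values, and `batch_gradient` is a placeholder for the mini-batch gradient):

```python
import numpy as np

def adam_step(w, grad, m, v, k, lr=0.001, delta=0.9, gamma=0.999, eps=1e-8):
    """One ADAM update: running mean (m) and mean square (v) of the derivatives,
    bias-corrected, then an RMS-normalized momentum step."""
    m = delta * m + (1.0 - delta) * grad            # first-moment estimate
    v = gamma * v + (1.0 - gamma) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - delta ** k)                  # bias corrections: compensate for the
    v_hat = v / (1.0 - gamma ** k)                  # zero initialization of m and v
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# e.g. on the k-th update (k starting at 1), with m and v initialized to zeros:
# w, m, v = adam_step(w, batch_gradient(w), m, v, k)
```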
Other variants of the same theme
• Many:
– Adagrad
– AdaDelta
– AdaMax
– …
• Generally no explicit learning rate to optimize
– But they come with other hyperparameters to be optimized
– Typical parameter values (the commonly used defaults): RMSProp: γ ≈ 0.9; ADAM: δ ≈ 0.9, γ ≈ 0.999, ε ≈ 10⁻⁸
Visualizing the optimizers: Beale’s Function
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Long Valley
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Visualizing the optimizers: Saddle Point
• http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html
Story so far
• Gradient descent can be sped up by incremental
updates
– Convergence is guaranteed under most conditions
• Learning rate must shrink with time for convergence
– Stochastic gradient descent: update after each
observation. Can be much faster than batch learning
– Mini-batch updates: update after batches. Can be more
efficient than SGD
• Convergence can be improved using smoothed updates
– RMSprop and more advanced techniques