
What are the differences between the different formulations for momentum in SGD optimizers?


When looking around online, you find various formulations for momentum to accelerate (stochastic) gradient descent:

  1. Most sources seem to use the following exponential moving average: $$\begin{aligned} v &\gets \mu v + \eta \nabla_\theta L(\theta) \\ \theta &\gets \theta - v. \end{aligned}$$ This is also how TensorFlow implements momentum.
  2. Another source uses the following formulation: $$\begin{aligned} v &\gets \mu v + \nabla_\theta L(\theta) \\ \theta &\gets \theta - \eta v, \end{aligned}$$ which is the one implemented in Optax (JAX) and PyTorch by default.
  3. For a similar question on StackExchange, the best (and only) answer introduces a formulation based on a convex combination of momentum and gradients: $$\begin{aligned} v &\gets \mu v + (1-\mu)\nabla_\theta L(\theta) \\ \theta &\gets \theta - \eta v. \end{aligned}$$ The answer claims this should be the best approach, but this is not implemented in any (popular) DL library.

The StackExchange answer claims that the third approach is preferable, but it is not entirely clear why. It also claims that some of these formulas are equivalent, but I do not quite understand how these equivalences work.

As a result, I am left with the question(s): are these different formulations equivalent, or do they lead to significantly different results? If they are not equivalent, is there any reason to prefer one approach over the others?


1 answer


If we assume fixed $\eta$ and $\mu$, and $v$ is initialised with $v_0 = 0$, the first two formulas are practically equivalent: $$\begin{aligned} \theta_1 &= \theta_0 - \eta \nabla_\theta L(\theta_0) & \theta_1 &= \theta_0 - \eta \nabla_\theta L(\theta_0) \\ \theta_2 &= \theta_1 - \big(\mu \eta \nabla_\theta L(\theta_0) + \eta \nabla_\theta L(\theta_1)\big) & \theta_2 &= \theta_1 - \eta \big(\mu \nabla_\theta L(\theta_0) + \nabla_\theta L(\theta_1)\big) \\ &\vdots & &\vdots \\ \theta_n &= \theta_{n-1} - \sum_{t=1}^{n} \mu^{n - t} \eta \nabla_\theta L(\theta_{t-1}) & \theta_n &= \theta_{n-1} - \eta \underbrace{\sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})}_{v_n} \\ &= \theta_{n-1} - \underbrace{\eta \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})}_{v_n} & & \end{aligned}$$ The first column is option 1 and the second column is option 2. In a similar way, it is possible to show that option 2 with a modified learning rate $\eta' = \eta (1 - \mu)$ is equivalent to option 3. From this analysis, we could conclude that the only difference is that options 1 and 2 use a different effective learning rate than option 3.
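As a quick numerical check of the unrolled sum, here is a sketch in plain Python on the toy loss $L(\theta) = \theta^2/2$, so that $\nabla_\theta L(\theta) = \theta$ (the variable names are mine):

```python
eta, mu = 0.1, 0.9
theta, v = 2.0, 0.0
grads = []                    # gradients seen along the trajectory
for _ in range(10):
    g = theta                 # gradient of L(theta) = theta**2 / 2
    grads.append(g)
    v = mu * v + g            # option 2 accumulator
    theta = theta - eta * v

# closed form from the derivation: v_n = sum_t mu^(n-t) * grad_t
n = len(grads)
v_closed = sum(mu ** (n - t) * grads[t - 1] for t in range(1, n + 1))
```

Here `v` (accumulated iteratively) and `v_closed` (the unrolled sum) agree up to floating-point rounding.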

Another way to analyse the similarities and differences is to focus on what $v$ is supposed to represent. Since all three options compute an exponential moving average in $v$, it could be argued that $v$ is an estimate of the average gradient. This allows us to analyse the quality of this estimate (cf. Section 3 of the Adam paper). To this end, we assume that the individual gradients are unbiased estimates of the average gradient, i.e. $$\forall i : \mathbb{E}[\nabla_\theta L(\theta_i)] = \mathbb{E}_\theta[\nabla_\theta L(\theta)].$$ Using the formulations for $v_n$ from above, we have for option 1: $$\begin{align} \mathbb{E}[v_n] &= \mathbb{E}\Bigl[\eta \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})\Bigr] \\ &= \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \eta \sum_{t=1}^{n} \mu^{n - t} \\ &= \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \eta \frac{1 - \mu^n}{1 - \mu}. \end{align}$$ Similarly, for option 2, we find $$\mathbb{E}[v_n] = \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \frac{1 - \mu^n}{1 - \mu},$$ and for option 3, we have $$\mathbb{E}[v_n] = \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] (1 - \mu^n).$$

Because (normally) $\mu \in (0, 1)$, we have $\frac{1}{1 - \mu} > 1$, and therefore option 3 gives a less biased estimate of the average gradient than option 2; the bias of option 1, on the other hand, additionally depends on the learning rate. However, it is possible to correct for these biases (see e.g. Adam) by using $\theta \gets \theta - \frac{1 - \mu}{1 - \mu^n} v$, $\theta \gets \theta - \frac{\eta (1 - \mu)}{1 - \mu^n} v$ and $\theta \gets \theta - \frac{\eta}{1 - \mu^n} v$ for options 1, 2 and 3, respectively. With these bias corrections, all three options become equivalent.
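To see the bias and its correction concretely, here is a small sketch (plain Python) that feeds a constant gradient into option 2's accumulator, so the true average gradient is known exactly; the correction factor is the one stated above:

```python
mu, g = 0.9, 1.0                          # constant gradient: true average is g
v = 0.0
for n in range(1, 6):
    v = mu * v + g                        # option 2 accumulator
    v_hat = v * (1 - mu) / (1 - mu ** n)  # bias-corrected estimate
    # without correction, v drifts towards g / (1 - mu) = 10,
    # while v_hat stays at g at every step
```

The uncorrected `v` overestimates the average gradient by the factor $\frac{1 - \mu^n}{1 - \mu}$, while `v_hat` recovers it exactly.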


When $\eta$ and/or $\mu$ change during learning (e.g. when using learning rate schedules), none of the previously discussed equivalences hold. In options 2 and 3, learning rate and momentum schedules are clearly separated, whereas in option 1 a learning rate schedule would interfere with the momentum schedule. This can be seen by considering a fixed momentum $\mu$ in combination with a learning rate schedule $\eta_1, \eta_2, \ldots$: $$\begin{aligned} \theta_1 &= \theta_0 - \eta_1 \nabla_\theta L(\theta_0) & \theta_1 &= \theta_0 - \eta_1 \nabla_\theta L(\theta_0) \\ \theta_2 &= \theta_1 - \big(\mu \eta_1 \nabla_\theta L(\theta_0) + \eta_2 \nabla_\theta L(\theta_1)\big) & \theta_2 &= \theta_1 - \eta_2 \big(\mu \nabla_\theta L(\theta_0) + \nabla_\theta L(\theta_1)\big) \\ &\vdots & &\vdots \\ \theta_n &= \theta_{n-1} - \sum_{t=1}^n \mu^{n-t} \eta_t \nabla_\theta L(\theta_{t-1}) & \theta_n &= \theta_{n-1} - \eta_n \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1}) \\ &= \theta_{n-1} - \eta_n \sum_{t=1}^n \mu^{n-t} \frac{\eta_t}{\eta_n} \nabla_\theta L(\theta_{t-1}) & & \end{aligned}$$ Again, column 1 is option 1 and column 2 is option 2. As a result, options 1 and 2 are no longer equivalent when using learning rate schedules. The interpretation of option 3 as option 2 with a scaled learning rate still holds if we only use a learning rate schedule and no momentum schedule, which is the more common scenario in deep learning. With a momentum schedule, however, this relation also breaks down.
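This interference can be reproduced numerically; a sketch in plain Python on the toy quadratic loss, with an arbitrary $1/t$ learning rate decay chosen purely for illustration:

```python
mu = 0.9
etas = [0.1 / (k + 1) for k in range(10)]   # arbitrary 1/t decay schedule
t1 = t2 = 2.0                               # same starting point for both options
v1 = v2 = 0.0
for eta in etas:
    g1, g2 = t1, t2                  # gradient of L(theta) = theta**2 / 2
    v1 = mu * v1 + eta * g1          # option 1: eta baked into v
    t1 = t1 - v1
    v2 = mu * v2 + g2                # option 2: eta applied outside v
    t2 = t2 - eta * v2
```

With a constant $\eta$ the two runs would coincide exactly; with the schedule they drift apart from the second step onwards, because option 1 keeps replaying old gradients at their old (larger) learning rates.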

TL;DR: All three formulations can be made equivalent for fixed learning rate and momentum, but option 3 provides the best estimates out-of-the-box. With schedules, things become more complicated, but options 2 and 3 at least keep learning-rate and momentum schedules disentangled.
