What are the differences between the different formulations for momentum in SGD optimizers?
When looking around online, you find various formulations for momentum to accelerate (stochastic) gradient descent:
- Most sources use the following exponential moving average: $$\begin{aligned} v &\gets \mu v + \eta \nabla_\theta L(\theta) \\ \theta &\gets \theta - v. \end{aligned}$$ This is also how TensorFlow implements momentum.
- Another source uses the following formulation: $$\begin{aligned} v &\gets \mu v + \nabla_\theta L(\theta) \\ \theta &\gets \theta - \eta v, \end{aligned}$$ which is the default in Optax (JAX) and PyTorch.
- For a similar question on StackExchange, the best (and only) answer introduces a formulation based on a convex combination of momentum and gradients: $$\begin{aligned} v &\gets \mu v + (1-\mu)\nabla_\theta L(\theta) \\ \theta &\gets \theta - \eta v. \end{aligned}$$ The answer claims this should be the best approach, but this is not implemented in any (popular) DL library.
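In code, the three variants could be sketched as follows. This is a toy 1-D implementation for side-by-side comparison, not how any of the libraries actually structure their optimizers:

```python
def sgd_momentum(grad, theta0, eta, mu, steps, variant):
    """Toy 1-D momentum SGD; `variant` selects one of the three rules."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        if variant == 1:    # v <- mu*v + eta*g;        theta <- theta - v
            v = mu * v + eta * g
            theta -= v
        elif variant == 2:  # v <- mu*v + g;            theta <- theta - eta*v
            v = mu * v + g
            theta -= eta * v
        else:               # v <- mu*v + (1-mu)*g;     theta <- theta - eta*v
            v = mu * v + (1 - mu) * g
            theta -= eta * v
    return theta

# toy loss L(theta) = theta^2 / 2, whose gradient is theta
print(sgd_momentum(lambda t: t, 1.0, 0.1, 0.9, 50, 1))
```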
The same answer also claims that some of these formulas are equivalent, but it is not entirely clear from the answer why the third approach would be preferred, nor how these equivalences work.
As a result, I am left with the question(s): are these different formulations equivalent, or do they lead to significantly different results? If they are not equivalent, is there any reason to prefer one approach over the others?
1 answer
If we assume fixed $\eta$ and $\mu$ and initialise $v_0 = 0$, the first two formulas are practically equivalent: $$\begin{aligned} \theta_1 &= \theta_0 - \eta \nabla_\theta L(\theta_0) & \theta_1 &= \theta_0 - \eta \nabla_\theta L(\theta_0) \\ \theta_2 &= \theta_1 - \big(\mu \eta \nabla_\theta L(\theta_0) + \eta \nabla_\theta L(\theta_1)\big) & \theta_2 &= \theta_1 - \eta \big(\mu \nabla_\theta L(\theta_0) + \nabla_\theta L(\theta_1)\big) \\ &\vdots & &\vdots \\ \theta_n &= \theta_{n-1} - \sum_{t=1}^{n} \mu^{n - t} \eta \nabla_\theta L(\theta_{t-1}) & \theta_n &= \theta_{n-1} - \eta \underbrace{\sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})}_{v_n} \\ &= \theta_{n-1} - \underbrace{\eta \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})}_{v_n} & & \end{aligned}$$ The first column is option 1 and the second column is option 2. In a similar way, one can show that option 2 with a modified learning rate $\eta' = \eta (1 - \mu)$ is equivalent to option 3. From this analysis, we can conclude that the only difference is that options 1 and 2 use a different effective learning rate than option 3.
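Under the stated assumptions (fixed $\eta$ and $\mu$, $v_0 = 0$), both equivalences can be checked numerically on a toy quadratic loss. The helper below is purely illustrative, not a library routine:

```python
def run(update, theta0=1.0, steps=100):
    """Iterate an update rule (theta, v, g) -> (theta, v) from v0 = 0
    on the toy loss L(theta) = theta^2 / 2, whose gradient is theta."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        theta, v = update(theta, v, theta)
    return theta

eta, mu = 0.05, 0.9
# option 1: v <- mu*v + eta*g,  theta <- theta - v
opt1 = run(lambda t, v, g: (t - (mu * v + eta * g), mu * v + eta * g))
# option 2: v <- mu*v + g,      theta <- theta - eta*v
opt2 = run(lambda t, v, g: (t - eta * (mu * v + g), mu * v + g))
# option 3 with the rescaled rate eta / (1 - mu), i.e. option 2's eta
opt3 = run(lambda t, v, g: (t - eta / (1 - mu) * (mu * v + (1 - mu) * g),
                            mu * v + (1 - mu) * g))
print(opt1, opt2, opt3)  # all three agree up to floating-point error
```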
Another way to analyse the similarities and differences is to focus on what $v$ is supposed to represent. Since all three options compute an exponential moving average, it can be argued that $v$ is an estimate of the average gradient. This allows us to analyse the quality of this estimate (cf. Section 3 of the Adam paper). To this end, we assume that the individual gradients are unbiased estimates, i.e. $$\forall i : \mathbb{E}[\nabla_\theta L(\theta_i)] = \mathbb{E}_\theta[\nabla_\theta L(\theta)].$$ Using the formulations for $v_n$ from above, we have for option 1: $$\begin{align} \mathbb{E}[v_n] &= \mathbb{E}\Bigl[\eta \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1})\Bigr] \\ &= \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \eta \sum_{t=1}^{n} \mu^{n - t} \\ &= \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \eta \frac{1 - \mu^n}{1 - \mu}. \end{align}$$ Similarly, for option 2 we find $$\mathbb{E}[v_n] = \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] \frac{1 - \mu^n}{1 - \mu},$$ and for option 3 we have $$\mathbb{E}[v_n] = \mathbb{E}_\theta\Bigl[\nabla_\theta L(\theta)\Bigr] (1 - \mu^n).$$
Because normally $\mu \in (0, 1)$, the factor $\frac{1 - \mu^n}{1 - \mu}$ grows towards $\frac{1}{1 - \mu} > 1$, whereas the factor $(1 - \mu^n)$ in option 3 approaches $1$. Therefore, option 3 uses a less biased estimate of the average gradient than option 2. The bias of the estimate in option 1 additionally depends on the learning rate. However, it is possible to correct for these biases (see e.g. Adam) by using $\theta \gets \theta - \frac{1 - \mu}{1 - \mu^n} v$, $\theta \gets \theta - \frac{\eta (1 - \mu)}{1 - \mu^n} v$ and $\theta \gets \theta - \frac{\eta}{1 - \mu^n} v$ at step $n$ for options 1, 2 and 3, respectively. With these bias corrections, all three options become equivalent.
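The corrections can be checked numerically with a constant gradient $g$ standing in for its expected value; this is a toy sketch of the correction factors, not Adam itself:

```python
eta, mu, g = 0.1, 0.9, 2.0   # constant "gradient" g stands in for its mean

v1 = v2 = v3 = 0.0
for t in range(1, 6):
    v1 = mu * v1 + eta * g           # option 1
    v2 = mu * v2 + g                 # option 2
    v3 = mu * v3 + (1 - mu) * g      # option 3
    step1 = (1 - mu) / (1 - mu**t) * v1
    step2 = eta * (1 - mu) / (1 - mu**t) * v2
    step3 = eta / (1 - mu**t) * v3
    # each corrected step equals eta * g (up to rounding) at every t
    print(t, step1, step2, step3)
```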
When $\eta$ and/or $\mu$ change during learning (e.g. by using learning rate schedules), none of the previously discussed equivalences hold. In options 2 and 3, possible learning rate and momentum schedules are clearly separated, but in option 1 the learning rate schedule interferes with the momentum schedule. This can easily be seen by considering the setting with a fixed momentum $\mu$ in combination with a learning rate schedule $\eta_1, \eta_2, \ldots$: $$\begin{aligned} \theta_1 &= \theta_0 - \eta_1 \nabla_\theta L(\theta_0) & \theta_1 &= \theta_0 - \eta_1 \nabla_\theta L(\theta_0) \\ \theta_2 &= \theta_1 - \big(\mu \eta_1 \nabla_\theta L(\theta_0) + \eta_2 \nabla_\theta L(\theta_1)\big) & \theta_2 &= \theta_1 - \eta_2 \big(\mu \nabla_\theta L(\theta_0) + \nabla_\theta L(\theta_1)\big) \\ &\vdots & &\vdots \\ \theta_n &= \theta_{n-1} - \sum_{t=1}^n \mu^{n-t} \eta_t \nabla_\theta L(\theta_{t-1}) & \theta_n &= \theta_{n-1} - \eta_n \sum_{t=1}^{n} \mu^{n - t} \nabla_\theta L(\theta_{t-1}) \\ &= \theta_{n-1} - \eta_n \sum_{t=1}^n \mu^{n-t} \frac{\eta_t}{\eta_n} \nabla_\theta L(\theta_{t-1}) & & \end{aligned}$$ Again, the first column is option 1 and the second column is option 2. They are no longer equivalent, because option 1 retains the old learning rates $\eta_t$ inside $v$. The interpretation of option 3 as option 2 with a rescaled learning rate does still hold if we only use a learning rate schedule and a fixed momentum, which is the more common scenario in deep learning. With a momentum schedule, however, this relation also breaks down.
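A small numerical check illustrates this: on a toy quadratic with a hypothetical decaying schedule, the two columns produce different iterates from the second step onwards (a sketch, not any library's scheduler):

```python
def run_scheduled(variant, theta0=1.0, mu=0.5, steps=5):
    """Options 1 and 2 on L(theta) = theta^2 / 2 with a decaying eta."""
    theta, v = theta0, 0.0
    for t in range(steps):
        g = theta                  # gradient of the toy loss
        eta = 0.1 / (1 + t)        # hypothetical learning-rate schedule
        if variant == 1:           # eta folded into v
            v = mu * v + eta * g
            theta -= v
        else:                      # eta applied outside v
            v = mu * v + g
            theta -= eta * v
    return theta

print(run_scheduled(1), run_scheduled(2))  # the iterates now differ
```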
TL;DR: All three formulations can be made equivalent for a fixed learning rate and momentum, but option 3 provides the least biased estimate of the average gradient out of the box. With schedules, things become more complicated, but options 2 and 3 at least keep the learning-rate and momentum schedules disentangled.
