In this previous post, I introduced momentum-based gradient descent (GD) as well as Nesterov’s accelerated gradient descent (AGD). In this post, I show a few plots of how the paths taken by GD, GD with momentum, and AGD can differ in the simple case of minimizing a 2-dimensional quadratic form. The code for generating these plots is available at this colab. Feel free to make a copy and play around with the parameters to see what types of paths you can get!
In this first plot, we are trying to minimize the 2-dimensional function
starting at the point . We compare vanilla GD against GD with momentum 0.2, 0.5 and 0.8. For all 4 methods, we set the step size to 0.02 and take 30 steps. In the plot below, the bigger empty circles indicate the final value of each method after 30 steps.
The plot shows that for larger values of momentum, the algorithm tends to move further in each step. Large momentum values can cause the algorithm to oscillate, or to move toward the minimum in a less direct way (compare the teal line to the orange and blue ones), but in the case above, momentum 0.8 actually got closest to the minimum despite its winding route.
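To make the setup concrete, here is a minimal sketch of how such paths could be computed. It assumes the common "heavy-ball" form of momentum (v ← βv − α∇f(x), then x ← x + v), which may differ in detail from the formulation in the previous post and the colab. The matrix `A`, the starting point `x0`, and the helper `gd_momentum_path` are illustrative choices of mine, not the values or code behind the plot.

```python
import numpy as np

# Hypothetical objective f(x) = 0.5 * x^T A x, minimized at the origin.
# A and x0 are illustrative assumptions, not the values used for the plot.
A = np.array([[1.0, 0.0], [0.0, 10.0]])
x0 = np.array([-4.0, 3.0])


def grad(x):
    """Gradient of f(x) = 0.5 * x^T A x for symmetric A."""
    return A @ x


def gd_momentum_path(x0, step_size, momentum, n_steps):
    """Return the (n_steps + 1, 2) array of iterates of heavy-ball GD.

    momentum = 0 recovers vanilla gradient descent.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    path = [x.copy()]
    for _ in range(n_steps):
        v = momentum * v - step_size * grad(x)  # velocity update
        x = x + v                               # position update
        path.append(x.copy())
    return np.array(path)


# Same settings as in the text: step size 0.02, 30 steps, four momentum values.
paths = {m: gd_momentum_path(x0, 0.02, m, 30) for m in (0.0, 0.2, 0.5, 0.8)}
for m, p in paths.items():
    print(f"momentum={m}: distance to minimum after 30 steps = "
          f"{np.linalg.norm(p[-1]):.4f}")
```

Each entry of `paths` can be passed straight to a plotting call such as `plt.plot(p[:, 0], p[:, 1])` to draw the trajectory on top of the contours of the objective.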
In the second plot, we are trying to minimize the same objective function from the same initial point as in the first plot, but we instead set the step size to a larger value, 0.1, and take just 10 steps. With this setting, we see vanilla GD oscillating wildly. Small amounts of momentum (0.2 or 0.5 in this case) can help to dampen the oscillations, resulting in a more direct path to the minimum.
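One way to see why a large step size makes vanilla GD oscillate: along a direction with curvature λ, each vanilla GD step multiplies that coordinate by (1 − αλ), where α is the step size. Once αλ exceeds 1 the factor is negative, so the iterate jumps back and forth across the minimum. The sketch below tracks a single coordinate under this assumption. The curvature `lam = 19.0` is a hypothetical value chosen so that αλ = 1.9, and the heavy-ball update is again an assumed formulation rather than the colab's code.

```python
# One coordinate of a quadratic, f(x) = 0.5 * lam * x**2, so grad f(x) = lam * x.
# lam is a hypothetical curvature, chosen so that step_size * lam = 1.9.
lam = 19.0
step_size = 0.1
n_steps = 10


def run_1d(momentum):
    """Heavy-ball GD on f(x) = 0.5 * lam * x**2 starting from x = 1."""
    x, v = 1.0, 0.0
    xs = [x]
    for _ in range(n_steps):
        v = momentum * v - step_size * lam * x
        x = x + v
        xs.append(x)
    return xs


final = {}
for momentum in (0.0, 0.2, 0.5):
    xs = run_1d(momentum)
    final[momentum] = xs[-1]
    # Count how often consecutive iterates land on opposite sides of the minimum.
    sign_flips = sum(a * b < 0 for a, b in zip(xs[:-1], xs[1:]))
    print(f"momentum={momentum}: sign flips={sign_flips}, "
          f"|x| after {n_steps} steps={abs(xs[-1]):.5f}")
```

For this particular curvature, vanilla GD shrinks the coordinate by a factor of only 0.9 in magnitude per step while flipping its sign every time, whereas the runs with momentum 0.2 and 0.5 end up closer to the minimum after the same 10 steps. Other curvatures can behave differently, so this is an illustration rather than a general rule.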
In this final plot, we want to minimize
starting at the point . Here, we compare vanilla GD, GD with momentum 0.7, and Nesterov’s AGD with momentum 0.7. We set the step size to 0.03 for all methods and take 10 steps for each method. We see that for this particular setting, AGD oscillates less than GD with momentum, and is able to move more quickly to the minimum than vanilla GD.
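For readers who want to reproduce this kind of comparison, here is a minimal sketch of the three updates side by side. It assumes one common formulation of Nesterov's method, in which the gradient is evaluated at the look-ahead point x + βv instead of at x. The previous post and the colab may write the update differently. The matrix `A`, the starting point `x0`, and the helper `run` are hypothetical and are not the objective or code behind the plot.

```python
import numpy as np

# Hypothetical quadratic f(x) = 0.5 * x^T A x with its minimum at the origin.
# A and x0 are illustrative assumptions, not the values used for the plot.
A = np.diag([1.0, 20.0])
x0 = np.array([3.0, 2.0])


def grad(x):
    """Gradient of f(x) = 0.5 * x^T A x for symmetric A."""
    return A @ x


def run(x0, step_size, momentum, n_steps, nesterov=False):
    """Return the iterates of GD with momentum, optionally Nesterov-style."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    path = [x.copy()]
    for _ in range(n_steps):
        # Nesterov evaluates the gradient at the look-ahead point x + momentum * v;
        # plain momentum evaluates it at the current point x.
        point = x + momentum * v if nesterov else x
        v = momentum * v - step_size * grad(point)
        x = x + v
        path.append(x.copy())
    return np.array(path)


# Settings from the text: step size 0.03, momentum 0.7, 10 steps.
gd = run(x0, 0.03, 0.0, 10)
heavy_ball = run(x0, 0.03, 0.7, 10)
agd = run(x0, 0.03, 0.7, 10, nesterov=True)

for name, path in [("GD", gd), ("GD + momentum", heavy_ball), ("AGD", agd)]:
    print(f"{name}: distance to minimum after 10 steps = "
          f"{np.linalg.norm(path[-1]):.4f}")
```

With momentum set to 0, the look-ahead point coincides with the current point, so both variants reduce to vanilla GD. That makes for a quick sanity check when experimenting with the parameters.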
These comparisons are not universal, in the sense that these algorithms can take very different paths depending on the objective function, starting point, as well as algorithm parameters like step size and momentum coefficient. It’s worth duplicating the colab and playing around with the parameters yourself to see what happens under different settings. Once again, the code for creating these plots is available here.

