
Implement simplified Nesterov momentum #53

@kloudkl

Description

The main idea of Nesterov accelerated gradient (NAG, Nesterov momentum) is to update the parameters using the gradient evaluated at a predicted (looked-ahead) point. To reduce sample variance, NAG smooths the updates by exponentially averaging the gradient history.

Sutskever et al. [1] showed that NAG is effective at improving the stability and convergence rate of stochastic optimization of deep networks, and that the update can be written in two steps.

With momentum coefficient mu and learning rate eps:

    v_{t+1} = mu * v_t - eps * grad f(theta_t + mu * v_t)
    theta_{t+1} = theta_t + v_{t+1}
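As a concrete illustration, the two-step update can be sketched in plain Python on a toy quadratic (the objective f(x) = 0.5 * x**2, the grad helper, and the hyperparameter values are all illustrative, not part of Caffe):

```python
# Minimal sketch of two-step Nesterov accelerated gradient on a toy
# quadratic. Names and values are illustrative, not Caffe's.
def grad(x):
    return x  # gradient of f(x) = 0.5 * x**2

mu = 0.9    # momentum coefficient
eps = 0.1   # learning rate
theta, v = 5.0, 0.0

for _ in range(200):
    # Step 1: evaluate the gradient at the looked-ahead point theta + mu * v.
    g = grad(theta + mu * v)
    # Step 2: fold it into the velocity, then apply the velocity.
    v = mu * v - eps * g
    theta = theta + v

print(theta)  # converges toward the minimum at 0
```

Evaluating the gradient at theta + mu * v rather than at theta corrects the velocity before it overshoots, which is where NAG's improved stability at high momentum comes from.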

Simplified Nesterov momentum updates:
    v_{t+1} = mu * v_t - eps * grad f(theta_t)
    theta_{t+1} = theta_t + mu * v_{t+1} - eps * grad f(theta_t)

where theta_t here denotes the looked-ahead parameter theta_t + mu * v_t of the two-step form.

Bengio et al. [2] reformulated the update to show that it is equivalent to standard momentum, except for different linear weighting coefficients on the velocity and gradient terms.
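That equivalence can be checked numerically: tracking the looked-ahead variable Theta = theta + mu * v under the simplified update reproduces the classic two-step NAG trajectory. The sketch below uses the same illustrative toy quadratic and names as above:

```python
# Sketch checking the claimed equivalence numerically. The toy objective
# f(x) = 0.5 * x**2 and all names are illustrative, not Caffe's.
def grad(x):
    return x  # gradient of f(x) = 0.5 * x**2

mu, eps = 0.9, 0.1

theta, v = 5.0, 0.0  # classic two-step NAG in theta
Theta, w = 5.0, 0.0  # simplified update in Theta = theta + mu * v

for _ in range(50):
    # Classic: gradient at the looked-ahead point.
    g = grad(theta + mu * v)
    v = mu * v - eps * g
    theta += v

    # Simplified: gradient at the current iterate itself.
    h = grad(Theta)
    w = mu * w - eps * h
    Theta += mu * w - eps * h

# The simplified iterate tracks theta + mu * v of the classic run.
print(abs(Theta - (theta + mu * v)))
```

Because the simplified form only ever needs the gradient at the current iterate, it drops the extra looked-ahead evaluation, which is what makes it convenient to implement inside a standard SGD solver loop.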

[1] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA, 2013. JMLR: W&CP volume 28.
[2] Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in optimizing recurrent networks. arXiv:1212.0901.
