Adam/AdamW implementation minor fix #22628
Conversation
Good catch! I've updated the name to emphasize that both Adam and AdamW are affected. For reference, see also the papers cited in the documentation.
This change looks good to me. I'll run a few tests before merging.
facebook-github-bot left a comment
@vincentqb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Great, thanks!
@vincentqb merged this pull request in fed5ca1.
Has the same modification been done to the cpp implementation of Adam? |
I had forgotten about that. I have just addressed it in a new PR: #23737 |
Summary: This PR is a follow-up to pytorch#22628. I had submitted the PR for `adam.py` and `adamw.py` but had forgotten about `adam.cpp`.
Pull Request resolved: pytorch#23737
Differential Revision: D16623828
Pulled By: vincentqb
fbshipit-source-id: 4390fd751d1c0cd12f32214b4234d42a06dcbb20
Would it make sense to have the same modification in the implementation of SparseAdam? |
Yes indeed! I looked into it but apparently forgot to update it; I can go ahead and fix it in a separate PR, roughly along the lines of the sketch below.
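As a rough illustration of what that follow-up could look like, here is a minimal sketch of the analogous per-value update for sparse gradients. The function and its gathered-values interface are hypothetical (the real `SparseAdam` operates on the indices and values of sparse tensors); only the denominator line reflects the correction discussed in this PR.

```python
import math
import torch

def sparse_adam_values(exp_avg, exp_avg_sq, grad_values, step,
                       lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Hypothetical per-value Adam update for the nonzero entries of a
    sparse gradient; exp_avg/exp_avg_sq hold the first and second moment
    estimates gathered at those entries."""
    beta1, beta2 = betas
    exp_avg = exp_avg * beta1 + (1 - beta1) * grad_values
    exp_avg_sq = exp_avg_sq * beta2 + (1 - beta2) * grad_values.pow(2)
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    # Same correction as the dense fix: bias-correct v before adding eps,
    # so eps itself is never rescaled.
    denom = exp_avg_sq.sqrt().div_(math.sqrt(bias_correction2)).add_(eps)
    update = -(lr / bias_correction1) * exp_avg / denom
    return update, exp_avg, exp_avg_sq
```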
I have noticed a small discrepancy between theory and the implementation of AdamW, and of Adam in general. The epsilon in the denominator of the Adam update should not be scaled by the bias correction (Algorithm 2, L9-12); only the running average of the gradients (m) and of the squared gradients (v) should be scaled by their corresponding bias corrections.
In the current implementation, the epsilon is effectively scaled by the square root of `bias_correction2`. I have plotted this ratio (effective epsilon over specified epsilon) as a function of step, given `beta2 = 0.999` and `eps = 1e-8`. In the early steps of optimization, the ratio deviates slightly from theory (denoted by the horizontal red line).
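To make the discrepancy concrete, here is a minimal sketch of a single Adam step in PyTorch style. The names (`exp_avg`, `exp_avg_sq`, `bias_correction1`, `bias_correction2`) mirror `torch.optim.Adam`, but the function itself is an illustration rather than the library source.

```python
import math
import torch

def adam_step(p, grad, exp_avg, exp_avg_sq, step,
              lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One illustrative Adam step on parameter p (updated in place)."""
    beta1, beta2 = betas
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)               # m_t
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # v_t

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step

    # Before the fix, the step was computed as
    #   denom = exp_avg_sq.sqrt().add_(eps)
    #   step_size = lr * math.sqrt(bias_correction2) / bias_correction1
    # which folds sqrt(bias_correction2) into the whole denominator, so
    # eps is implicitly divided by sqrt(bias_correction2) as well.

    # After the fix, only v_t is bias-corrected before eps is added:
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(eps)
    step_size = lr / bias_correction1
    p.add_(exp_avg / denom, alpha=-step_size)
```

At `step = 1` with `beta2 = 0.999`, the pre-fix formulation effectively uses `eps / math.sqrt(0.001)`, roughly `31.6 * eps`, which is the early-step deviation the plot shows; the corrected version uses exactly `eps` at every step.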