
Conversation

@farhadrgh
Contributor

I have noticed a small discrepancy between theory and the implementation of AdamW and, more generally, Adam. The epsilon in the denominator of the following Adam update should not be scaled by the bias correction (Algorithm 2, L9-12). Only the running averages of the gradient (m) and of the squared gradients (v) should be scaled by their corresponding bias corrections.

The Adam update in question:

$\hat{m}_t = \dfrac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \dfrac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \dfrac{\alpha\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$

In the current implementation, the epsilon is effectively scaled by the square root of bias_correction2. I have plotted this scaling factor as a function of step, given beta2 = 0.999 and eps = 1e-8. In the early steps of optimization, the factor deviates slightly from theory (denoted by the horizontal red line).

[Plot: epsilon scaling factor vs. step for beta2 = 0.999 and eps = 1e-8; the theoretical value is marked by a horizontal red line.]
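
For concreteness, here is a minimal sketch contrasting the two ways of forming the denominator for a single Adam step. The variable names (`exp_avg`, `exp_avg_sq`) and the hyperparameter values are illustrative assumptions, not the exact torch.optim source:

```python
import math
import torch

# Hedged sketch of the two denominator variants for one Adam update step.
beta1, beta2, eps, lr, step = 0.9, 0.999, 1e-8, 1e-3, 1
grad = torch.randn(10)
exp_avg = (1 - beta1) * grad             # m_t after the first step (m_0 = 0)
exp_avg_sq = (1 - beta2) * grad * grad   # v_t after the first step (v_0 = 0)
bias_correction1 = 1 - beta1 ** step
bias_correction2 = 1 - beta2 ** step

# Current behaviour: eps is added to sqrt(v_t) before the sqrt(bias_correction2)
# factor is folded into the step size, so eps gets rescaled along with sqrt(v_t).
denom_current = exp_avg_sq.sqrt() + eps
update_current = -lr * math.sqrt(bias_correction2) / bias_correction1 * exp_avg / denom_current

# Proposed behaviour: only sqrt(v_t) is bias-corrected; eps is added afterwards,
# as in Algorithm 2 of the paper.
denom_proposed = exp_avg_sq.sqrt() / math.sqrt(bias_correction2) + eps
update_proposed = -lr / bias_correction1 * exp_avg / denom_proposed

# In the current variant the effective epsilon, relative to the nominal eps,
# works out to 1 / sqrt(bias_correction2), which deviates from 1 in early steps.
print(1 / math.sqrt(bias_correction2))
print((update_current - update_proposed).abs().max())
```

Since sqrt(v_t) / sqrt(bias_correction2) equals sqrt(v̂_t), the second form matches the update written above exactly, and the two variants coincide as bias_correction2 approaches 1, i.e., after many steps.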

@pytorchbot added the module: optimizer (Related to torch.optim) label on Jul 9, 2019
@soumith requested a review from vincentqb on Jul 10, 2019
@jerryzh168 added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label on Jul 13, 2019
@vincentqb changed the title from "AdamW implementation minor fix" to "Adam/AdamW implementation minor fix" on Jul 31, 2019
@vincentqb
Contributor

vincentqb commented Jul 31, 2019

Good catch! I've updated the title to emphasize that both Adam and AdamW are affected. For reference, see also the papers cited in the documentation.

@vincentqb
Contributor

This change looks good to me. I'll run a few tests before merging.

@facebook-github-bot
Contributor


@vincentqb has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@farhadrgh
Contributor Author

> Good catch! I've updated the title to emphasize that both Adam and AdamW are affected. For reference, see also the papers cited in the documentation.

Great, thanks!

@facebook-github-bot
Contributor

@vincentqb merged this pull request in fed5ca1.

@fl2o

fl2o commented Aug 2, 2019

Has the same modification been done to the cpp implementation of Adam?

@farhadrgh
Contributor Author

> Has the same modification been done to the cpp implementation of Adam?

I had forgotten about that. I have just addressed it in a new PR: #23737

@farhadrgh deleted the adamw branch on August 2, 2019, 17:02
iotamudelta pushed a commit to ROCm/pytorch that referenced this pull request Aug 5, 2019
Summary:
This PR is in accordance with pytorch#22628
I had submitted the PR for `adam.py` and `adamw.py` but had forgotten about `adam.cpp`.
Pull Request resolved: pytorch#23737

Differential Revision: D16623828

Pulled By: vincentqb

fbshipit-source-id: 4390fd751d1c0cd12f32214b4234d42a06dcbb20
@azcohen14-zz

Would it make sense to have the same modification in the implementation of SparseAdam?

@farhadrgh
Contributor Author

Yes indeed! I looked into it but apparently forgot to update it. I can go ahead and fix it in a separate PR.


Labels

Merged · module: bc-breaking (Related to a BC-breaking change) · module: optimizer (Related to torch.optim) · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
