AdamW and AdaBound algorithms for C++ frontend #17468
Conversation
Summary:

# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](https://github.com/fastai/fastai/blob/803894051bef32304ceea0c8ea5e04db64ff26b8/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training. There have already been several abortive attempts to push this into PyTorch in some form or fashion: #17468, #10866, #3740, #4429. Hopefully this one goes through.

# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. The paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) showed that this is why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer; L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties of adaptive optimization schemes.

# How was this tested?
Test cases were added to `test_optim.py`, and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.

Pull Request resolved: #21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
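To make the "decoupled" step concrete, here is a minimal sketch of how a decoupled weight-decay update differs from folding an L2 penalty into the gradient. It assumes libtorch is available; the free function and the tensor names are illustrative only, not the actual code in this PR or in fastai.

```cpp
#include <torch/torch.h>

// Minimal sketch of decoupled weight decay (AdamW-style): the decay shrinks the
// weights directly and is kept separate from the adaptive, gradient-based update,
// instead of being added to the gradient as an L2 penalty. Illustrative only.
void decoupled_weight_decay_step(torch::Tensor& param,
                                 const torch::Tensor& adam_direction,  // e.g. m_hat / (sqrt(v_hat) + eps)
                                 double lr,
                                 double weight_decay) {
  torch::NoGradGuard no_grad;           // parameter updates are not tracked by autograd
  param.mul_(1.0 - lr * weight_decay);  // decoupled decay, scaled by the learning rate
  param.add_(adam_direction, -lr);      // then the usual Adam-style step
}
```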
yf225 left a comment:
@vincentqb Would you like to review this PR and check that it matches our Python implementation in #21250?
@dnbaker Could you add corresponding tests in test/cpp/api/optim.cpp, similar to
pytorch/test/cpp/api/optim.cpp, lines 184 to 186 in 52de340:

```cpp
TEST(OptimTest, XORConvergence_Adam) {
  ASSERT_TRUE(test_optimizer_xor<Adam>(AdamOptions(0.1).weight_decay(1e-6)));
}
```
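A corresponding test for the new optimizer might look roughly like the sketch below, placed alongside the existing XOR tests so it can reuse the `test_optimizer_xor` helper; the names `AdamW` and `AdamWOptions` are assumed to parallel `Adam`/`AdamOptions` and are not taken from this PR's diff.

```cpp
// Hypothetical sketch: assumes this PR exposes AdamW/AdamWOptions with the same
// interface as Adam/AdamOptions, and lives next to the existing tests in
// test/cpp/api/optim.cpp so the test_optimizer_xor<> helper is available.
TEST(OptimTest, XORConvergence_AdamW) {
  ASSERT_TRUE(test_optimizer_xor<AdamW>(AdamWOptions(0.1).weight_decay(1e-6)));
}
```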
I've added these tests here; let's see if the tests pass before asking for further review.
@dnbaker Thanks! Also make sure to remove changes to …
…s no longer in its source tree; we are using the current master branch for it.
If any of the maintainers are looking, could you point me toward where I need to include my torch/optim/*.cpp files for the build process? The objects aren't being located, and my prior build experience has been primarily with Makefiles, so this is somewhat foreign.
@pytorchbot rebase this please
```cpp
NoGradGuard guard;

if(options.weight_decay_ > 0) {
  Tensor decoupled_weight_decay = p * (-options.weight_decay_ * step_size);
```
Please note that this implementation doesn't match the original paper or the Python implementation.
To match the paper, this should be `-options.weight_decay * options.learning_rate`.
Even then, there would be a difference from the PyTorch Python implementation, because epsilon is added to `sqrt(exp_average_sq)` here, while it is added to `sqrt(exp_average_sq / bias_correction2)` in Python. This essentially changes the scale of epsilon and makes the two implementations incompatible hyperparameter-wise.
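To illustrate the epsilon point numerically, here is a small self-contained sketch (assuming libtorch; the variable names are illustrative and taken from the discussion above, not from either code base) comparing the two denominators: adding epsilon to `sqrt(exp_avg_sq)` before the bias correction versus adding it after bias-correcting, as the Python optimizer does.

```cpp
#include <torch/torch.h>
#include <cmath>
#include <iostream>

int main() {
  // Illustrative values only.
  torch::Tensor exp_avg_sq = torch::full({1}, 1e-6);  // second-moment estimate
  const double eps = 1e-8;
  const double beta2 = 0.999;
  const int64_t step = 1;
  const double bias_correction2 = 1.0 - std::pow(beta2, step);

  // Variant discussed for this PR's C++ code: eps is added to sqrt(exp_avg_sq),
  // with the bias correction folded into the step size afterwards.
  torch::Tensor denom_cpp = exp_avg_sq.sqrt() + eps;

  // Variant used by the Python optimizer: bias-correct first, then add eps.
  torch::Tensor denom_py = (exp_avg_sq / bias_correction2).sqrt() + eps;

  // The two denominators differ whenever bias_correction2 != 1, so the same eps
  // value does not mean the same thing in both implementations.
  std::cout << denom_cpp.item<double>() << " vs " << denom_py.item<double>() << "\n";
  return 0;
}
```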
Note: I didn't check adabound.
See #22628 for the discussion about epsilon.
I'll let others who are more involved work on this, then, to make it available.
I've provided implementations of the AdamW and AdaBound algorithms for the C++ frontend, as described in [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) and the AdaBound paper (Adaptive Gradient Methods with Dynamic Bound of Learning Rate), respectively.
I'm not sure these are integrated into the libtorch build; I haven't been able to get far enough with the build system on my machine to get past linking errors. (It seems that libtorch doesn't provide libc10d on OSX, which makes testing difficult.) However, these object files do compile.
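For reviewers, a rough sketch of how the new optimizers are meant to be used from the C++ frontend is below. The class and option names (`AdamW`, `AdamWOptions`, `AdaBound`, `AdaBoundOptions`) and the `final_lr` option are assumptions made to parallel the existing `Adam`/`AdamOptions` API and the AdaBound paper; they are not confirmed against this PR's diff.

```cpp
#include <torch/torch.h>

// Hypothetical usage sketch only: names parallel the existing Adam/AdamOptions
// API and are not taken from this PR's source.
int main() {
  auto model = torch::nn::Linear(10, 1);

  // AdamW: Adam with decoupled weight decay (arXiv:1711.05101).
  torch::optim::AdamW adamw(
      model->parameters(),
      torch::optim::AdamWOptions(1e-3).weight_decay(1e-2));

  // AdaBound: Adam-style steps whose per-parameter learning rates are clipped by
  // bounds that tighten toward a final SGD-like rate (assumed `final_lr` option).
  torch::optim::AdaBound adabound(
      model->parameters(),
      torch::optim::AdaBoundOptions(1e-3).final_lr(0.1));

  // A single training step looks the same as with any other C++ frontend optimizer.
  auto input = torch::randn({4, 10});
  auto target = torch::randn({4, 1});
  adamw.zero_grad();
  auto loss = torch::mse_loss(model->forward(input), target);
  loss.backward();
  adamw.step();
  return 0;
}
```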
I've gone over the guidelines here, and I hope I've done what I needed to, but please correct me if I've missed anything.
Thank you!