Support Weight Decay to adaptive Optimizers #10866
Conversation
fbe9cbd to 40994d7
40994d7 to 43dfb88
Summary:

# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](https://github.com/fastai/fastai/blob/803894051bef32304ceea0c8ea5e04db64ff26b8/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training. There have already been several abortive attempts to push this into pytorch in some form or fashion: #17468, #10866, #3740, #4429. Hopefully this one goes through.

# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer; L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties of adaptive optimization schemes.

# How was this tested?
Test cases were added to `test_optim.py`, and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.

Pull Request resolved: #21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
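To make the decoupling concrete, here is a minimal, hypothetical sketch (not the code from this PR) of applying decoupled weight decay alongside a stock `torch.optim.Adam` step. The model, data, and hyperparameter values are placeholders for illustration only:

```python
import torch

# Toy model and hyperparameters, purely illustrative.
model = torch.nn.Linear(10, 1)
lr, weight_decay = 1e-3, 1e-2

# Passing weight_decay to Adam itself would add an L2 term to the gradient
# *before* the adaptive rescaling, which is exactly what AdamW avoids.
opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.0)

for _ in range(10):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    # Decoupled weight decay: shrink the parameters directly, outside of the
    # adaptive gradient update, as described by Loshchilov & Hutter.
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1 - lr * weight_decay)
    opt.step()
```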
Hi @alex1o1o7cloud! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Summary:
As title.
AdamW (https://arxiv.org/abs/1711.05101) has shown good results by applying weight decay to adaptive optimizers.
This diff deals with Adagrad and Adam.
Currently, only constant weight decay is supported. According to the paper, dynamic weight decay hyperparameters that change with the number of batches should achieve even better results. We'll implement that in a separate diff if decoupled weight decay demonstrates early positive results. AdamWR is also not implemented.
Differential Revision: D9496208
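In current PyTorch releases this functionality is exposed as `torch.optim.AdamW`. A minimal usage sketch (model, data, and hyperparameter values are illustrative, not taken from this PR):

```python
import torch

# Illustrative model and hyperparameters.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()  # Adam step plus decoupled weight decay on the parameters
```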