
Conversation

@dnbaker commented Feb 25, 2019

I've provided C++ frontend implementations of the AdamW and AdaBound algorithms, as described in their respective papers.

I'm not sure these are properly integrated into the libtorch build; I haven't been able to get past linking errors with the build system on my machine. (It seems that libtorch doesn't provide libc10d on OSX, which makes testing difficult.) The object files themselves do compile, however.

I've gone over the contributing guidelines and I hope I've done everything needed, but please correct me if not.

Thank you!

@ezyang (Contributor) commented Feb 26, 2019

NB: I don't think we have either of these algorithms in the Python frontend. See #3790 and a few PRs we dropped the ball on: #3740, #4429, #10866.

facebook-github-bot pushed a commit that referenced this pull request Jul 2, 2019
Summary:
# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](https://github.com/fastai/fastai/blob/803894051bef32304ceea0c8ea5e04db64ff26b8/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training.

There have already been several abortive attempts to push this into pytorch in some form or fashion: #17468, #10866, #3740, #4429. Hopefully this one goes through.
# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, it can be shown that the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer. L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties that adaptive optimization schemes have.
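To make the distinction concrete, here is a minimal scalar sketch (plain C++, not code from this PR; the first-moment average and bias corrections are omitted for brevity, and all values are made up):

// Illustrative-only sketch, single scalar weight: with an Adam-style adaptive
// step, weight decay folded into the gradient (L2 regularization) gets rescaled
// by the adaptive denominator, whereas decoupled decay (AdamW) shrinks the
// weight by lr * weight_decay * w regardless of that rescaling.
#include <cmath>
#include <iostream>

int main() {
  const double lr = 1e-3, weight_decay = 1e-2, eps = 1e-8;
  const double grad = 0.5;        // gradient of the loss w.r.t. the weight
  const double exp_avg_sq = 0.04; // pretend second-moment estimate
  const double denom = std::sqrt(exp_avg_sq) + eps;

  // (1) "Adam + L2": the decay term passes through the adaptive rescaling,
  //     so its effective strength depends on the gradient statistics.
  double w_l2 = 1.0;
  w_l2 -= lr * (grad + weight_decay * w_l2) / denom;

  // (2) AdamW-style decoupled decay: the gradient step is adaptive, but the
  //     decay is applied directly to the weight, independent of denom.
  double w_adamw = 1.0;
  w_adamw -= lr * grad / denom;
  w_adamw -= lr * weight_decay * w_adamw;

  std::cout << w_l2 << " vs " << w_adamw << "\n";
  return 0;
}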
# How was this tested?
There were test cases added to `test_optim.py` and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.
Pull Request resolved: #21250

Differential Revision: D16060339

Pulled By: vincentqb

fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019
@yf225 requested a review from vincentqb July 16, 2019 18:08

@yf225 (Contributor) left a comment

@vincentqb Would you like to review this PR and check that it matches our Python implementation in #21250?

@dnbaker Could you add corresponding tests in test/cpp/api/optim.cpp, similar to the following? Thanks!

TEST(OptimTest, XORConvergence_Adam) {
  ASSERT_TRUE(test_optimizer_xor<Adam>(AdamOptions(0.1).weight_decay(1e-6)));
}
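For reference, a corresponding AdamW test might look roughly like the following, assuming the PR exposes an AdamW optimizer with an AdamWOptions struct analogous to AdamOptions (these names are assumptions, not confirmed against the PR's code):

// Hypothetical sketch for test/cpp/api/optim.cpp; AdamW and AdamWOptions are
// assumed to mirror the existing Adam / AdamOptions API.
TEST(OptimTest, XORConvergence_AdamW) {
  ASSERT_TRUE(test_optimizer_xor<AdamW>(AdamWOptions(0.1).weight_decay(1e-2)));
}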

@pytorchbot added the module: cpp (Related to C++ API) label Jul 22, 2019
@dnbaker (Author) commented Jul 22, 2019

I've added these tests; I'll see whether they pass before asking for further review.

@yf225 (Contributor) commented Jul 22, 2019

@dnbaker Thanks! Also make sure to remove changes to third_party/ as they are likely not intended.

@pytorchbot added the caffe2, module: build (Build system issues), module: nccl (Problems related to nccl support), module: onnx (Related to torch.onnx), and module: pybind (Related to our Python bindings / interactions with other Python libraries) labels Jul 22, 2019
@dnbaker (Author) commented Jul 27, 2019

If any of the maintainers are looking, could you point me toward where I need to include my torch/optim/*cpp files in the build process? The object files aren't being located, and my prior build experience has been primarily with Makefiles, so this is somewhat foreign to me.

@soumith requested a review from yf225 July 29, 2019 03:30
@yf225 (Contributor) commented Jul 29, 2019

@pytorchbot rebase this please

(Inline review thread on the following excerpt of the weight-decay step:)

NoGradGuard guard;

if (options.weight_decay_ > 0) {
  Tensor decoupled_weight_decay = p * (-options.weight_decay_ * step_size);
Contributor commented:
Please note that this implementation matches neither the original paper nor the Python implementation.
To match the paper, this should be -options.weight_decay * options.learning_rate.

Even then, there would still be a difference from the PyTorch Python implementation, because here epsilon is added to sqrt(exp_average_sq), while in Python it is added to sqrt(exp_average_sq / bias_correction2). This effectively changes the scale of epsilon and makes the two implementations incompatible hyperparameter-wise.
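To make both points concrete, here is a minimal scalar sketch (illustrative only, not the PR's code; the moment estimates and step count are made up):

// Illustrative scalar sketch of the two discrepancies described above.
#include <cmath>
#include <iostream>

int main() {
  const double lr = 1e-3, weight_decay = 1e-2, eps = 1e-8;
  const double exp_avg = 0.5, exp_avg_sq = 0.04;             // pretend Adam moments
  const double bias_correction2 = 1.0 - std::pow(0.999, 10); // after 10 steps

  // (a) Paper-form decoupled decay: shrink the weight by
  //     learning_rate * weight_decay, rather than by weight_decay * step_size.
  double w = 1.0;
  w -= lr * weight_decay * w;

  // (b) Epsilon placement: adding eps before vs. after dividing by
  //     bias_correction2 changes the effective scale of eps, so the same eps
  //     hyperparameter behaves differently in the two variants.
  const double denom_here   = std::sqrt(exp_avg_sq) + eps;                     // this PR
  const double denom_python = std::sqrt(exp_avg_sq / bias_correction2) + eps;  // Python impl
  w -= lr * exp_avg / denom_here;  // adaptive step using one of the two denominators

  std::cout << w << "  denoms: " << denom_here << " vs " << denom_python << "\n";
  return 0;
}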

Contributor commented:
Note: I didn't check adabound.

Contributor commented:
See #22628 for the discussion about epsilon.

Author commented:
I'll let others who are more involved work on this, then, to make it available.

@dnbaker closed this Sep 4, 2019

Labels: caffe2, module: build (Build system issues), module: cpp (Related to C++ API), module: nccl (Problems related to nccl support), module: onnx (Related to torch.onnx), module: pybind (Related to our Python bindings / interactions with other Python libraries), module: third_party, open source