AdamW and AdaBound algorithms for C++ frontend #17468
Conversation
Summary:

# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](https://github.com/fastai/fastai/blob/803894051bef32304ceea0c8ea5e04db64ff26b8/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training. There have already been several abortive attempts to push this into PyTorch in some form or fashion: #17468, #10866, #3740, #4429. Hopefully this one goes through.

# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. The paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) showed that this is why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer; L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties of adaptive optimization schemes.

# How was this tested?
Test cases were added to `test_optim.py`, and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.

Pull Request resolved: #21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
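To make the "decoupled" step concrete, here is a minimal sketch of how a decoupled weight-decay update differs from folding an L2 penalty into the gradient. It assumes libtorch is available; the free function and the tensor names are illustrative only, not the actual code in this PR or in fastai.

```cpp
#include <torch/torch.h>

// Minimal sketch of decoupled weight decay (AdamW-style): the decay shrinks the
// weights directly and is kept separate from the adaptive, gradient-based update,
// instead of being added to the gradient as an L2 penalty. Illustrative only.
void decoupled_weight_decay_step(torch::Tensor& param,
                                 const torch::Tensor& adam_direction,  // e.g. m_hat / (sqrt(v_hat) + eps)
                                 double lr,
                                 double weight_decay) {
  torch::NoGradGuard no_grad;           // parameter updates are not tracked by autograd
  param.mul_(1.0 - lr * weight_decay);  // decoupled decay, scaled by the learning rate
  param.add_(adam_direction, -lr);      // then the usual Adam-style step
}
```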
yf225 left a comment:
@vincentqb Would you like to review this PR and check that it matches our Python implementation in #21250?
@dnbaker Could you add corresponding tests in test/cpp/api/optim.cpp, similar to
pytorch/test/cpp/api/optim.cpp, lines 184 to 186 in 52de340:

```cpp
TEST(OptimTest, XORConvergence_Adam) {
  ASSERT_TRUE(test_optimizer_xor<Adam>(AdamOptions(0.1).weight_decay(1e-6)));
}
```
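A corresponding test for the new optimizer might look roughly like the sketch below, placed alongside the existing XOR tests so it can reuse the `test_optimizer_xor` helper; the names `AdamW` and `AdamWOptions` are assumed to parallel `Adam`/`AdamOptions` and are not taken from this PR's diff.

```cpp
// Hypothetical sketch: assumes this PR exposes AdamW/AdamWOptions with the same
// interface as Adam/AdamOptions, and lives next to the existing tests in
// test/cpp/api/optim.cpp so the test_optimizer_xor<> helper is available.
TEST(OptimTest, XORConvergence_AdamW) {
  ASSERT_TRUE(test_optimizer_xor<AdamW>(AdamWOptions(0.1).weight_decay(1e-6)));
}
```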
I've added these tests here; let's see if the tests pass before asking for further review.
@dnbaker Thanks! Also make sure to remove changes to …
…s no longer in its source tree; we are using the current master branch for it.
If any of the maintainers are looking, could you point me toward where I need to include my torch/optim/*.cpp files for the build process? The objects aren't being located, and my prior build experience has been primarily with Makefiles, so this is somewhat foreign.
@pytorchbot rebase this please
```cpp
NoGradGuard guard;

if(options.weight_decay_ > 0) {
  Tensor decoupled_weight_decay = p * (-options.weight_decay_ * step_size);
```
Please note that this implementation doesn't match the original paper or the Python implementation.
To match the paper, this should be `-options.weight_decay * options.learning_rate`.
Even then, there would be a difference from the PyTorch Python implementation, because epsilon is added to `sqrt(exp_average_sq)` here, while it is added to `sqrt(exp_average_sq / bias_correction2)` in Python. This essentially changes the scale of epsilon and makes the two implementations incompatible hyperparameter-wise.
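To illustrate the epsilon point numerically, here is a small self-contained sketch (assuming libtorch; the variable names are illustrative and taken from the discussion above, not from either code base) comparing the two denominators: adding epsilon to `sqrt(exp_avg_sq)` before the bias correction versus adding it after bias-correcting, as the Python optimizer does.

```cpp
#include <torch/torch.h>
#include <cmath>
#include <iostream>

int main() {
  // Illustrative values only.
  torch::Tensor exp_avg_sq = torch::full({1}, 1e-6);  // second-moment estimate
  const double eps = 1e-8;
  const double beta2 = 0.999;
  const int64_t step = 1;
  const double bias_correction2 = 1.0 - std::pow(beta2, step);

  // Variant discussed for this PR's C++ code: eps is added to sqrt(exp_avg_sq),
  // with the bias correction folded into the step size afterwards.
  torch::Tensor denom_cpp = exp_avg_sq.sqrt() + eps;

  // Variant used by the Python optimizer: bias-correct first, then add eps.
  torch::Tensor denom_py = (exp_avg_sq / bias_correction2).sqrt() + eps;

  // The two denominators differ whenever bias_correction2 != 1, so the same eps
  // value does not mean the same thing in both implementations.
  std::cout << denom_cpp.item<double>() << " vs " << denom_py.item<double>() << "\n";
  return 0;
}
```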
Note: I didn't check adabound.
See #22628 for the discussion about epsilon.
I'll let others who are more involved work on this, then, to make it available.
I've provided implementations of the AdamW and AdaBound algorithms for the C++ frontend, as described in [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) and the AdaBound paper (Adaptive Gradient Methods with Dynamic Bound of Learning Rate), respectively.
I'm not sure these are integrated into the libtorch build; I haven't been able to get far enough with the build system on my machine to get past linking errors. (It seems that libtorch doesn't provide libc10d on OSX, which makes testing difficult.) However, these object files do compile.
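For reviewers, a rough sketch of how the new optimizers are meant to be used from the C++ frontend is below. The class and option names (`AdamW`, `AdamWOptions`, `AdaBound`, `AdaBoundOptions`) and the `final_lr` option are assumptions made to parallel the existing `Adam`/`AdamOptions` API and the AdaBound paper; they are not confirmed against this PR's diff.

```cpp
#include <torch/torch.h>

// Hypothetical usage sketch only: names parallel the existing Adam/AdamOptions
// API and are not taken from this PR's source.
int main() {
  auto model = torch::nn::Linear(10, 1);

  // AdamW: Adam with decoupled weight decay (arXiv:1711.05101).
  torch::optim::AdamW adamw(
      model->parameters(),
      torch::optim::AdamWOptions(1e-3).weight_decay(1e-2));

  // AdaBound: Adam-style steps whose per-parameter learning rates are clipped by
  // bounds that tighten toward a final SGD-like rate (assumed `final_lr` option).
  torch::optim::AdaBound adabound(
      model->parameters(),
      torch::optim::AdaBoundOptions(1e-3).final_lr(0.1));

  // A single training step looks the same as with any other C++ frontend optimizer.
  auto input = torch::randn({4, 10});
  auto target = torch::randn({4, 1});
  adamw.zero_grad();
  auto loss = torch::mse_loss(model->forward(input), target);
  loss.backward();
  adamw.step();
  return 0;
}
```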
I've gone over the guidelines here, and I hope I've done what I needed to, but please correct me if I've missed anything.
Thank you!