
MTA AdamWOptimizer #11506

Merged
pengwa merged 34 commits into master from pengwa/adamv2 on May 27, 2022

Conversation

@pengwa
Contributor

@pengwa pengwa commented May 12, 2022

Description: MTA (Multi-Tensor Apply) AdamW Optimizer Implementation.

  • This is added intentionally to support internal customers; it can also be used for general training.
  • The implementation leverages `Seq` to manage groups of parameters/gradients/momentums, instead of using fixed-length variadic inputs (as we did previously for Lamb).
  • Multi-tensor apply is used.
  • AdamW equivalents of Torch AdamW and HF AdamW are provided, to make it easier to migrate models trained with external libraries.



@pengwa pengwa added the training issues related to ONNX Runtime training; typically submitted using template label May 12, 2022
@pengwa pengwa dismissed stale reviews from baijumeswani and askhade via f8b86d2 May 24, 2022 03:56
baijumeswani previously approved these changes May 25, 2022
// > Note: this differs slightly from Apex's implementation, which stores the
// > corrections in double and casts them to float only when passing them to the kernels.
// > std::pow(float, int) returns double since C++11, so we cast the result back to float.
alpha_correction = 1.f - static_cast<float>(std::pow(alpha, update_count));
Contributor

@ashbhandare ashbhandare May 25, 2022

Will we get better precision by casting to float after the subtraction?
std::pow(alpha, update_count) is a small number < 1, so a precision loss will affect it much more than it affects (1 - <a small number>); same for the beta correction. Please correct me if I'm wrong.

Contributor Author

Yeah, it might be. But as the comment says, this is meant to match what we see in other frameworks.
I originally used double to calculate these, then changed from double to float to avoid rare cases where it might introduce differences between our runs and Torch/HF runs. Does that make sense?

Contributor

nit: The comment doesn't make clear whether we are matching Apex or diverging from it, so if you have other things to fix, please make the comment a bit clearer too.

@pengwa pengwa changed the title from MTA Adam Optimizer to MTA AdamWOptimizer May 26, 2022
Contributor

@ashbhandare ashbhandare left a comment

LGTM

@pengwa pengwa merged commit 44f7b1b into master May 27, 2022
@pengwa pengwa deleted the pengwa/adamv2 branch May 27, 2022 11:52

5 participants