Add support for gradient clipping #11697
baijumeswani merged 9 commits into training_dev/on_device_poc
Conversation
This pull request fixes 3 alerts when merging 74b1bc6 into 1c316d0 - view on LGTM.com
In ORT, SequenceConstruct currently makes a copy of every tensor in its inputs. Here in the optimizer graph, we have all the gradients as inputs, so we have to construct a sequence and then feed it into the optimizer. Another approach I can think of is to pass the sequences of grads, params, and momentums directly as inputs to the optimizer graph. This is feasible if we manage the sequences in Step().
orttraining/orttraining/training_api/onnxruntime_training_c_api.cc
    const std::vector<std::shared_ptr<onnxruntime::IExecutionProviderFactory>>& provider_factories) {
  std::vector<std::shared_ptr<onnxruntime::IExecutionProvider>> execution_providers;
  for (const auto& factory : provider_factories) {
    execution_providers.emplace_back(std::move(factory->CreateProvider()));
Since we are creating the provider here, the same instance of the provider will be shared among the training, eval, and optimizer sessions, right?
For inference scenarios we don't share an EP instance among inference sessions, but in ORTModule we do... Just wondering, do we know of any implications of sharing the provider instance?
I am not sure if there are any guidelines around sharing the provider among different inference sessions. If there are, please let me know.
From an API design viewpoint, I think we should treat TrainingSession as the training-world equivalent of InferenceSession. Extending that further, all components inside the TrainingSession should share the same instance of the provider and, by extension, the allocator (per provider). But if this is not the expected usage for the provider, we can change it.
@pengwa, @ashbhandare please provide any insight that may be relevant.
orttraining/orttraining/test/training_api/core/training_api_tests.cc
const auto& tensor_seq = feed.Get<TensorSeq>();
if (tensor_seq.Size() != std::size_t{0}) {
  feed_locations[i] = tensor_seq.Get(0).Location().device;
}
Shall we fix the output too?
Yes, we should. Let me do this in a follow-up pull request so that this PR can focus specifically on the on-device training use case. I will create a work item and complete it in another PR.
orttraining/orttraining/test/training_api/core/checkpoint_test.cc
    : named_parameters_{parameters},
      module_{std::make_unique<Module>(model_identifiers.train_model, named_parameters_,
                                       session_options, session_env, providers, model_identifiers.eval_model)},
      optimizer_{model_identifiers.optim_model.has_value()
Any reason to put it in the body as opposed to the initialization list?
    tensor_location.mem_type == OrtMemTypeCPUOutput) {
  memset(p_tensor->MutableDataRaw(), 0, p_tensor->SizeInBytes());
} else if (tensor_location.device.Type() == OrtDevice::GPU) {
  // Use a tensor on cpu and copy it over to the desired device using
Is there a more perf-efficient way to do this? Given this is during initialization it may be OK, but wondering whether we can use cudaMemset here?
utils.cc is part of the onnxruntime_session target. This target is currently not linked against the cuda libraries, because onnxruntime_session should not care about which providers are supported; it should be provider agnostic.
We could probably add target_link_libraries for onnxruntime_session against the cuda libraries to work around this, but that might not be the right solution. Instead, we could add a method on the execution providers that performs the memset on the device.
I think this should be done separately in another PR where the focus can be only on this functionality.
This pull request adds support for gradient clipping and also integrates with the AdamWOptimizer changes introduced in #11506. To generate the onnx model for the AdamW optimizer, users can simply do [code snippet not captured] and to add gradient clipping: [code snippet not captured]
The gradient clipping looks like:
[image not captured]