Add support for gradient clipping #11697
baijumeswani merged 9 commits into training_dev/on_device_poc
Conversation
This pull request fixes 3 alerts when merging 74b1bc6 into 1c316d0 - view on LGTM.com
In ORT, SequenceConstruct currently makes a copy of every tensor in its inputs. Here in the optimizer graph, we have all the gradients as inputs, so we have to construct a sequence and then feed it into the optimizer. Another approach I can think of is to pass the sequences of grads, params, and momentums directly as inputs to the optimizer graph. This is feasible if we manage the sequences in Step().
orttraining/orttraining/training_api/onnxruntime_training_c_api.cc
    const std::vector<std::shared_ptr<onnxruntime::IExecutionProviderFactory>>& provider_factories) {
  std::vector<std::shared_ptr<onnxruntime::IExecutionProvider>> execution_providers;
  for (const auto& factory : provider_factories) {
    execution_providers.emplace_back(std::move(factory->CreateProvider()));
Since we are creating the provider here, the same instance of the provider will be shared among the training, eval, and optimizer sessions, right?
For inference scenarios we don't share an EP instance among inference sessions, but in ORTModule we do... Just wondering, do we know of any implications of sharing the provider instance?
I am not sure if there are any guidelines around sharing the provider among different inference sessions. If there are, please let me know.
From an API design viewpoint, I think we should treat TrainingSession as the training-world equivalent of InferenceSession. Extending that further, all components inside the TrainingSession should share the same instance of the provider and, by extension, the allocator (per provider). But if this is not the expected usage for the provider, we can change it.
@pengwa, @ashbhandare please provide any insight that may be relevant.
orttraining/orttraining/test/training_api/core/training_api_tests.cc
const auto& tensor_seq = feed.Get<TensorSeq>();
if (tensor_seq.Size() != std::size_t{0}) {
  feed_locations[i] = tensor_seq.Get(0).Location().device;
}
Shall we fix the output too?
Yes, we should. Let me do this in a follow-up pull request so that this PR can focus specifically on the on-device training use case. I will create a work item and complete it in another PR.
orttraining/orttraining/test/training_api/core/checkpoint_test.cc
    : named_parameters_{parameters},
      module_{std::make_unique<Module>(model_identifiers.train_model, named_parameters_,
                                       session_options, session_env, providers, model_identifiers.eval_model)},
      optimizer_{model_identifiers.optim_model.has_value()
Any reason to put it in the body as opposed to the initialization list?
    tensor_location.mem_type == OrtMemTypeCPUOutput) {
  memset(p_tensor->MutableDataRaw(), 0, p_tensor->SizeInBytes());
} else if (tensor_location.device.Type() == OrtDevice::GPU) {
  // Use a tensor on cpu and copy it over to the desired device using
Is there a more perf-efficient way to do this? Given this is during initialization it may be OK, but wondering whether we can use cudaMemset here?
utils.cc is part of the onnxruntime_session target. This target is currently not linked against the cuda libraries, because onnxruntime_session should not care about which providers are supported; it should be provider agnostic.
We could probably add target_link_libraries for onnxruntime_session against the cuda libraries to work around this, but that might not be the right solution. Instead, we could add a method on the execution providers that performs the memset on the device.
I think this should be done separately in another PR where the focus can be only on this functionality.
This pull request adds support for gradient clipping and also integrates with the AdamWOptimizer changes introduced in #11506. To generate the onnx model for the AdamW optimizer, users can simply do [code snippet not captured] and to add gradient clipping: [code snippet not captured]
The gradient clipping looks like:
[image not captured]