Multi-stream execution support#13495

Merged
souptc merged 59 commits into main from chenta/multi-stream
Dec 15, 2022
Conversation

@souptc
Member

@souptc souptc commented Oct 28, 2022

Description: This PR includes the following work:

  1. Provide stream and related synchronization abstractions in onnxruntime.
  2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel.
  3. Deprecate the parallel executor for CPU.
  4. Deprecate the Fence mechanism.
  5. Update the CUDA / TensorRT EPs to support the stream mechanism, so different requests can run on different CUDA streams.

Motivation and Context

  • Why is this change required?
    Currently, the execution plan is just a linear list of primitives that ORT executes step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:
  1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPUs.
  2. The Fence mechanism works for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage the cross-stream synchronization on the GPU.
  3. Our CUDA EP relies on the BFCArena to make memory management work with asynchronous GPU kernels, but the current BFCArena is not aware of streams, so it does not behave correctly when running with multiple streams.

This PR enhances our existing execution plan and executor to support multi-stream execution. We use a unified algorithm to manage both the single-stream and multi-stream scenarios.
This PR mainly focuses on the infrastructure for multi-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will come in a future PR.
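To make the synchronization abstraction concrete, here is a minimal sketch of what cross-stream coordination looks like: two logic streams run their op sequences concurrently, and a data-dependency edge between them is enforced through a notification primitive. All names here (`Notification`, `RunTwoStreamPlan`) are hypothetical illustrations, not the actual onnxruntime API.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical notification primitive: the producer stream activates it,
// consumer streams block on Wait() until activation.
class Notification {
 public:
  void Activate() {
    {
      std::lock_guard<std::mutex> lk(m_);
      ready_ = true;
    }
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return ready_; });
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  bool ready_ = false;
};

// Two logic streams: stream 0 produces x and signals; stream 1 waits on
// the notification before consuming x, mirroring a data-dependency edge
// between sequences.
int RunTwoStreamPlan() {
  int x = 0;
  int y = 0;
  Notification x_ready;
  std::thread s0([&] {
    x = 21;            // "kernel" on stream 0
    x_ready.Activate();
  });
  std::thread s1([&] {
    x_ready.Wait();    // cross-stream synchronization point
    y = x * 2;         // "kernel" on stream 1 consuming stream 0's output
  });
  s0.join();
  s1.join();
  return y;
}
```

On a GPU this wait would map to something like a CUDA event rather than a condition variable, but the planner-level contract is the same: within a sequence the order is fixed, and across sequences only explicit notifications impose ordering.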

@souptc souptc changed the title from "squashed commit from PE branch" to "Multi-stream execution support" on Oct 28, 2022
@souptc souptc mentioned this pull request Oct 28, 2022
static const char* const kOrtSessionOptionsConfigStrictShapeTypeInference = "session.strict_shape_type_inference";

// The file that saves the configuration for partitioning nodes among logic streams
static const char* const kNodePartitionConfigFile = "session.node_partition_config_file";
Member Author


from Pranav:

Is this required for this PR? If not, can we please separate these as it's experimental?

Member Author


It is not strongly required, but we need it to trigger the unit tests for the multi-stream-in-a-single-graph feature. Let me fix the other comments first, then come back to this one.

@souptc
Member Author

souptc commented Oct 28, 2022

This PR is squashed from previous PR #12227 .

Posting the unresolved comments from Pranav:

  1. Can you add unit tests for all the new constructs?
  2. What's the code coverage?
  3. Do we plan to exercise all the use cases this enables in CI?
  4. How do we plan to measure correctness? I don't see tests that do this.
  5. Debuggability: can you check all places where adding more debug statements will help?

return stream->cublas_handle_;
}

inline onnxruntime::Stream* OrtStream(OpKernelContext* ctx) const {
Contributor


My 2 cents: we don't even need this function. Any place that invokes OrtStream(context) can simply be replaced with context->GetComputeStream(). It is the base Stream behavior and has nothing to do with the specific cudaStream.

If we keep it, we will have to duplicate this function in rocm_kernel as well.

@lgtm-com

lgtm-com bot commented Oct 28, 2022

This pull request introduces 1 alert and fixes 6 when merging 53ac1d6 into 689e524 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

@yuslepukhin
Member

General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

@souptc
Member Author

souptc commented Oct 28, 2022

> One thing is not clear. Are streams assigned per inference request, or are they set in stone? I am seeing cuda_stream_ being a member of kernels, which implies they are set at instantiation time.

The stream is passed in through OpKernelContext; it is not saved in the kernel instance. Where did you see the cuda_stream_?

@souptc
Member Author

souptc commented Oct 29, 2022

> General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

Synced offline. The std::function introduced in the allocator / Tensor / Buffer deleters is the main concern; will resolve this in the PR.
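The allocation concern raised above comes from the fact that a `std::function` holding a capturing lambda may heap-allocate storage for the callable. One common way to avoid that, sketched below with hypothetical names (`BufferDeleter`, `CountingAllocator` are illustrations, not the onnxruntime types), is a plain function pointer plus a context pointer, which is trivially copyable and never allocates:

```cpp
#include <memory>

// A fixed-size deleter: a function pointer and an opaque context.
// Constructing or copying this never touches the heap, unlike a
// std::function that captures state beyond its small-buffer limit.
struct BufferDeleter {
  void (*fn)(void* ctx, void* p);
  void* ctx;
  void operator()(void* p) const { fn(ctx, p); }
};

using BufferUniquePtr = std::unique_ptr<void, BufferDeleter>;

// Illustrative allocator that counts how many buffers it has freed.
struct CountingAllocator {
  int freed = 0;
  static void Free(void* ctx, void* p) {
    static_cast<CountingAllocator*>(ctx)->freed++;
    ::operator delete(p);
  }
};

// Allocate a buffer, let the unique_ptr release it at scope exit, and
// report how many frees the allocator observed.
int FreedAfterScope() {
  CountingAllocator alloc;
  {
    BufferUniquePtr buf(::operator new(16),
                        BufferDeleter{&CountingAllocator::Free, &alloc});
  }  // deleter runs here, routed through the function pointer
  return alloc.freed;
}
```

The same shape works for any callback slot: the call site stays as flexible as with `std::function`, but the deleter's size and cost are fixed at compile time.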

@souptc
Member Author

souptc commented Nov 1, 2022

> General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

@yuslepukhin, I have updated the implementation to avoid the std::function usage. Please help review again to check whether anything still has dynamic-allocation concerns.

@lgtm-com

lgtm-com bot commented Nov 1, 2022

This pull request introduces 1 alert and fixes 6 when merging e5ffbe0 into b5904c4 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

RandySheriffH and others added 2 commits November 1, 2022 14:48
1. Rename the partitioner;
2. Let the ::partition function return a status;
3. A few other minor fixes.

Co-authored-by: Randy Shuai <[email protected]>
@lgtm-com

lgtm-com bot commented Nov 2, 2022

This pull request introduces 1 alert and fixes 6 when merging bea5c1d into b5904c4 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

### Description
Use a smart pointer for CpuBuffersInfo in the CUDA and ROCm EPs.

### Motivation and Context
Use a smart pointer for CpuBuffersInfo in the CUDA and ROCm EPs.

Co-authored-by: Lei Cao <[email protected]>
pranavsharma
pranavsharma previously approved these changes Dec 13, 2022
Contributor

@pranavsharma pranavsharma left a comment


LGTM 👍

// forward declaration
class SessionState;
class IAllocator;
void* AllocateBufferWithOptions(std::shared_ptr<IAllocator>& allocator, size_t size, bool use_reserve, Stream* stream, WaitNotificationFn wait_fn);
Contributor


nit: it doesn't look like the implementation of this cares whether the IAllocator is in a shared_ptr or not, so it could take IAllocator& to be more flexible.


virtual Status Compute(_Inout_ OpKernelContext* context) const ORT_MUST_USE_RESULT = 0;

[[nodiscard]] virtual bool IsAsync() const {
Contributor


nit: Is [[nodiscard]] a replacement for the ORT_MUST_USE_RESULT we have on other methods here? It would be nice to be consistent.

// Given a graph with node placement information, partition the nodes into multiple sequences.
// Each sequence can be executed independently. The nodes in each sequence are executed in order,
// but we can't assume any execution order between sequences, unless there is a data dependency.
class IGraphPartitioner {
Contributor


@pranavsharma is ISequencePartitioner ok or do you have a better name?
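The partitioning contract described in the comment above — nodes in topological order, each carrying a stream assignment, bucketed into per-stream sequences that preserve intra-sequence order — can be sketched in a few lines. The names here (`NodeAssignment`, `PartitionIntoSequences`) are hypothetical, not the actual IGraphPartitioner interface:

```cpp
#include <map>
#include <vector>

// Hypothetical input record: a node id paired with the logic stream it
// has been assigned to. The caller supplies nodes in topological order.
struct NodeAssignment {
  int node_id;
  int stream_id;
};

// Bucket nodes into one ordered sequence per stream. Because the input
// is already topologically sorted, each per-stream sequence preserves
// the intra-stream execution order; ordering *across* sequences is left
// to the data-dependency notifications at runtime.
std::map<int, std::vector<int>> PartitionIntoSequences(
    const std::vector<NodeAssignment>& topo_order) {
  std::map<int, std::vector<int>> sequences;
  for (const auto& n : topo_order) {
    sequences[n.stream_id].push_back(n.node_id);
  }
  return sequences;
}
```

A real partitioner would also have to insert the wait/notify nodes at every edge that crosses a sequence boundary, which is where most of the planner complexity lives.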

WaitNotificationFn wait_fn);
// For any chunk associated with the target stream, reset it to the default state (nullptr stream, timestamp 0).
// Perform coalescing if coalesce_flag is true.
void ResetChunkOnTargetStream(Stream* target_stream, bool coalesce_flag);
Contributor


Can this be conditional on ENABLE_STREAM?
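The reset semantics in the snippet above — each free chunk remembers which stream last touched it plus a timestamp, and a reset returns that stream's chunks to the default state so any stream may reuse them — can be sketched as follows. `Chunk`, `ResetChunksOnStream`, and `CountDefault` are illustrative names, not the BFCArena internals:

```cpp
#include <cstddef>
#include <vector>

// Illustrative free-list entry: a stream-aware arena tags each chunk
// with the stream that last used it and a synchronization timestamp,
// so a chunk is only handed to a *different* stream after the proper
// cross-stream wait. nullptr / 0 means "default": safe for any stream.
struct Chunk {
  const void* stream = nullptr;
  std::size_t timestamp = 0;
};

// Reset every chunk associated with the target stream back to the
// default state (a real arena would also coalesce neighbors here when
// the coalesce flag is set).
void ResetChunksOnStream(std::vector<Chunk>& chunks, const void* target) {
  for (auto& c : chunks) {
    if (c.stream == target) {
      c.stream = nullptr;
      c.timestamp = 0;
    }
  }
}

// Helper for inspection: how many chunks are currently unrestricted.
int CountDefault(const std::vector<Chunk>& chunks) {
  int n = 0;
  for (const auto& c : chunks) {
    if (c.stream == nullptr && c.timestamp == 0) ++n;
  }
  return n;
}
```

This tagging is exactly what the plain BFCArena lacks: without it, a chunk freed by an in-flight async kernel on one stream could be immediately recycled by another stream.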

ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(BFCArena);
};

class StreamAwareArena : public BFCArena {
Contributor


Can this be conditional on ENABLE_STREAM?

@@ -0,0 +1,48 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
Contributor


Can the full contents of this file and the .cc be conditional on ENABLE_STREAM?

It seems most of the usage is ifdef'd on that, but an accidental usage somewhere would not break the build.

Member Author


We could, but I need to scan the code to see whether any function signatures need to be refactored.

@souptc
Member Author

souptc commented Dec 14, 2022

Synced offline with Scott; @LEI cao will follow up on the open comments in the next PR.

@souptc souptc merged commit a81faee into main Dec 15, 2022
@souptc souptc deleted the chenta/multi-stream branch December 15, 2022 15:39
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
jslhcl added a commit that referenced this pull request Jan 4, 2023
### Description
This PR is to address follow-up comments for the multi-stream pr
#13495

Changes including:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
#13495

Co-authored-by: Lei Cao <[email protected]>
@pengwa pengwa mentioned this pull request Feb 15, 2023
pengwa added a commit that referenced this pull request Feb 23, 2023
### Fix memory profiler

A follow up fix for PR
#13495

In ORTModule training, `PartialExecuteThePlan` is called twice; we need to create the log event after the backward graph run completes, to collect the activation info for the whole training graph.

Also change some log statements to verbose level, to avoid emitting too many logs at levels above verbose.

jywu-msft pushed a commit that referenced this pull request Mar 29, 2023
### Description
**Multi-stream** execution support for **CANN EP**.

### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](#13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.

This PR is to fix it.
tianleiwu added a commit that referenced this pull request Jun 15, 2023
Fix two issues related to cuda graph capture:
#14942 and
#15002

Issue 1: Previously, graph capture started at the second run. However, memory pattern optimization will allocate memory during the second run, and cudaMalloc is not allowed during graph capture. In this PR, graph capture starts after two runs to avoid the issue.

Issue 2: #13495 introduced multi-stream support, but stream cleanup calls cudaStreamSynchronize, which is not allowed during cuda graph capture. In this PR, we move stream cleanup to after cuda graph capture.

Update the SqueezeNet test model with a dynamic axis so that we can test with a larger batch size. Add a test that reproduces the bug (when changing the minimum runs from 2 back to 1).