Multi-stream executor by souptc · Pull Request #12227 · microsoft/onnxruntime

souptc · 2022-07-19T02:51:36Z

Description: This PR includes the following work:

Provide stream and related synchronization abstractions in onnxruntime.
Enhance onnxruntime's execution planner, executor, and memory arena to support executing multiple streams in parallel.
Deprecate the parallel executor for CPU.
Deprecate the Fence mechanism.
Update the CUDA and TensorRT EPs to support the stream mechanism, so that different requests can run in different CUDA streams.

Motivation and Context

Why is this change required?
Currently, the execution plan is just a linear list of those primitives, and ORT executes them step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:

It is difficult to enable inter-node parallelization. We have a half-baked parallel executor, but it is very difficult to make it work with GPU.
The fence mechanism works for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage synchronization across GPU streams.
Our CUDA EP relies on the BFCArena to make memory management work with asynchronous GPU kernels, but the current BFCArena is not aware of streams, so it doesn't behave correctly when run with multiple streams.
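
To illustrate the third limitation, here is a toy sketch of why an arena needs to know about streams. It is not onnxruntime's BFCArena; `ToyChunk`, `ToyStreamAwareArena`, and the reuse policy are assumptions made up for this example. The idea is that a chunk freed on the host may still be in use by kernels queued on its stream, so the sketch only hands it back to the same stream, where execution order already protects it.

```cpp
#include <cstddef>
#include <vector>

// Toy model only: names and policy are illustrative assumptions,
// not onnxruntime's BFCArena.
struct ToyChunk {
  std::size_t size;
  int stream_id;  // stream that last used this chunk
  bool in_use;
};

class ToyStreamAwareArena {
 public:
  // Returns the index of a chunk that is safe to use on `stream_id`.
  std::size_t Alloc(std::size_t size, int stream_id) {
    for (std::size_t i = 0; i < chunks_.size(); ++i) {
      ToyChunk& c = chunks_[i];
      // Same-stream reuse is safe: work on one stream runs in issue order,
      // so earlier kernels are done with the chunk before later ones touch it.
      if (!c.in_use && c.size >= size && c.stream_id == stream_id) {
        c.in_use = true;
        return i;
      }
    }
    // No same-stream chunk is free, so create a new one instead of
    // borrowing a chunk that another stream may still be reading.
    chunks_.push_back(ToyChunk{size, stream_id, true});
    return chunks_.size() - 1;
  }

  // Host-side free; kernels already queued on the owning stream may
  // still be running at this point.
  void Free(std::size_t index) { chunks_[index].in_use = false; }

 private:
  std::vector<ToyChunk> chunks_;
};
```

In this sketch, a chunk allocated and freed on stream 0 is not handed to a request from stream 1, but is reused by the next request from stream 0. A real arena would also want a way to release chunks to other streams once the streams have been synchronized; that part is left out here.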

This PR enhances our existing execution plan and executor to support multi-stream execution. We use a unified algorithm to manage both single-stream and multi-stream scenarios.
This PR mainly focuses on infrastructure support for multi-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR.
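
As a rough illustration of what "given a valid stream assignment, execute it correctly" could mean, the sketch below models each logical stream as an ordered list of steps and runs every stream on its own host thread. A cross-stream dependency is expressed as an explicit notify step on the producer stream and a wait step on the consumer stream. All names here (`Notification`, `LogicStream`, `RunPlan`, `RunTwoStreamDemo`) are hypothetical and are not taken from the PR's code.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical host-side notification: one stream activates it,
// another stream waits on it.
class Notification {
 public:
  void Activate() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      ready_ = true;
    }
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return ready_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool ready_ = false;
};

// A logical stream is an ordered list of steps (run a node, notify, wait).
using Step = std::function<void()>;
using LogicStream = std::vector<Step>;

// Run every logical stream on its own thread, then join them all.
void RunPlan(const std::vector<LogicStream>& plan) {
  std::vector<std::thread> workers;
  for (const auto& stream : plan) {
    workers.emplace_back([&stream] {
      for (const auto& step : stream) step();
    });
  }
  for (auto& t : workers) t.join();
}

// Demo: node B on stream 1 consumes the output of node A on stream 0.
std::vector<std::string> RunTwoStreamDemo() {
  std::vector<std::string> trace;
  std::mutex trace_mu;
  Notification a_done;
  auto record = [&](const std::string& name) {
    std::lock_guard<std::mutex> lock(trace_mu);
    trace.push_back(name);
  };
  LogicStream stream0 = {[&] { record("A"); }, [&] { a_done.Activate(); }};
  LogicStream stream1 = {[&] { a_done.Wait(); }, [&] { record("B"); }};
  RunPlan({stream0, stream1});
  return trace;
}
```

With a single stream the same `RunPlan` degenerates to the sequential case, which is one way to read the "unified algorithm" remark above. On a GPU the notification would presumably be backed by a device event rather than a condition variable; this sketch stays on the host.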

…range for better profiling

### Description
The compare function currently used to decide whether two elements of the std::set<OrtDevice> in SessionState::ResolveMemoryPatternFlag() are equivalent is not right, which caused mem_pattern_ to behave differently between the main and pe branches.

### Motivation and Context
This change makes mem_pattern_ behave the same on the main and pe branches.

Co-authored-by: Lei Cao <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

lgtm-com · 2022-10-15T02:36:57Z

This pull request introduces 18 alerts and fixes 6 when merging 709f424 into 1ab11a1 - view on LGTM.com

new alerts:

17 for Commented-out code
1 for No trivial switch statements

fixed alerts:

6 for Commented-out code

…ame (#13334)

replace vector with InlinedVector in ExecutionContext and ExecutionFrame

### Description
replace vector with InlinedVector in ExecutionContext and ExecutionFrame

### Motivation and Context
replace vector with InlinedVector in ExecutionContext and ExecutionFrame

Co-authored-by: Lei Cao <[email protected]>

lgtm-com · 2022-10-18T19:59:30Z

This pull request introduces 18 alerts and fixes 6 when merging 94c163b into e398241 - view on LGTM.com

new alerts:

17 for Commented-out code
1 for No trivial switch statements

fixed alerts:

6 for Commented-out code

onnxruntime/core/providers/cuda/cuda_kernel.h

jslhcl · 2022-10-21T17:31:52Z

    auto& node_device_mem_location = ep->GetAllocator(ep->GetDeviceId(), OrtMemType::OrtMemTypeDefault)->Info();

There is no need to pass the device ID as a parameter; we can get the device ID inside the implementation of GetAllocator().

Refers to: onnxruntime/core/framework/allocation_planner.cc:1899 in 94c163b. [](commit_id = 94c163b, deletion_comment = False)

### Description
use cudaStreamNonBlocking for perf improvement

### Motivation and Context
use cudaStreamNonBlocking for perf improvement

Co-authored-by: Lei Cao <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>

pranavsharma

Still reviewing. In the meantime, can you please resolve the comments added by the bots? They are making the PR quite inconvenient to review.

include/onnxruntime/core/common/inlined_containers_fwd.h

pranavsharma · 2022-10-21T19:22:18Z

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

 static const char* const kOrtSessionOptionsConfigStrictShapeTypeInference = "session.strict_shape_type_inference";
+
+// The file saves configuration for partitioning node among logic streams
+static const char* const kNodePartitionConfigFile = "session.node_partition_config_file";


Is this required for this PR? If not, can we please separate these as it's experimental?

It is not strictly required, but we need it to trigger the unit tests for the multi-stream-in-a-single-graph feature. Let me fix the other comments first and then come back to this one.

I will carry this comment over to the new PR: #13495

include/onnxruntime/core/framework/stream_handles.h

pranavsharma · 2022-10-21T23:11:00Z

include/onnxruntime/core/framework/allocator.h

 constexpr size_t kAllocAlignment = 256;

+class IAllocator;
+std::function<void*(size_t)> GetAllocationFn(std::shared_ptr<IAllocator> allocator, bool use_reserve, Stream* stream, WaitNotificationFn wait_fn);


Why is the allocator aware of the stream and notification?

This function is mainly used for the GPU EP's GetScratchBuffer, which needs to be aware of which stream the scratch buffer is assigned to. It is a helper function that wraps the stream-aware allocation function; it won't impact other EPs' usage.
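
The helper described in this reply can be pictured as a closure that binds an allocator and a stream, so the caller only supplies a size. The sketch below uses toy stand-ins (`ToyStream`, `ToyAllocator`, `MakeAllocationFn`), which are assumptions for illustration and not the onnxruntime types; it also leaves out the `use_reserve` and `wait_fn` parameters that appear in the real declaration above.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>

// Toy stand-ins; NOT the onnxruntime definitions.
struct ToyStream {
  int id;
};

struct ToyAllocator {
  const ToyStream* last_stream = nullptr;  // records the stream of the last stream-aware request

  // Plain allocation with no stream information.
  void* Alloc(std::size_t size) { return std::malloc(size); }

  // Stream-aware allocation: a real arena could use `stream` to decide
  // which chunks are safe to hand out; this toy only records it.
  void* AllocOnStream(std::size_t size, const ToyStream* stream) {
    last_stream = stream;
    return std::malloc(size);
  }

  void Free(void* p) { std::free(p); }
};

// Bind the allocator and the stream once; callers such as a scratch-buffer
// helper then only pass a size.
std::function<void*(std::size_t)> MakeAllocationFn(std::shared_ptr<ToyAllocator> allocator,
                                                   const ToyStream* stream) {
  return [allocator, stream](std::size_t size) -> void* {
    if (stream != nullptr) {
      return allocator->AllocOnStream(size, stream);
    }
    return allocator->Alloc(size);  // no stream: fall back to the plain path
  };
}
```

Because the stream is captured inside the returned function, code that does not care about streams can pass a null stream and keep the plain allocation path, which matches the claim that other EPs are unaffected.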

lgtm-com · 2022-10-22T01:03:06Z

This pull request introduces 18 alerts and fixes 6 when merging 72c3e6b into 928c988 - view on LGTM.com

new alerts:

17 for Commented-out code
1 for No trivial switch statements

fixed alerts:

6 for Commented-out code

skottmckay

Initial comments. Have mainly looked at headers in the core code so far and still have a few of those to go before looking at implementation details.

include/onnxruntime/core/framework/allocator.h

include/onnxruntime/core/framework/execution_provider.h

include/onnxruntime/core/framework/op_kernel_context.h

include/onnxruntime/core/framework/stream_handles.h

onnxruntime/core/framework/execution_context.h

onnxruntime/core/framework/sequential_executor.cc

onnxruntime/core/framework/execution_context.h

onnxruntime/core/framework/op_kernel_context_internal.h

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h

onnxruntime/core/providers/cuda/cuda_kernel.h

pranavsharma

Can you add unit tests for all the new constructs?
What's the code coverage?
Do we plan to exercise all the use cases this enables in CI?
How do we plan to measure correctness? I don't see tests that do this.
Debuggability: can you check all places where adding more debug statements will help?

onnxruntime/core/framework/allocation_planner.h

onnxruntime/core/framework/device_stream_collection.h

onnxruntime/core/framework/execution_context.cc

onnxruntime/core/framework/execution_context.h

pranavsharma · 2022-10-26T19:17:47Z

onnxruntime/core/framework/execution_context.h

+   * CountDownBarrier is only for scenario that the v is set
+   * to the # of consumers and each consumer calls Dec() exactly once.
+   */
+  class CountDownBarrier {


Would waitable counter from nsync work instead of reinventing the wheel? https://github.com/google/nsync/blob/master/public/nsync_counter.h

If we have to write this, it would be nice to keep it in a general utility folder/header file, e.g. core/util.

Yes, the nsync counter would work, but nsync is only available on non-Windows platforms, right? This is a relatively simple implementation, so I'd prefer to just implement it rather than use nsync on Linux and another implementation on Windows.
I am OK with keeping it in a separate file if we believe it is useful. Once we confirm there is no other choice, I can work on separating it.

Let's continue the discussion in the new PR: #13495
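
For readers unfamiliar with the construct under discussion, here is a minimal sketch of a count-down barrier in portable C++, under the contract stated in the quoted comment: the counter starts at the number of consumers and each consumer calls `Dec()` exactly once. This is an illustration written for this thread, not the PR's implementation; `SimpleCountDownBarrier` and `CountLastArrivals` are hypothetical names.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Minimal sketch: valid only if the initial value equals the number of
// consumers and every consumer calls Dec() exactly once.
class SimpleCountDownBarrier {
 public:
  explicit SimpleCountDownBarrier(int32_t consumers) : remaining_(consumers) {}

  // Returns true only for the caller that brings the counter to zero.
  bool Dec() {
    // fetch_sub returns the value before the decrement.
    return remaining_.fetch_sub(1, std::memory_order_acq_rel) == 1;
  }

  bool IsDone() const { return remaining_.load(std::memory_order_acquire) == 0; }

 private:
  std::atomic<int32_t> remaining_;
};

// Helper for experimentation: spawn `consumers` threads that each call Dec()
// once and count how many of them observed themselves as the last arrival.
// Returns -1 if the barrier did not reach zero.
int CountLastArrivals(int consumers) {
  SimpleCountDownBarrier barrier(consumers);
  std::atomic<int> last_arrivals{0};
  std::vector<std::thread> threads;
  for (int i = 0; i < consumers; ++i) {
    threads.emplace_back([&] {
      if (barrier.Dec()) last_arrivals.fetch_add(1);
    });
  }
  for (auto& t : threads) t.join();
  return barrier.IsDone() ? last_arrivals.load() : -1;
}
```

Since it relies only on `std::atomic`, a version like this builds the same way on Windows and Linux, which is the portability argument made in the reply above.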

onnxruntime/core/providers/cuda/math/einsum.cc

lgtm-com · 2022-10-28T18:36:09Z

This pull request introduces 1 alert and fixes 6 when merging 9873a3e into 8b0669b - view on LGTM.com

new alerts:

1 for Commented-out code

fixed alerts:

6 for Commented-out code

souptc · 2022-10-28T19:00:50Z

Since there are too many commits in this PR, we squashed them and opened a new PR at #13495.
Closing this one.

chentaMS and others added 30 commits May 5, 2022 11:18

add test model

bb08799

fix cuda build

bad34c2

fix a typo

761fb18

fix cuda build error

cb03801

merge from master; fix cudnn/cublas stream issue

18518f9

turn cuda kernel to async, use a dummy callback for test

68d9f61

add per value alloc planner for PE

985a702

make the data transfer support async copy

6a99fdc

Merge branch 'PE' of https://github.com/microsoft/onnxruntime into PE

2f6b3b7

fix a bug in allocation plan

52d7cf2

add release plan

62fcec2

resolve conflict

39264a1

disable optimization on sess_opt2

cd08390

fix the async copy bug

35ca078

Merge branch 'PE' of https://github.com/microsoft/onnxruntime into PE

5d2fc15

separate device and host ops to different logic stream, and add nvtx …

528cd87

…range for better profiling

temp fix for IsAsync api

44185b8

Merge branch 'master' into chenta/test_pe

12259c2

fix build break

e25917a

Merge branch 'chenta/test_pe' into PE

8c8cb12

define add stream maps to dynamically scaling

f9559a5

reuse sequential exe plan by inheritance

b408452

refine the stream/notification api

f7c5a9e

Merge branch 'PE' of https://github.com/microsoft/onnxruntime into PE

e32245d

create session options for stream configuration

d2477b2

comment usage on grouped_ops and streams_per_op

caba6a3

log logic stream execution

04e375d

merge from PE

337ef4e

reuse release plan from the planner

583f51a

resolve conflict

b3835af

jslhcl reviewed Oct 21, 2022

onnxruntime/core/providers/cuda/cuda_kernel.h

jslhcl and others added 2 commits October 21, 2022 14:24

merge from main and resolve comments

7a7289a

pranavsharma reviewed Oct 21, 2022

build break

72c3e6b

skottmckay reviewed Oct 24, 2022

turn stream into class

5f9dac8

jywu-msft reviewed Oct 25, 2022

onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h

fix comments about execution context

419f218

jywu-msft requested review from chilo-ms and stevenlix October 26, 2022 03:26

Cheng Tang added 3 commits October 26, 2022 04:58

add more documents

2ae3cb6

fix training build

904acd5

add more documents

a016a4f

jslhcl reviewed Oct 26, 2022

onnxruntime/core/providers/cuda/cuda_kernel.h

pranavsharma reviewed Oct 26, 2022

jslhcl reviewed Oct 27, 2022

onnxruntime/core/providers/cuda/math/einsum.cc

Cheng Tang added 2 commits October 28, 2022 15:24

fix more comments in PR

f386b3e

merge from main

9873a3e

souptc mentioned this pull request Oct 28, 2022

Multi-stream execution support #13495

Merged

souptc closed this Oct 28, 2022

jslhcl deleted the PE branch February 15, 2023 22:34

7 participants