Multi-stream execution support#13495

Merged
souptc merged 59 commits into main from chenta/multi-stream
Dec 15, 2022
Conversation

@souptc
Member

@souptc souptc commented Oct 28, 2022

Description: This PR includes the following work:

  1. Provide stream and related synchronization abstractions in onnxruntime.
  2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel.
  3. Deprecate the parallel executor for CPU.
  4. Deprecate the Fence mechanism.
  5. Update the CUDA / TensorRT EPs to support the stream mechanism, so different requests can run on different CUDA streams.

Motivation and Context

  • Why is this change required?
    Currently, the execution plan is just a linear list of primitives that ORT executes step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:
  1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPUs.
  2. The Fence mechanism works for the single GPU stream + CPU thread case, but when extended to multiple streams, it is difficult to manage the cross-stream synchronization on the GPU.
  3. Our CUDA EP relies on the BFCArena to make memory management work with asynchronous GPU kernels, but the current BFCArena is not aware of streams, so it does not behave correctly when running with multiple streams.

This PR enhances our existing execution plan and executor to support multi-stream execution. We use a unified algorithm to manage both the single-stream and multi-stream scenarios.
This PR mainly focuses on the infrastructure for multi-stream execution; that is, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will come in a future PR.
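To make the synchronization abstraction concrete, here is a minimal sketch of what cross-stream coordination looks like: two logic streams run their op sequences concurrently, and a data-dependency edge between them is enforced through a notification primitive. All names here (`Notification`, `RunTwoStreamPlan`) are hypothetical illustrations, not the actual onnxruntime API.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical notification primitive: the producer stream activates it,
// consumer streams block on Wait() until activation.
class Notification {
 public:
  void Activate() {
    {
      std::lock_guard<std::mutex> lk(m_);
      ready_ = true;
    }
    cv_.notify_all();
  }
  void Wait() {
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [this] { return ready_; });
  }

 private:
  std::mutex m_;
  std::condition_variable cv_;
  bool ready_ = false;
};

// Two logic streams: stream 0 produces x and signals; stream 1 waits on
// the notification before consuming x, mirroring a data-dependency edge
// between sequences.
int RunTwoStreamPlan() {
  int x = 0;
  int y = 0;
  Notification x_ready;
  std::thread s0([&] {
    x = 21;            // "kernel" on stream 0
    x_ready.Activate();
  });
  std::thread s1([&] {
    x_ready.Wait();    // cross-stream synchronization point
    y = x * 2;         // "kernel" on stream 1 consuming stream 0's output
  });
  s0.join();
  s1.join();
  return y;
}
```

On a GPU this wait would map to something like a CUDA event rather than a condition variable, but the planner-level contract is the same: within a sequence the order is fixed, and across sequences only explicit notifications impose ordering.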

@souptc souptc changed the title from "squashed commit from PE branch" to "Multi-stream execution support" on Oct 28, 2022
@souptc souptc mentioned this pull request Oct 28, 2022
static const char* const kOrtSessionOptionsConfigStrictShapeTypeInference = "session.strict_shape_type_inference";

// The file that saves the configuration for partitioning nodes among logic streams
static const char* const kNodePartitionConfigFile = "session.node_partition_config_file";
Member Author


from Pranav:

Is this required for this PR? If not, can we please separate these as it's experimental?

Member Author


It is not strongly required, but we need it to trigger the unit tests for the multi-stream-in-a-single-graph feature. Let me fix the other comments first, then come back to this one.

@souptc
Member Author

souptc commented Oct 28, 2022

This PR is squashed from previous PR #12227 .

Posting the unresolved comments from Pranav:

  1. Can you add unit tests for all the new constructs?
  2. What's the code coverage?
  3. Do we plan to exercise all the use cases this enables in CI?
  4. How do we plan to measure correctness? I don't see tests that do this.
  5. Debuggability: can you check all places where adding more debug statements will help?

return stream->cublas_handle_;
}

inline onnxruntime::Stream* OrtStream(OpKernelContext* ctx) const {
Contributor


My 2 cents: we don't even need this function. Any place that invokes OrtStream(context) can simply be replaced with context->GetComputeStream(). It is the base Stream behavior and has nothing to do with the specific cudaStream.

If we keep it, we will have to duplicate this function in rocm_kernel as well.

@lgtm-com

lgtm-com bot commented Oct 28, 2022

This pull request introduces 1 alert and fixes 6 when merging 53ac1d6 into 689e524 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

@yuslepukhin
Member

General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

@souptc
Member Author

souptc commented Oct 28, 2022

> One thing is not clear. Are streams assigned per inference request, or are they set in stone? I am seeing cuda_stream_ being a member of kernels, which implies they are set at instantiation time.

The stream is passed in through OpKernelContext; it is not saved in the kernel instance. Where did you see the cuda_stream_?

@souptc
Member Author

souptc commented Oct 29, 2022

> General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

Synced offline. The std::function introduced in the allocator / Tensor / Buffer deleters is the main concern; will resolve this in the PR.
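The allocation concern raised above comes from the fact that a `std::function` holding a capturing lambda may heap-allocate storage for the callable. One common way to avoid that, sketched below with hypothetical names (`BufferDeleter`, `CountingAllocator` are illustrations, not the onnxruntime types), is a plain function pointer plus a context pointer, which is trivially copyable and never allocates:

```cpp
#include <memory>

// A fixed-size deleter: a function pointer and an opaque context.
// Constructing or copying this never touches the heap, unlike a
// std::function that captures state beyond its small-buffer limit.
struct BufferDeleter {
  void (*fn)(void* ctx, void* p);
  void* ctx;
  void operator()(void* p) const { fn(ctx, p); }
};

using BufferUniquePtr = std::unique_ptr<void, BufferDeleter>;

// Illustrative allocator that counts how many buffers it has freed.
struct CountingAllocator {
  int freed = 0;
  static void Free(void* ctx, void* p) {
    static_cast<CountingAllocator*>(ctx)->freed++;
    ::operator delete(p);
  }
};

// Allocate a buffer, let the unique_ptr release it at scope exit, and
// report how many frees the allocator observed.
int FreedAfterScope() {
  CountingAllocator alloc;
  {
    BufferUniquePtr buf(::operator new(16),
                        BufferDeleter{&CountingAllocator::Free, &alloc});
  }  // deleter runs here, routed through the function pointer
  return alloc.freed;
}
```

The same shape works for any callback slot: the call site stays as flexible as with `std::function`, but the deleter's size and cost are fixed at compile time.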

@souptc
Member Author

souptc commented Nov 1, 2022

> General comment. One of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

@yuslepukhin, I have updated the implementation to avoid the std::function usage. Please help review again to check whether anything still has dynamic-allocation concerns.

@lgtm-com

lgtm-com bot commented Nov 1, 2022

This pull request introduces 1 alert and fixes 6 when merging e5ffbe0 into b5904c4 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

RandySheriffH and others added 2 commits November 1, 2022 14:48
1. Rename the partitioner;
2. Let the ::partition function return a status;
3. A few other minor fixes.

Co-authored-by: Randy Shuai <[email protected]>
@lgtm-com

lgtm-com bot commented Nov 2, 2022

This pull request introduces 1 alert and fixes 6 when merging bea5c1d into b5904c4 - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

### Description
Use a smart pointer for CpuBuffersInfo in the CUDA and ROCm EPs.

### Motivation and Context
Use a smart pointer for CpuBuffersInfo in the CUDA and ROCm EPs.

Co-authored-by: Lei Cao <[email protected]>
pranavsharma
pranavsharma previously approved these changes Dec 13, 2022
Contributor

@pranavsharma pranavsharma left a comment


LGTM 👍

// forward declaration
class SessionState;
class IAllocator;
void* AllocateBufferWithOptions(std::shared_ptr<IAllocator>& allocator, size_t size, bool use_reserve, Stream* stream, WaitNotificationFn wait_fn);
Contributor


nit: it doesn't look like the implementation of this cares whether the IAllocator is in a shared_ptr or not, so it could take IAllocator& to be more flexible.


virtual Status Compute(_Inout_ OpKernelContext* context) const ORT_MUST_USE_RESULT = 0;

[[nodiscard]] virtual bool IsAsync() const {
Contributor


nit: Is [[nodiscard]] a replacement for the ORT_MUST_USE_RESULT we have on other methods here? It would be nice to be consistent.

// Given a graph with node placement information, partition the nodes into multiple sequences.
// Each sequence can be executed independently. The nodes in each sequence are executed in order,
// but we can't assume any execution order between sequences, unless there is a data dependency.
class IGraphPartitioner {
Contributor


@pranavsharma is ISequencePartitioner ok or do you have a better name?
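The partitioning contract described in the comment above — nodes in topological order, each carrying a stream assignment, bucketed into per-stream sequences that preserve intra-sequence order — can be sketched in a few lines. The names here (`NodeAssignment`, `PartitionIntoSequences`) are hypothetical, not the actual IGraphPartitioner interface:

```cpp
#include <map>
#include <vector>

// Hypothetical input record: a node id paired with the logic stream it
// has been assigned to. The caller supplies nodes in topological order.
struct NodeAssignment {
  int node_id;
  int stream_id;
};

// Bucket nodes into one ordered sequence per stream. Because the input
// is already topologically sorted, each per-stream sequence preserves
// the intra-stream execution order; ordering *across* sequences is left
// to the data-dependency notifications at runtime.
std::map<int, std::vector<int>> PartitionIntoSequences(
    const std::vector<NodeAssignment>& topo_order) {
  std::map<int, std::vector<int>> sequences;
  for (const auto& n : topo_order) {
    sequences[n.stream_id].push_back(n.node_id);
  }
  return sequences;
}
```

A real partitioner would also have to insert the wait/notify nodes at every edge that crosses a sequence boundary, which is where most of the planner complexity lives.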

WaitNotificationFn wait_fn);
// For any chunk associated with the target stream, reset it to the default state (nullptr stream, timestamp 0).
// Perform coalescing if coalesce_flag is true.
void ResetChunkOnTargetStream(Stream* target_stream, bool coalesce_flag);
Contributor


Can this be conditional on ENABLE_STREAM?
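The reset semantics in the snippet above — each free chunk remembers which stream last touched it plus a timestamp, and a reset returns that stream's chunks to the default state so any stream may reuse them — can be sketched as follows. `Chunk`, `ResetChunksOnStream`, and `CountDefault` are illustrative names, not the BFCArena internals:

```cpp
#include <cstddef>
#include <vector>

// Illustrative free-list entry: a stream-aware arena tags each chunk
// with the stream that last used it and a synchronization timestamp,
// so a chunk is only handed to a *different* stream after the proper
// cross-stream wait. nullptr / 0 means "default": safe for any stream.
struct Chunk {
  const void* stream = nullptr;
  std::size_t timestamp = 0;
};

// Reset every chunk associated with the target stream back to the
// default state (a real arena would also coalesce neighbors here when
// the coalesce flag is set).
void ResetChunksOnStream(std::vector<Chunk>& chunks, const void* target) {
  for (auto& c : chunks) {
    if (c.stream == target) {
      c.stream = nullptr;
      c.timestamp = 0;
    }
  }
}

// Helper for inspection: how many chunks are currently unrestricted.
int CountDefault(const std::vector<Chunk>& chunks) {
  int n = 0;
  for (const auto& c : chunks) {
    if (c.stream == nullptr && c.timestamp == 0) ++n;
  }
  return n;
}
```

This tagging is exactly what the plain BFCArena lacks: without it, a chunk freed by an in-flight async kernel on one stream could be immediately recycled by another stream.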

ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(BFCArena);
};

class StreamAwareArena : public BFCArena {
Contributor


Can this be conditional on ENABLE_STREAM?

@@ -0,0 +1,48 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
Contributor


Can the full contents of this file and the .cc be conditional on ENABLE_STREAM?

It seems most of the usage is ifdef'd on that, but an accidental usage somewhere would not break the build.

Member Author


We could, but I need to scan the code to see whether any function signatures need to be refactored.

@souptc
Member Author

souptc commented Dec 14, 2022

Synced offline with Scott; @LEI cao will follow up on the open comments in the next PR.

@souptc souptc merged commit a81faee into main Dec 15, 2022
@souptc souptc deleted the chenta/multi-stream branch December 15, 2022 15:39
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
jslhcl added a commit that referenced this pull request Jan 4, 2023
### Description
This PR is to address follow-up comments for the multi-stream pr
#13495

Changes including:

- Make StreamAwareArena transparent to minimal build
- Make DeviceStreamCollection transparent to minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove unnecessary shared_ptr


### Motivation and Context
This PR is to address follow-up comments for the multi-stream pr
#13495

Co-authored-by: Lei Cao <[email protected]>
@pengwa pengwa mentioned this pull request Feb 15, 2023
pengwa added a commit that referenced this pull request Feb 23, 2023
### Fix memory profiler

A follow up fix for PR
#13495

In ORTModule training, `PartialExecuteThePlan` is called twice; we need to create the log event after the backward graph run completes, to collect the activation info for the whole training graph.

Also change some log statements to verbose level, to avoid emitting too many logs at levels above verbose.

jywu-msft pushed a commit that referenced this pull request Mar 29, 2023
### Description
**Multi-stream** execution support for **CANN EP**.

### Motivation and Context
**CANN EP** is currently **unavailable** due to the introduction of a
new mechanism for multi-stream execution
[#13495](#13495), the
deletion of the Fence-based synchronization mechanism, and the failure
to update the relevant logic of **CANN EP** synchronously.

This PR is to fix it.
tianleiwu added a commit that referenced this pull request Jun 15, 2023
Fix two issues related to cuda graph capture:
#14942 and
#15002

Issue 1: Previously, graph capture started at the second run. However, memory pattern optimization will allocate memory during the second run, and cudaMalloc is not allowed during graph capture. In this PR, graph capture starts after two runs to avoid the issue.

Issue 2: #13495 introduced multi-stream support, but stream cleanup calls cudaStreamSynchronize, which is not allowed during cuda graph capture. In this PR, we move stream cleanup to after cuda graph capture.

Update the SqueezeNet test model with a dynamic axis so that we can test with a larger batch size. Add a test that reproduces the bug (when changing the minimum runs from 2 back to 1).