Skip to content

Comments

Multi-stream executor#12227

Closed
souptc wants to merge 260 commits intomainfrom
PE
Closed

Multi-stream executor#12227
souptc wants to merge 260 commits intomainfrom
PE

Conversation

@souptc
Copy link
Member

@souptc souptc commented Jul 19, 2022

Description: This PR including following works:

  1. provide stream and related synchronization abstractions in onnxruntime.
  2. enhance onnxruntime's execution planner / executor / memory arena to support execute multiple streams in parallel.
  3. deprecate the parallel executor for cpu.
  4. deprecate the Fence mechanism.
  5. update the cuda / tensorrt EP to support the stream mechanism, support running different request in different cuda stream.

Motivation and Context

  • Why is this change required?
    currently, the execution plan is just a linear list of those primitives, ort will execute them step by step. For any given graph, ORT will serialize it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:
  1. it is difficult to enable inter-node parallelization, we have a half-baked parallel executor but it is very difficult to make it work with GPU.
  2. The fence mechanism can work with single gpu stream + cpu thread case, but when extend to multiple stream, it is difficult to manage the cross GPU stream synchronizations.
  3. our cuda EP rely on the BFCArena to make the memory management work with the GPU async kernels, but current BFCArena is not aware of the streams, so it doesn't behavior correctly when run with multiple streams.

This PR enhance our existing execution plan and executor to support multiple stream execution. we use an unified algorithm to mange both single stream and multiple stream scenarios.
This PR mainly focus on the infrastructure support for multiple stream execution, that is said, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be in the future PR.

chentaMS and others added 30 commits May 5, 2022 11:18
### Description
The current compare function to decide whether two elements in
std::set<OrtDevice> in SessionState::ResolveMemoryPatternFlag() is not
right, which caused mem_pattern_ different behavior between main and pe
branch



### Motivation and Context
This change is to make mem_pattern_ the same behavior between main and
pe branch

Co-authored-by: Lei Cao <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
@lgtm-com
Copy link

lgtm-com bot commented Oct 15, 2022

This pull request introduces 18 alerts and fixes 6 when merging 709f424 into 1ab11a1 - view on LGTM.com

new alerts:

  • 17 for Commented-out code
  • 1 for No trivial switch statements

fixed alerts:

  • 6 for Commented-out code

…ame (#13334)

replace vector with InlinedVector in ExecutionContext and ExecutionFrame

### Description
replace vector with InlinedVector in ExecutionContext and ExecutionFrame



### Motivation and Context
replace vector with InlinedVector in ExecutionContext and ExecutionFrame

Co-authored-by: Lei Cao <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Oct 18, 2022

This pull request introduces 18 alerts and fixes 6 when merging 94c163b into e398241 - view on LGTM.com

new alerts:

  • 17 for Commented-out code
  • 1 for No trivial switch statements

fixed alerts:

  • 6 for Commented-out code

@jslhcl
Copy link
Contributor

jslhcl commented Oct 21, 2022

    auto& node_device_mem_location = ep->GetAllocator(ep->GetDeviceId(), OrtMemType::OrtMemTypeDefault)->Info();

There is no need to use deviceId as parameter, we can get device id in the implementation of GetAllocator()


Refers to: onnxruntime/core/framework/allocation_planner.cc:1899 in 94c163b. [](commit_id = 94c163b, deletion_comment = False)

jslhcl and others added 2 commits October 21, 2022 14:24
### Description
use cudaStreamNonBlocking for perf improvement



### Motivation and Context
use cudaStreamNonBlocking for perf improvement

Co-authored-by: Lei Cao <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
Copy link
Contributor

@pranavsharma pranavsharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reviewing. In the meanwhile, can you please resolve the comments added by the bots as it's becoming quite inconvenient to review?

static const char* const kOrtSessionOptionsConfigStrictShapeTypeInference = "session.strict_shape_type_inference";

// The file saves configuration for partitioning node among logic streams
static const char* const kNodePartitionConfigFile = "session.node_partition_config_file";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required for this PR? If not, can we please separate these as it's experimental?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is not strongly required, but we need this to trigger the unit tests for the multi-stream in single graph feature. let me fix other comments first then go back to this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will keep this comments in the new PR: #13495

constexpr size_t kAllocAlignment = 256;

class IAllocator;
std::function<void*(size_t)> GetAllocationFn(std::shared_ptr<IAllocator> allocator, bool use_reserve, Stream* stream, WaitNotificationFn wait_fn);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the allocator aware of the stream and notification?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is mainly used for GPU EP's GetScratch buffer, which need to aware which stream the scratch buffer is assigned to. it is a helper function to wrapper the stream aware allocation function, it won't impact other ep's usage.

@lgtm-com
Copy link

lgtm-com bot commented Oct 22, 2022

This pull request introduces 18 alerts and fixes 6 when merging 72c3e6b into 928c988 - view on LGTM.com

new alerts:

  • 17 for Commented-out code
  • 1 for No trivial switch statements

fixed alerts:

  • 6 for Commented-out code

Copy link
Contributor

@skottmckay skottmckay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments. Have mainly looked at headers in the core code so far and still have a few of those to go before looking at implementation details.

Copy link
Contributor

@pranavsharma pranavsharma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can you add unit tests for all the new constructs?
  2. What's the code coverage?
  3. Do we plan to exercise all the use cases this enables in CI?
  4. How do we plan to measure correctness? I don't see tests that do this.
  5. Debuggability: can you check all places where adding more debug statements will help?

* CountDownBarrier is only for scenario that the v is set
* to the # of consumers and each consumer calls Dec() exactly once.
*/
class CountDownBarrier {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Would waitable counter from nsync work instead of reinventing the wheel? https://github.com/google/nsync/blob/master/public/nsync_counter.h
  2. If we've to write this, would be nice to keep it in a general utility folder/header file. core/util

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes nsync counter will work, but nsync only available on non-Windows platforms, right? This is a relatively simple implementation, so i'd prefer just to implement it instead of use nsync on linux but another implementation on windows.
I am ok to keep it in separate file if we believe it is useful, once we confirm there is no other choice, i can work on separate it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's continue the discussion in the new PR: #13495

@lgtm-com
Copy link

lgtm-com bot commented Oct 28, 2022

This pull request introduces 1 alert and fixes 6 when merging 9873a3e into 8b0669b - view on LGTM.com

new alerts:

  • 1 for Commented-out code

fixed alerts:

  • 6 for Commented-out code

@souptc
Copy link
Member Author

souptc commented Oct 28, 2022

Since there are too many commits in this PR, we perform a squash and open a new PR at #13495
Close this one.

@souptc souptc closed this Oct 28, 2022
@jslhcl jslhcl deleted the PE branch February 15, 2023 22:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants