Conversation
```cpp
static const char* const kOrtSessionOptionsConfigStrictShapeTypeInference = "session.strict_shape_type_inference";

// The file saves configuration for partitioning nodes among logic streams
static const char* const kNodePartitionConfigFile = "session.node_partition_config_file";
```
from Pranav: Is this required for this PR? If not, can we please separate these, as it's experimental?

Reply: it is not strongly required, but we need this to trigger the unit tests for the multi-stream-in-a-single-graph feature. Let me fix the other comments first, then go back to this one.
This PR is squashed from previous PR #12227. Below are the unresolved comments from Pranav:
```cpp
  return stream->cublas_handle_;
}

inline onnxruntime::Stream* OrtStream(OpKernelContext* ctx) const {
```
My 2 cents: we don't even need this function. Any place that invokes OrtStream(context) can simply be replaced with context->GetComputeStream(). It is the base stream behavior and has nothing to do with the specific cudaStream. If we keep it, we will have to duplicate this function in rocm_kernel as well.
This pull request introduces 1 alert and fixes 6 when merging 53ac1d6 into 689e524 - view on LGTM.com.
General comment: one of the big problems with the extensive std::function usage is that it allocates memory dynamically, which is what our real-time customers are struggling with on CPU, and the changes affect generic code. This PR is undoing a lot of improvements in reducing the number of allocations and latency variance.

The stream is passed in through OpKernelContext; it is not saved in the kernel instance. Where did you see the cuda_stream_?

Synced offline. The std::function introduced in the allocator / Tensor / Buffer deleters is the main concern; will resolve this in the PR.
Resolved review threads on:
- onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_longformer_attention.cc (outdated)
- onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_attention.cc
- onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_matmul.cc
@yuslepukhin, I have updated the implementation to avoid the std::function usage. Please review again to see whether anything still has dynamic-allocation concerns.
This pull request introduces 1 alert and fixes 6 when merging e5ffbe0 into b5904c4 - view on LGTM.com.
1. Rename the partitioner; 2. Let the ::partition function return a status; 3. A few other minor fixes. Co-authored-by: Randy Shuai <[email protected]>
This pull request introduces 1 alert and fixes 6 when merging bea5c1d into b5904c4 - view on LGTM.com.
### Description
Use a smart pointer for CpuBuffersInfo for the CUDA and ROCm EPs.

### Motivation and Context
Use a smart pointer for CpuBuffersInfo for the CUDA and ROCm EPs.

Co-authored-by: Lei Cao <[email protected]>
```cpp
// forward declaration
class SessionState;
class IAllocator;

void* AllocateBufferWithOptions(std::shared_ptr<IAllocator>& allocator, size_t size, bool use_reserve, Stream* stream, WaitNotificationFn wait_fn);
```
nit: it doesn't look like the implementation of this cares whether the IAllocator is in a shared_ptr or not, so it could take IAllocator& to be more flexible.
```cpp
virtual Status Compute(_Inout_ OpKernelContext* context) const ORT_MUST_USE_RESULT = 0;

[[nodiscard]] virtual bool IsAsync() const {
```
nit: is [[nodiscard]] a replacement for the ORT_MUST_USE_RESULT we have on other methods here? It would be nice to be consistent.
```cpp
// Given a graph with node placement information, partition the nodes into multiple sequences.
// Each sequence can be executed independently. The nodes in each sequence are executed in order,
// but we can't assume any execution order between sequences, unless there is a data dependency.
class IGraphPartitioner {
```
@pranavsharma, is ISequencePartitioner OK, or do you have a better name?
```cpp
                    WaitNotificationFn wait_fn);

// For any chunk that is associated with the target stream, reset it to default (nullptr in stream, timestamp 0).
// Perform coalesce if coalesce_flag is true.
void ResetChunkOnTargetStream(Stream* target_stream, bool coalesce_flag);
```
Can this be conditional on ENABLE_STREAM?
```cpp
  ORT_DISALLOW_COPY_ASSIGNMENT_AND_MOVE(BFCArena);
};

class StreamAwareArena : public BFCArena {
```
Can this be conditional on ENABLE_STREAM?
```cpp
// Copyright (c) Microsoft Corporation. All rights reserved.
```
Can the full contents of this and the .cc be conditional on ENABLE_STREAM? Most of the usage seems to be ifdef'd on that, but an accidental usage somewhere would not break the build.

Reply: we could, but I need to scan the code to see whether any function signatures need to be refactored.
Synced offline with Scott; @LEI cao will follow up on the open comments in the next PR.
Force-pushed a8d4a6b to b6b9d6c.
…ch_impl_gpt.h greedy_search_impl_gpt.h
**Description**: This PR includes the following work:
1. Provide stream and related synchronization abstractions in onnxruntime.
2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel.
3. Deprecate the parallel executor for CPU.
4. Deprecate the Fence mechanism.
5. Update the CUDA / TensorRT EPs to support the stream mechanism, supporting running different requests in different CUDA streams.

**Motivation and Context**
- Why is this change required? Currently, the execution plan is just a linear list of those primitives, and ORT executes them step by step. For any given graph, ORT serializes it to a fixed execution order. This sequential execution design simplifies most scenarios, but it has the following limitations:
  1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPUs.
  2. The fence mechanism works in the single-GPU-stream + CPU-thread case, but when extended to multiple streams, it is difficult to manage the cross-GPU-stream synchronizations.
  3. Our CUDA EP relies on the BFCArena to make memory management work with the GPU async kernels, but the current BFCArena is not aware of streams, so it does not behave correctly when run with multiple streams.

This PR enhances our existing execution plan and executor to support multi-stream execution. We use a unified algorithm to manage both the single-stream and multi-stream scenarios. This PR mainly focuses on the infrastructure support for multi-stream execution; that is to say, given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will come in a future PR.
Co-authored-by: Cheng Tang <[email protected]@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <[email protected]> Co-authored-by: RandySheriffH <[email protected]> Co-authored-by: Randy Shuai <[email protected]> Co-authored-by: cao lei <[email protected]> Co-authored-by: Lei Cao <[email protected]>
### Description
This PR addresses follow-up comments for the multi-stream PR #13495. Changes include:
- Make StreamAwareArena transparent to the minimal build
- Make DeviceStreamCollection transparent to the minimal build
- Replace ORT_MUST_USE_RESULT with [[nodiscard]]
- Remove an unnecessary shared_ptr

### Motivation and Context
This PR addresses follow-up comments for the multi-stream PR #13495.

Co-authored-by: Lei Cao <[email protected]>
### Fix memory profiler
A follow-up fix for PR #13495. In ORTModule training, `PartialExecuteThePlan` is called twice; we need to create the log event after the backward graph run completes to collect the whole training graph's activation info. Also change some log levels to verbose, to avoid too many logs at > verbose log level.
### Description
**Multi-stream** execution support for the **CANN EP**.

### Motivation and Context
The **CANN EP** is currently **unavailable** due to the introduction of a new mechanism for multi-stream execution (#13495), the deletion of the Fence-based synchronization mechanism, and the failure to update the relevant logic of the **CANN EP** synchronously. This PR fixes that.
Fix two issues related to CUDA graph capture: #14942 and #15002.

Issue 1: Previously, graph capture started at the second run. However, memory-pattern optimization will allocate memory from the second run, and cudaMalloc is not allowed during graph capture. In this PR, graph capture starts after 2 runs to avoid the issue.

Issue 2: #13495 introduced multi-stream support, but stream cleanup calls cudaStreamSynchronize, which is not allowed during CUDA graph capture. In this PR, we move stream cleanup to after CUDA graph capture.

Also update the SqueezeNet test model with a dynamic axis so that we can test with a larger batch size, and add a test that reproduces the bug (when changing min runs from 2 back to 1).