
[Fix] CPU memory type need device_id=0 to get allocator #14050

Merged
PeixuanZuo merged 1 commit into main from peixuanzuo/fix_allocation
Dec 26, 2022

Conversation

@PeixuanZuo
Contributor

@PeixuanZuo PeixuanZuo commented Dec 22, 2022

Description

Fix an error:
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/allocation_planner.cc:819 onnxruntime::common::Status onnxruntime::PlannerImpl::ComputeValueLocation() allocator was false.

This error happens when we run Hugging Face models with DDP on multiple GPUs. A process with rank > 0 attempts to obtain a CPU memory allocator with device_id > 0, which causes the error. The workaround checks whether a node's output is on the CPU; if it is, we set device_id = 0.
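The device-selection logic behind the workaround can be sketched as follows. This is a minimal, hypothetical Python illustration (the actual fix lives in the onnxruntime C++ planner; the function name and constants here are illustrative, not ORT's API): CPU memory is not tied to a particular GPU rank, so CPU allocator lookups must always use device_id 0.

```python
CPU = "CPU"
CUDA = "CUDA"

def allocator_device_id(mem_type: str, rank: int) -> int:
    """Pick the device_id to use when looking up an allocator.

    CPU allocators are registered only for device_id=0, so CPU memory
    ignores the process rank; GPU allocators are per-device.
    """
    if mem_type == CPU:
        return 0
    return rank

# Under DDP, rank 1 would previously request a CPU allocator with
# device_id=1, which does not exist and triggered the RUNTIME_EXCEPTION.
print(allocator_device_id(CPU, 1))   # CPU output on rank 1 -> device 0
print(allocator_device_id(CUDA, 1))  # GPU output on rank 1 -> device 1
```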

@PeixuanZuo PeixuanZuo requested review from pengwa and souptc December 22, 2022 06:53
@pengwa pengwa requested review from Lafi7e and askhade December 22, 2022 07:11
@PeixuanZuo PeixuanZuo requested a review from zhijxu-MS December 23, 2022 06:51
@PeixuanZuo PeixuanZuo merged commit 770fb96 into main Dec 26, 2022
@PeixuanZuo PeixuanZuo deleted the peixuanzuo/fix_allocation branch December 26, 2022 05:58
simon-moo pushed a commit to simon-moo/onnxruntime that referenced this pull request Dec 26, 2022
pengwa added a commit that referenced this pull request May 9, 2023
### Add CPU allocation test for non-CPU devices distributed run

When the CUDA EP is enabled in distributed training, CPU memory is still used for some node outputs. We previously had distributed-run test coverage, but it did not cover the case where some nodes use CPU devices for storing tensor output. As a result, we hit this regression twice in recent months:
- #14050
- #15823

This test is added to avoid future regressions.

The test graph looks like this:


![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png)
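The invariant such a regression test guards can be sketched in a few lines. This is a hypothetical, self-contained Python model of the planner's allocator lookup (the registry and function names are illustrative, not onnxruntime's real API): every CPU-resident output planned on a nonzero rank must still resolve to the single registered CPU allocator rather than failing the `allocator was false` check.

```python
# Toy allocator registry keyed by (memory type, device_id), standing in
# for the allocators an InferenceSession registers per execution provider.
registry = {
    ("CPU", 0): "cpu_allocator",
    ("CUDA", 0): "cuda_allocator_0",
    ("CUDA", 1): "cuda_allocator_1",
}

def get_allocator(mem_type: str, device_id: int) -> str:
    # The fix under test: CPU memory ignores the rank's device id.
    if mem_type == "CPU":
        device_id = 0
    alloc = registry.get((mem_type, device_id))
    if alloc is None:
        # Mirrors the planner's ComputeValueLocation() failure path.
        raise RuntimeError("allocator was false")
    return alloc

# Rank 1 plans a node whose output stays on CPU: must not raise.
print(get_allocator("CPU", 1))
```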



prathikr pushed a commit that referenced this pull request May 16, 2023