Add CPU allocation test for multiple GPU distributed run #15829

Merged
pengwa merged 6 commits into main from pengwa/add_ut
May 9, 2023

pengwa commented May 6, 2023

Add CPU allocation test for non-CPU devices distributed run

When the CUDA EP is enabled in distributed training, CPU memory is still used for some node outputs. We already had test coverage for distributed runs, but it did not cover the case where some nodes use CPU devices to store tensor outputs. As a result, I recall we hit regressions twice in the past few months:

- #14050
- #15823

This PR adds the test to guard against future regressions.

The test graph looks like this:

image

pengwa added the training label (issues related to ONNX Runtime training; typically submitted using template) May 6, 2023

pengwa commented May 9, 2023

Thanks @baijumeswani ! :)

pengwa merged commit 003c7d3 into main May 9, 2023
pengwa deleted the pengwa/add_ut branch May 9, 2023 02:27
prathikr pushed a commit that referenced this pull request May 16, 2023
### Add CPU allocation test for non-CPU devices distributed run

When the CUDA EP is enabled in distributed training, CPU memory is still
used for some node outputs. We already had test coverage for distributed
runs, but it did not cover the case where some nodes use CPU devices to
store tensor outputs. As a result, I recall we hit regressions twice in
the past few months:
- #14050
- #15823

This PR adds the test to guard against future regressions.

The test graph looks like this:


![image](https://user-images.githubusercontent.com/10530022/236594940-70c68a55-18bf-4e09-bbf5-8a64895d3045.png)


