Skip to content

Conversation

@atalman
Copy link
Contributor

@atalman atalman commented Aug 20, 2024

…on (#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. #131070 reports an issue where the combination of deepspeed and onnxruntime-training causes something in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in #131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: D59968362
Pull Request resolved: #131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalmanFixes #ISSUE_NUMBER

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang

masnesral and others added 2 commits August 20, 2024 13:07
…on (pytorch#131194)

Summary: Rather then using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. pytorch#131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.

Test Plan: I was able to repro the MemoryError in pytorch#131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: pytorch#131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
@pytorch-bot
Copy link

pytorch-bot bot commented Aug 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134042

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 70 Unrelated Failures

As of commit 712d8c6 with merge base b66e3f0 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copy link
Contributor

@masnesral masnesral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 20, 2024
@atalman atalman merged commit 18736d2 into pytorch:release/2.4 Aug 21, 2024
@atalman atalman deleted the cherry_pick_inductor_parallel branch August 21, 2024 11:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/inductor ciflow/trunk Trigger trunk jobs on your pull request module: inductor

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants