[c10d] Enable CudaEventCache by default and add multi device support #140975
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140975
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure) As of commit 88fa03c with merge base d0fd42e. FLAKY - one job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
eqy left a comment:
Failures look real; perhaps there's an issue with the cache being shared across streams/devices?
@eqy I cannot repro it locally...
kwen2501 left a comment:
Approving.
As discussed, a fix for the CI failure would be to create one cache per device index seen, since DDP still supports multi-device modules (though we may start deprecating that).
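Concretely, the per-device split might look something like the sketch below. This is a hypothetical illustration rather than the actual c10d code: `cacheForDevice` and its lazily grown vector are names invented for this example, and the cache body itself is elided.

```cpp
// Hypothetical sketch: one event cache per CUDA device index, created
// lazily on first use, so an event created on one device is never
// handed back out for reuse on another device.
#include <memory>
#include <mutex>
#include <vector>

struct CudaEventCache {
  // Recycles CUDA events for a single device; body elided in this sketch.
};

// Assumes deviceIdx >= 0. Caches for device indices that are never
// seen are never allocated.
std::shared_ptr<CudaEventCache> cacheForDevice(int deviceIdx) {
  static std::mutex mutex;
  static std::vector<std::shared_ptr<CudaEventCache>> caches;
  std::lock_guard<std::mutex> lock(mutex);
  if (deviceIdx >= static_cast<int>(caches.size())) {
    caches.resize(deviceIdx + 1);
  }
  if (!caches[deviceIdx]) {
    caches[deviceIdx] = std::make_shared<CudaEventCache>();
  }
  return caches[deviceIdx];
}
```

Lazy creation matters here: a single-device process pays for exactly one cache, while a multi-device DDP module gets one cache per device it actually touches.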
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased |
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#140975. Approved by: https://github.com/eqy, https://github.com/kwen2501
Stack from ghstack (oldest at bottom):
We added `CudaEventCache` in #133727. This feature tries to reuse CUDA events so that we don't call destroy on them, which has caused hangs in the past (a minimal sketch of the reuse pattern follows below). We already ran a bunch of tests, plus testing on TorchTitan and internal workloads, and so far no errors or crashes have been found, so we decided to roll this out to all OSS users. Internal workloads are not affected by this PR because of internal gating.
Also, we observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o
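For readers landing on this PR, here is a minimal sketch of the reuse pattern the description refers to, assuming a free list plus a `shared_ptr` custom deleter that recycles each event instead of destroying it. The names and structure are illustrative, not the exact `CudaEventCache` implementation from #133727.

```cpp
// Sketch: recycle CUDA events through a free list so cudaEventDestroy
// is never called on the hot path (destroying in-flight events has
// caused hangs in the past).
#include <cuda_runtime.h>
#include <memory>
#include <mutex>
#include <vector>

class EventCache : public std::enable_shared_from_this<EventCache> {
 public:
  // Hands out a fresh or recycled event. The shared_ptr's deleter
  // pushes the event back onto the free list instead of destroying it.
  std::shared_ptr<cudaEvent_t> create() {
    cudaEvent_t ev = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!free_.empty()) {
        ev = free_.back();
        free_.pop_back();
      }
    }
    if (ev == nullptr) {
      // Only allocate when the free list is empty.
      cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
    }
    auto self = shared_from_this(); // keep the cache alive in the deleter
    return std::shared_ptr<cudaEvent_t>(
        new cudaEvent_t(ev), [self](cudaEvent_t* p) {
          std::lock_guard<std::mutex> lock(self->mutex_);
          self->free_.push_back(*p); // recycle, never cudaEventDestroy
          delete p;
        });
  }

 private:
  std::mutex mutex_;
  std::vector<cudaEvent_t> free_;
};

// Usage (the cache must be owned by a shared_ptr for shared_from_this()
// to work, and the right device must be current when recording):
//   auto cache = std::make_shared<EventCache>();
//   auto ev = cache->create();
//   cudaEventRecord(*ev, stream);
//   ev.reset(); // event returns to the free list, not destroyed
```

Combined with the per-device split discussed in the review above, each free list stays device-local, which avoids handing a device-0 event to work running on device 1.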