[c10d] Land CudaEventCache with roll out flags #133727
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133727
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 47c8e05 with merge base e1b9b89. This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @eqy: any potential issues with long-term reuse of CUDA events? One motivation for this PR is to avoid the case where ~CudaEvent causes a hang.
zdevito added a cache for CudaEvent in #122732, and this PR productionizes it behind a rollout flag. cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
kwen2501 left a comment:
LGTM. Added two minor comments.
    const std::optional<std::vector<at::Tensor>>& inputs,
    bool desyncDebug,
    bool enableTiming,
    bool cudaEventCacheEnabled,
nit: maybe we can make WorkNCCL a friend of ProcessGroupNCCL so that it can access ProcessGroupNCCL::cudaEventCacheEnabled_, and we don't have to pass every flag to the constructor of WorkNCCL?
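For illustration, a minimal sketch of what this suggestion could look like; the class layout and member placement below are assumptions for the sketch, not the actual ProcessGroupNCCL.hpp definitions:

```cpp
// Illustrative only: let WorkNCCL read the process group's private flag
// directly instead of threading it through the constructor.
class ProcessGroupNCCL {
  // Grant WorkNCCL access to private members such as cudaEventCacheEnabled_.
  friend class WorkNCCL;

 private:
  bool cudaEventCacheEnabled_ = false;  // populated from the rollout flag
};

class WorkNCCL {
 public:
  explicit WorkNCCL(const ProcessGroupNCCL& pg)
      : useEventCache_(pg.cudaEventCacheEnabled_) {}  // OK: WorkNCCL is a friend

 private:
  bool useEventCache_;
};
```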
std::shared_ptr<at::cuda::CUDAEvent> ProcessGroupNCCL::CUDAEventCache::create(
    bool timing) {
  auto deleter = [this, timing](at::cuda::CUDAEvent* event) {
    std::lock_guard<std::mutex> lock(this->cacheMutex_);
    this->eventsArray_[timing ? 1 : 0].push_back(event);
  };
  at::cuda::CUDAEvent* event = nullptr;
  {
    std::lock_guard<std::mutex> lock(cacheMutex_);
    auto& events = eventsArray_[timing ? 1 : 0];
    if (!events.empty()) {
      event = events.back();
      events.pop_back();
    }
  }
  if (!event) {
    event = new at::cuda::CUDAEvent(
        timing ? cudaEventDefault : cudaEventDisableTiming);
  }
  return std::shared_ptr<at::cuda::CUDAEvent>(event, std::move(deleter));
}

ProcessGroupNCCL::CUDAEventCache& ProcessGroupNCCL::CUDAEventCache::get() {
  static ProcessGroupNCCL::CUDAEventCache cache;
  return cache;
}
nit: can you add some comments for this block of code? Thanks!
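For reference, a minimal usage sketch of the cache quoted above: only CUDAEventCache::get() and create() come from the diff; the wrapping function, header paths, and stream handling are illustrative assumptions, not the actual WorkNCCL code.

```cpp
// Illustrative sketch: borrow an event from the singleton cache and let the
// shared_ptr's custom deleter return it instead of destroying the CUDA event.
#include <memory>

#include <ATen/cuda/CUDAEvent.h>
#include <torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp>

std::shared_ptr<at::cuda::CUDAEvent> recordStartEvent(bool enableTiming) {
  // Reuse a cached event if available, otherwise a new one is allocated.
  auto event =
      c10d::ProcessGroupNCCL::CUDAEventCache::get().create(enableTiming);

  // Record on the current CUDA stream, as a work object would before a collective.
  event->record();

  // When the last shared_ptr copy is released, the deleter pushes the raw
  // pointer back into eventsArray_, so the event is never destroyed on the
  // hot path (avoiding the ~CudaEvent hang mentioned above).
  return event;
}
```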
We added `CudaEventCache` in #133727, a feature that reuses CUDA events so that we no longer destroy CudaEvents, which has caused hangs in the past. It has already been exercised by a number of tests and by runs on TorchTitan and internal workloads; no errors or crashes have been found so far, so we have decided to roll it out to all OSS users. Internal workloads are not affected by this PR because of internal gating. cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o
…ce support" We also observed some multi-device use cases in OSS, so we want to bring back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files.
…140975) This lands the rollout described above and brings back the multi-device support originally proposed in https://github.com/pytorch/pytorch/pull/122732/files. Pull Request resolved: #140975 Approved by: https://github.com/eqy, https://github.com/kwen2501
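As a rough illustration of what per-device caching could look like (an assumption about the multi-device direction, not code from #122732 or this PR), one option is to bucket cached events by device index so events recorded on different devices are never mixed:

```cpp
// Hypothetical sketch of a per-device event cache; the class, members, and
// kMaxDevices bound are illustrative assumptions.
#include <array>
#include <memory>
#include <mutex>
#include <vector>

#include <ATen/cuda/CUDAEvent.h>
#include <cuda_runtime_api.h>

class PerDeviceCUDAEventCache {
 public:
  static PerDeviceCUDAEventCache& get() {
    static PerDeviceCUDAEventCache cache;
    return cache;
  }

  // Borrow an event for `deviceIdx` (assumed < kMaxDevices); the deleter
  // returns it to the same per-device bucket.
  std::shared_ptr<at::cuda::CUDAEvent> create(bool timing, int deviceIdx) {
    auto deleter = [this, timing, deviceIdx](at::cuda::CUDAEvent* event) {
      std::lock_guard<std::mutex> lock(mutex_);
      buckets_[deviceIdx][timing ? 1 : 0].push_back(event);
    };
    at::cuda::CUDAEvent* event = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto& events = buckets_[deviceIdx][timing ? 1 : 0];
      if (!events.empty()) {
        event = events.back();
        events.pop_back();
      }
    }
    if (!event) {
      event = new at::cuda::CUDAEvent(
          timing ? cudaEventDefault : cudaEventDisableTiming);
    }
    return std::shared_ptr<at::cuda::CUDAEvent>(event, std::move(deleter));
  }

 private:
  static constexpr int kMaxDevices = 16;  // assumed upper bound for the sketch
  std::mutex mutex_;
  // buckets_[device][0] = non-timing events, buckets_[device][1] = timing events.
  std::array<std::array<std::vector<at::cuda::CUDAEvent*>, 2>, kMaxDevices>
      buckets_;
};
```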
Stack from ghstack (oldest at bottom):
@zdevito added a cache for CudaEvent in #122732, and this PR productionizes it behind a rollout flag.
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o @xmfan
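A minimal sketch of how such a rollout flag could be read, assuming an environment variable controls it; the variable name TORCH_NCCL_CUDA_EVENT_CACHE, the default, and the helper shown here are assumptions, not necessarily the PR's exact implementation:

```cpp
// Hypothetical flag-gated rollout: read an env var once and use it to decide
// whether the event cache is enabled.
#include <cstdlib>
#include <string>

static bool cudaEventCacheEnabledFromEnv() {
  const char* val = std::getenv("TORCH_NCCL_CUDA_EVENT_CACHE");
  if (val == nullptr) {
    return false;  // assumed default: off until rolled out more broadly
  }
  const std::string s(val);
  return s == "1" || s == "true" || s == "TRUE" || s == "y" || s == "Y";
}

// A caller could then pick between the cache and a freshly allocated event:
//   if (cudaEventCacheEnabledFromEnv()) {
//     event = CUDAEventCache::get().create(enableTiming);
//   } else {
//     event = std::make_shared<at::cuda::CUDAEvent>(
//         enableTiming ? cudaEventDefault : cudaEventDisableTiming);
//   }
```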