[c10d] Refactor CUDAEventCache Create to use deque rather than stack #138048

fduwjj · 2024-10-16T02:06:33Z

Stack from ghstack (oldest at bottom):

We used a LIFO stack to store the CudaEvent in the cache. We prefer a FIFO deque, so this change switches to a deque and improves the readability of the code along the way. As @wconstab pointed out, both approaches are equally correct: by the time an event is put into the stack/deque, it is already ready for reuse. This change is mostly a matter of preference and is not trying to fix anything.
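For illustration only, here is a minimal sketch of the FIFO idea described above. It is not the code in ProcessGroupNCCL.cpp: `FakeEvent`, `FakeEventCache`, and `cached()` are hypothetical stand-ins so the example compiles without CUDA, and the real cache may differ in details such as locking and event flags.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a CUDA event; the id only exists so that
// reuse order can be observed.
struct FakeEvent {
  int id;
};

// Hypothetical FIFO cache. create() hands out the oldest cached event,
// and the shared_ptr deleter returns the event to the tail of the deque.
// Assumption: the cache outlives every handle it has given out.
class FakeEventCache {
 public:
  std::shared_ptr<FakeEvent> create() {
    FakeEvent* event = nullptr;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (!events_.empty()) {
        event = events_.front();  // FIFO: reuse the oldest returned event
        events_.pop_front();
      } else {
        event = new FakeEvent{nextId_++};  // cache miss: allocate a new one
      }
    }
    // The deleter puts the event back into the cache instead of freeing it.
    return std::shared_ptr<FakeEvent>(event, [this](FakeEvent* e) {
      std::lock_guard<std::mutex> lock(mutex_);
      events_.push_back(e);
    });
  }

  // Number of events currently waiting for reuse.
  std::size_t cached() {
    std::lock_guard<std::mutex> lock(mutex_);
    return events_.size();
  }

  ~FakeEventCache() {
    for (FakeEvent* e : events_) {
      delete e;
    }
  }

 private:
  std::mutex mutex_;
  std::deque<FakeEvent*> events_;
  int nextId_ = 0;
};
```

In this sketch, if events 0 and 1 are released in that order, the next two `create()` calls return 0 and then 1. A LIFO stack would return 1 first. Either order is fine as long as an event is only put back once it is ready for reuse.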

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o

[ghstack-poisoned]

pytorch-bot · 2024-10-16T02:06:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138048

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c62c197 with merge base a77bb85:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 6ffc3c7 Pull Request resolved: #138048

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

kwen2501

LGTM.
nit: not related to this PR, but would be nice if Create can have more in-line comments.

fduwjj · 2024-10-16T14:37:39Z

@pytorchbot merge

fduwjj · 2024-10-16T14:38:16Z

@kwen2501 will add in a follow-up PR.

pytorchmergebot · 2024-10-16T14:39:17Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…138059) We created a simple test to validate that the cache is working and to check the behavior when the cache is used up. Reverting the fix in #138040 makes the test fail, as expected. Pull Request resolved: #138059 Approved by: https://github.com/kwen2501 ghstack dependencies: #138040, #138048

@kwen2501

Address @kwen2501's feedback in #138048 by adding more inline comments to the code. Pull Request resolved: #138079 Approved by: https://github.com/kwen2501 ghstack dependencies: #138040, #138048, #138059

Refactor CUDAEventCache Create

c62c197

[ghstack-poisoned]

fduwjj mentioned this pull request Oct 16, 2024

[c10d] Fix data corruption bug after CUDAEventCache is enabled #138040

Closed

pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Oct 16, 2024

fduwjj added a commit that referenced this pull request Oct 16, 2024

Refactor CUDAEventCache Create

357fa8e

ghstack-source-id: 6ffc3c7 Pull Request resolved: #138048

fduwjj requested review from kwen2501 and wconstab October 16, 2024 02:08

fduwjj changed the title ~~Refactor CUDAEventCache Create~~ [c10d] Refactor CUDAEventCache Create to use deque rather than stack Oct 16, 2024

fduwjj requested a review from shuqiangzhang October 16, 2024 02:09

wconstab reviewed Oct 16, 2024

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (resolved)

fduwjj added the ciflow/trunk label Oct 16, 2024

fduwjj requested a review from wconstab October 16, 2024 03:51

This was referenced Oct 16, 2024

[c10d] Add unit test for CUDAEventCache to ensure caching is working #138055

Merged

[c10d] Add unit test for CUDAEventCache to ensure caching is working #138059

Closed

kwen2501 approved these changes Oct 16, 2024

pytorchmergebot added the merging label Oct 16, 2024

pytorchmergebot added the Merged label Oct 16, 2024

pytorchmergebot closed this in 960c3bf Oct 16, 2024

pytorchmergebot removed the merging label Oct 16, 2024

fduwjj mentioned this pull request Oct 16, 2024

[c10d][ez] Add more inline comments to CUDAEventCache code #138079

Closed

github-actions bot deleted the gh/fduwjj/146/head branch November 16, 2024 02:08
