-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Revert "[c10d] disable compute_duration by default (#122138)" #122539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This reverts commit bf18e96. [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122539
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 72e98fd with merge base 29132c2 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
| compute_duration defaults to true since retire_id is only called in the | ||
| watchdog thread, which is currently a place we call cuda APIs which may hang, | ||
| but care should be taken to avoid computing duration in any function that must | ||
| never hang. (timing must also be enabled for compute_duration - see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is no longer correct as of this PR landing, but if you're fixing it in a later PR I'm happy to land this as a pure revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok looks like you aren't updating this later. better just update it here probably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe it is still correct, given the CI results of PRs later in the stack. I am going to land this as is and sort out what is happening in the CI for the later PRs.
|
@pytorchbot land |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…ytorch#122539) This reverts commit bf18e96. It is stacked after a fix to elapsed_time that will resolve the memory issues that required in the introduction of this flag. Pull Request resolved: pytorch#122539 Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang ghstack dependencies: pytorch#122538
Stack from ghstack (oldest at bottom):
This reverts commit bf18e96.
It is stacked after a fix to elapsed_time that will resolve the memory issues that required in the introduction of this flag.
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang