-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Fix rendezvous error due to EtcdStore get method not waiting in some cases #137056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Co-authored-by: tarat44 <[email protected]>
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137056
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit af29007 with merge base c07ebaf ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
XilunWu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR looks good to me except some suggestions on lint. Thanks for the fix!!
test/distributed/elastic/rendezvous/etcd_rendezvous_backend_test.py
Outdated
Show resolved
Hide resolved
@XilunWu Thank you! Apologies for missing the linter instructions - I ran the linter locally and fixed the issues it was complaining about. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…cases (pytorch#137056) Fixes pytorch#132950 This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch. This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue pytorch#132950). We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly. Co-authored-by: tarat44 <[email protected]> Pull Request resolved: pytorch#137056 Approved by: https://github.com/fduwjj Co-authored-by: tarat44 <[email protected]>
Fixes #132950
This fixes an issue in
torch/distributed/elastic/rendezvous/etcd_store.pywhere the get method does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the_try_wait_getmethod would error out immediately here if the prefix was not found instead of continuing to the etcd watch.This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).
We added a test demonstrating this issue and the fix. Without the fix the test fails with
etcd.EtcdKeyNotFound: Key not found : /torch/elastic/storeinstead of waiting for the first key to be written; with the fix the test waits properly.Co-authored-by: tarat44 [email protected]
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o