
Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Sep 13, 2024

Stack from ghstack (oldest at bottom):

  1. We want to take option 3 as discussed in TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts #135712, so every time we retry, we first create a new TCPStore server. That way we don't need to append the attempt count as a key prefix, which avoids eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
  2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port, and we then pass that port downstream (to the trainer or c10d). By doing so, the TCPStore is managed by the elastic agent rather than racing to bind a specific port in the trainer (see the sketch below).
  3. The port is then broadcast for dynamic_rendezvous.

One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
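For illustration only, here is a minimal sketch of what item 2 describes, assuming the Python `TCPStore` binding accepts port 0 and exposes the bound port via its `port` attribute; the host name and arguments are placeholders, not code from this PR:

```python
# Minimal sketch of item 2 (illustrative, not the PR's actual code): bind a
# TCPStore server to port 0 so the OS picks a free port, then read it back.
from torch.distributed import TCPStore

# The elastic agent owns the server; wait_for_workers=False keeps construction
# from blocking on clients. "localhost" and world_size=1 are placeholders.
server_store = TCPStore(
    "localhost",
    0,  # ephemeral port: the OS assigns a free one
    world_size=1,
    is_master=True,
    wait_for_workers=False,
)

# Hand the assigned port downstream (trainer / c10d) instead of racing to bind
# a fixed port there.
assigned_port = server_store.port
print(f"TCPStore server listening on port {assigned_port}")
```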

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D63396829

@pytorch-bot

pytorch-bot bot commented Sep 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135957

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 4 Unrelated Failures

As of commit 4eb521f with merge base a0c76ea:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Sep 13, 2024
ghstack-source-id: 4fde76c
Pull Request resolved: #135957
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels Sep 13, 2024
@fduwjj fduwjj requested a review from d4l3k September 13, 2024 05:10
@fduwjj
Contributor Author

fduwjj commented Sep 13, 2024

Sending out the PR as an RFC first; unit tests are on the way.

@kwen2501 kwen2501 changed the title from "[RFC][c10d] Fix store prefix race in rendezvous" to "[RFC][torchelastic] Fix store prefix race in rendezvous" Sep 15, 2024
1. We want to take option 3 as discussed in #135712, so every time we retry, we create a new TCPStore so that we don't need to append an attempt prefix, which avoids eventual TCPStore sync failures.
2. Upon checking the code, we create a new TCPStore in c10d_rendezvous_backend.py before we even hit the logic inside dynamic_rendezvous.py. Since we use `multi_tenant=True`, we don't create a new server in TCPStore.cpp if one is already cached for that port:
```cpp
if (opts.multiTenant) {
    std::lock_guard<std::mutex> guard{cache_mutex_};

    // If the caller is okay with a multi-tenant store, first check if we
    // already have a TCPServer running on the specified port.
    if (opts.port > 0) {
      auto pos = cachedServers_.find(opts.port);
      if (pos != cachedServers_.end()) {
        server = pos->second.lock();
        if (server != nullptr) {
          return server;
        }

        // Looks like the TCPStore has been disposed, make sure that we release
        // the control block.
        cachedServers_.erase(pos);
      }
    }

    server = startCore();

    cachedServers_.emplace(server->port(), server);
  }
```
So if we are sure the TCPStore server is destroyed every time we retry, we already recreate a TCPStore server; otherwise, we need more changes to the elastic code. (A Python-side sketch of this multi-tenant behavior follows the list below.)
3. This change will force broadcasting a new master address and port every time dynamic_rendezvous's `next_rendezvous` is called.
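To make the cached-server path above concrete, here is a hedged Python-side sketch of what `multi_tenant=True` is expected to do within a single process; the host, port, and arguments are placeholders, and this is not code from the PR:

```python
# Illustrative sketch (placeholder host/port, single process): with
# multi_tenant=True, a second master TCPStore on the same port should reuse
# the cached server from the C++ snippet above instead of rebinding.
from torch.distributed import TCPStore

first = TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    wait_for_workers=False, multi_tenant=True,
)
second = TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    wait_for_workers=False, multi_tenant=True,
)

# Both handles should talk to the same underlying server, so a key written
# through one is visible through the other.
first.set("probe", b"1")
print(second.get("probe"))  # expected: b'1'
```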

Send a PR first to collect feedback.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o 

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 16, 2024
ghstack-source-id: f095408
Pull Request resolved: #135957
@fduwjj fduwjj changed the title from "[RFC][torchelastic] Fix store prefix race in rendezvous" to "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous" Sep 16, 2024
@fduwjj fduwjj requested a review from kwen2501 September 16, 2024 16:20
fduwjj added a commit that referenced this pull request Sep 16, 2024
ghstack-source-id: ca4182c
Pull Request resolved: #135957
@fduwjj fduwjj requested a review from d4l3k September 16, 2024 23:03
Member

@d4l3k d4l3k left a comment


Looking good -- can we add some tests for this?
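For what it's worth, a hypothetical sketch of the kind of test this could get; it assumes the agent-side behavior of creating a fresh ephemeral-port TCPStore per attempt, and every name here is made up for illustration:

```python
# Hypothetical test sketch (not from this PR): each simulated restart creates
# a brand-new TCPStore server on an ephemeral port, so keys from a previous
# attempt must not be visible and no attempt-count prefix is needed.
from torch.distributed import TCPStore

def test_fresh_store_per_restart():
    ports = []
    for attempt in range(2):
        store = TCPStore("localhost", 0, world_size=1,
                         is_master=True, wait_for_workers=False)
        ports.append(store.port)
        # A key written during the previous attempt should not exist here,
        # because this is a new server, not a prefixed view of the old one.
        assert store.check(["stale_key"]) is False
        store.set("stale_key", b"value")
        del store  # shut the server down before the next attempt
    assert all(p > 0 for p in ports)

if __name__ == "__main__":
    test_fresh_store_per_restart()
    print("ok")
```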

fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: 19cdfd3
Pull Request resolved: #135957
fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: e9fd7aa
Pull Request resolved: #135957
@fduwjj
Contributor Author

fduwjj commented Sep 17, 2024

Looks like there are some existing test cases already; we just need to update them.

@fduwjj fduwjj requested a review from d4l3k September 17, 2024 05:55
@fduwjj fduwjj added the ciflow/trunk label Sep 17, 2024


1. We want to take option 3 as discussed in #135712, so every time we retry, we first create a new TCPStore server so that we don't need to append the attempt count as a key prefix, which avoids eventual TCPStore sync failures. (This is only for the TCPStore-sharing-enabled case.)
2. We reuse the port from the newly created TCPStore server: we first bind it to port 0 so that the OS allocates a free port for us.
3. The port is then broadcast for dynamic_rendezvous (a conceptual sketch of this follows below).

One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
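As a purely conceptual illustration of item 3 (not the PR's actual mechanism), "broadcasting" the new address and port amounts to the agent publishing them under well-known keys in the rendezvous store and the other participants reading them back; the key names and helpers below are hypothetical:

```python
# Conceptual sketch only: hypothetical key names and helpers, not PR code.
from torch.distributed import TCPStore

def publish_master_info(store, addr: str, port: int) -> None:
    # Agent side: record where the freshly created TCPStore server lives.
    store.set("MASTER_ADDR", addr)
    store.set("MASTER_PORT", str(port))

def read_master_info(store) -> tuple:
    # Worker side: discover the server the agent published.
    addr = store.get("MASTER_ADDR").decode()
    port = int(store.get("MASTER_PORT").decode())
    return addr, port

# Toy usage with a local store; host and port values are placeholders.
rdzv_store = TCPStore("localhost", 0, world_size=1,
                      is_master=True, wait_for_workers=False)
publish_master_info(rdzv_store, "localhost", 12345)
print(read_master_info(rdzv_store))  # ('localhost', 12345)
```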

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o 

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: c3957f0
Pull Request resolved: #135957
@kwen2501 kwen2501 changed the title from "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous" to "[torchelastic][c10d] Fix store prefix race in rendezvous" Sep 24, 2024
fduwjj added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 3ec1cf9
Pull Request resolved: #135957
@fduwjj
Contributor Author

fduwjj commented Sep 25, 2024

@fduwjj has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024
Pull Request resolved: pytorch#135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
@izaitsevfb
Contributor

@pytorchbot merge -f 'landed internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

@atalman
Contributor

atalman commented Sep 26, 2024

@pytorchbot merge -f 'landed internally. one more try'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

@atalman
Contributor

atalman commented Sep 26, 2024

Please note: this change was reverted internally. I suggest we close this PR and open a new one if we want to land this change.

@fduwjj
Contributor Author

fduwjj commented Sep 26, 2024

Recreated the PR in #136768.


Labels

ciflow/trunk, Merged, oncall: distributed, release notes: distributed (torchelastic), Reverted

Development

Successfully merging this pull request may close these issues.

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts

9 participants