
Conversation

@fduwjj
Contributor

@fduwjj fduwjj commented Sep 13, 2024

Stack from ghstack (oldest at bottom):

  1. We want to take option 3 as discussed in TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts #135712, so every time we retry, we first create a new TCPStore server. That way we don't need to append the attempt count as a key prefix, which avoids eventual TCPStore sync failures. (This applies only when TCPStore sharing is enabled.)
  2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port, and we then pass that port downstream (to the trainer or c10d). By doing so, the TCPStore is managed by the elastic agent rather than racing to bind a specific port in the trainer (see the sketch below).
  3. The port is then broadcast for dynamic_rendezvous.

One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
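For illustration only, here is a minimal sketch of what item 2 describes, assuming the Python `TCPStore` binding accepts port 0 and exposes the bound port via its `port` attribute; the host name and arguments are placeholders, not code from this PR:

```python
# Minimal sketch of item 2 (illustrative, not the PR's actual code): bind a
# TCPStore server to port 0 so the OS picks a free port, then read it back.
from torch.distributed import TCPStore

# The elastic agent owns the server; wait_for_workers=False keeps construction
# from blocking on clients. "localhost" and world_size=1 are placeholders.
server_store = TCPStore(
    "localhost",
    0,  # ephemeral port: the OS assigns a free one
    world_size=1,
    is_master=True,
    wait_for_workers=False,
)

# Hand the assigned port downstream (trainer / c10d) instead of racing to bind
# a fixed port there.
assigned_port = server_store.port
print(f"TCPStore server listening on port {assigned_port}")
```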

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @wz337 @wconstab @d4l3k @c-p-i-o

Differential Revision: D63396829

@pytorch-bot

pytorch-bot bot commented Sep 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135957

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 4 Unrelated Failures

As of commit 4eb521f with merge base a0c76ea:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

fduwjj added a commit that referenced this pull request Sep 13, 2024
ghstack-source-id: 4fde76c
Pull Request resolved: #135957
@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (torchelastic) labels Sep 13, 2024
@fduwjj fduwjj requested a review from d4l3k September 13, 2024 05:10
@fduwjj
Contributor Author

fduwjj commented Sep 13, 2024

Sending out the PR as an RFC first; unit tests are on the way.

@kwen2501 kwen2501 changed the title from "[RFC][c10d] Fix store prefix race in rendezvous" to "[RFC][torchelastic] Fix store prefix race in rendezvous" Sep 15, 2024
1. We want to take option 3 as discussed in #135712, so every time we retry, we create a new TCPStore so that we don't need to append an attempt prefix, which avoids eventual TCPStore sync failures.
2. Upon checking the code, we create a new TCPStore in c10d_rendezvous_backend.py before we even hit the logic inside dynamic_rendezvous.py. Since we use `multi_tenant=True`, we don't create a new server in TCPStore.cpp if one is already cached for that port:
```cpp
if (opts.multiTenant) {
    std::lock_guard<std::mutex> guard{cache_mutex_};

    // If the caller is okay with a multi-tenant store, first check if we
    // already have a TCPServer running on the specified port.
    if (opts.port > 0) {
      auto pos = cachedServers_.find(opts.port);
      if (pos != cachedServers_.end()) {
        server = pos->second.lock();
        if (server != nullptr) {
          return server;
        }

        // Looks like the TCPStore has been disposed, make sure that we release
        // the control block.
        cachedServers_.erase(pos);
      }
    }

    server = startCore();

    cachedServers_.emplace(server->port(), server);
  }
```
So if we are sure the TCPStore server is destroyed every time we retry, we already recreate a TCPStore server; otherwise, we need more changes to the elastic code. (A Python-side sketch of this multi-tenant behavior follows the list below.)
3. This change will force broadcasting a new master address and port every time dynamic_rendezvous's `next_rendezvous` is called.
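To make the cached-server path above concrete, here is a hedged Python-side sketch of what `multi_tenant=True` is expected to do within a single process; the host, port, and arguments are placeholders, and this is not code from the PR:

```python
# Illustrative sketch (placeholder host/port, single process): with
# multi_tenant=True, a second master TCPStore on the same port should reuse
# the cached server from the C++ snippet above instead of rebinding.
from torch.distributed import TCPStore

first = TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    wait_for_workers=False, multi_tenant=True,
)
second = TCPStore(
    "localhost", 29500, world_size=1, is_master=True,
    wait_for_workers=False, multi_tenant=True,
)

# Both handles should talk to the same underlying server, so a key written
# through one is visible through the other.
first.set("probe", b"1")
print(second.get("probe"))  # expected: b'1'
```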

Send a PR first to collect feedback.

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o 

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 16, 2024
ghstack-source-id: f095408
Pull Request resolved: #135957
@fduwjj fduwjj changed the title from "[RFC][torchelastic] Fix store prefix race in rendezvous" to "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous" Sep 16, 2024
@fduwjj fduwjj requested a review from kwen2501 September 16, 2024 16:20
fduwjj added a commit that referenced this pull request Sep 16, 2024
ghstack-source-id: ca4182c
Pull Request resolved: #135957
@fduwjj fduwjj requested a review from d4l3k September 16, 2024 23:03
Member

@d4l3k d4l3k left a comment


Looking good -- can we add some tests for this?
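For what it's worth, a hypothetical sketch of the kind of test this could get; it assumes the agent-side behavior of creating a fresh ephemeral-port TCPStore per attempt, and every name here is made up for illustration:

```python
# Hypothetical test sketch (not from this PR): each simulated restart creates
# a brand-new TCPStore server on an ephemeral port, so keys from a previous
# attempt must not be visible and no attempt-count prefix is needed.
from torch.distributed import TCPStore

def test_fresh_store_per_restart():
    ports = []
    for attempt in range(2):
        store = TCPStore("localhost", 0, world_size=1,
                         is_master=True, wait_for_workers=False)
        ports.append(store.port)
        # A key written during the previous attempt should not exist here,
        # because this is a new server, not a prefixed view of the old one.
        assert store.check(["stale_key"]) is False
        store.set("stale_key", b"value")
        del store  # shut the server down before the next attempt
    assert all(p > 0 for p in ports)

if __name__ == "__main__":
    test_fresh_store_per_restart()
    print("ok")
```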

fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: 19cdfd3
Pull Request resolved: #135957
fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: e9fd7aa
Pull Request resolved: #135957
@fduwjj
Contributor Author

fduwjj commented Sep 17, 2024

Looks like there are some existing test cases already; we just need to update them.

@fduwjj fduwjj requested a review from d4l3k September 17, 2024 05:55
@fduwjj fduwjj added the ciflow/trunk label Sep 17, 2024


1. We want to take option 3 as discussed in #135712, so every time we retry, we first create a new TCPStore server so that we don't need to append the attempt count as a key prefix, which avoids eventual TCPStore sync failures. (This is only for the TCPStore-sharing-enabled case.)
2. We reuse the port from the newly created TCPStore server: we first bind it to port 0 so that the OS allocates a free port for us.
3. The port is then broadcast for dynamic_rendezvous (a conceptual sketch of this follows below).

One remaining question: what do we do about the store created by `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
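As a purely conceptual illustration of item 3 (not the PR's actual mechanism), "broadcasting" the new address and port amounts to the agent publishing them under well-known keys in the rendezvous store and the other participants reading them back; the key names and helpers below are hypothetical:

```python
# Conceptual sketch only: hypothetical key names and helpers, not PR code.
from torch.distributed import TCPStore

def publish_master_info(store, addr: str, port: int) -> None:
    # Agent side: record where the freshly created TCPStore server lives.
    store.set("MASTER_ADDR", addr)
    store.set("MASTER_PORT", str(port))

def read_master_info(store) -> tuple:
    # Worker side: discover the server the agent published.
    addr = store.get("MASTER_ADDR").decode()
    port = int(store.get("MASTER_PORT").decode())
    return addr, port

# Toy usage with a local store; host and port values are placeholders.
rdzv_store = TCPStore("localhost", 0, world_size=1,
                      is_master=True, wait_for_workers=False)
publish_master_info(rdzv_store, "localhost", 12345)
print(read_master_info(rdzv_store))  # ('localhost', 12345)
```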

cc XilunWu H-Huang awgu kwen2501 wanchaol fegin wz337 wconstab d4l3k c-p-i-o 

[ghstack-poisoned]
fduwjj added a commit that referenced this pull request Sep 17, 2024
ghstack-source-id: c3957f0
Pull Request resolved: #135957
@kwen2501 kwen2501 changed the title from "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous" to "[torchelastic][c10d] Fix store prefix race in rendezvous" Sep 24, 2024
fduwjj added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 3ec1cf9
Pull Request resolved: #135957
@fduwjj
Contributor Author

fduwjj commented Sep 25, 2024

@fduwjj has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

BoyuanFeng pushed a commit to BoyuanFeng/pytorch that referenced this pull request Sep 25, 2024
Pull Request resolved: pytorch#135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
@izaitsevfb
Contributor

@pytorchbot merge -f 'landed internally'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

@atalman
Contributor

atalman commented Sep 26, 2024

@pytorchbot merge -f 'landed internally. one more try'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/rexporting the PR!

Details for Dev Infra team: raised by workflow job.

@atalman
Contributor

atalman commented Sep 26, 2024

Please note: this change was reverted internally. I suggest we close this PR and open a new one if we want to land this change.

@fduwjj
Contributor Author

fduwjj commented Sep 26, 2024

Recreated the PR in #136768.


Labels

ciflow/trunk, Merged, oncall: distributed, release notes: distributed (torchelastic), Reverted

Development

Successfully merging this pull request may close these issues.

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 is not compatible with torchelastic restarts

9 participants