Conversation

@kwen2501 (Collaborator) commented Nov 4, 2024

Stack from ghstack (oldest at bottom):

Motivation

Today, the watchdog only reports that it found a collective timeout:

[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.

While this is nice, it is hard to associate the error with the user's program or library stack.

This PR

This PR gives the watchdog the ability to report the call-time stack of the collective, making it easier to trace the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this doc written by @zdevito: https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158). In ProcessGroupNCCL, we track and report only the Python part, since that fits most PyTorch users.

Demo

stack_demo.py (https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py

TORCH_NCCL_TRACE_BUFFER_SIZE turns on the Flight Recorder.

Output:

[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

From the log above, we can tell that bar() and baz() are the places where the two ranks diverge.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Nov 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139659

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit f35e885 with merge base 3a6f014:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 4, 2024
kwen2501 added a commit that referenced this pull request Nov 4, 2024
@kwen2501 added the ciflow/trunk and keep-going labels Nov 4, 2024
// Returns the entry with the given id, if it exists. Otherwise, returns
// std::nullopt.
std::optional<NCCLTraceBuffer::Entry> NCCLTraceBuffer::getEntry(
std::optional<size_t> id) {
Contributor:
It's a little odd to let `id` be an optional param to `getEntry`, which then returns `nullopt` if `id` is `nullopt`.

Collaborator Author:
Yeah, I went back and forth on this too. Eventually I decided to stay in line with existing API signatures, like:
std::optional<size_t> NCCLTraceBuffer::record(...)
void NCCLTraceBuffer::retire_id(std::optional<size_t> id, ...)
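For illustration, here is a minimal, self-contained toy of that optional-in / optional-out convention; the types are assumed stand-ins, not the real NCCLTraceBuffer. A std::nullopt id simply yields std::nullopt, mirroring the record() / retire_id() signatures above.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for the real entry type; only the signature convention matters here.
struct Entry {
  std::string opName;
};

std::optional<Entry> getEntry(
    const std::vector<Entry>& buffer,
    std::optional<std::size_t> id) {
  // No id recorded (e.g. recording disabled) -> nothing to return.
  if (!id.has_value() || *id >= buffer.size()) {
    return std::nullopt;
  }
  return buffer[*id];
}
```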

@wconstab (Contributor) left a comment:
LGTM, I think.

One thing is that there can be very long tracebacks in real programs, and one per rank can be a bit spammy. Maybe that's OK, but it made me hesitate a bit.

One idea: would it make sense to disable this for world sizes larger than N (= 8? 128?)? Because at 1k or 10k ranks it is pretty unlikely people want to dig through the logs, and the raw FR dump plus the tooling that comes with it would likely be easier.

cc @c-p-i-o

@kwen2501 (Collaborator Author) commented Nov 5, 2024

@wconstab those are good concerns.
I am not sure I have a perfect solution here.
But I also think the log here is way less spammy than the often-seen C++ stack traces -- which print the entire CPython stack (a lot of "EvalFrame" calls).
And some users direct logs to one destination per host, which reduces the density related to scale too.

Edit:
My general thought:

  • letting people know is better than not letting people know.
  • letting people know online is better than letting them know offline.

kwen2501 added a commit that referenced this pull request Nov 5, 2024
@wconstab (Contributor) commented Nov 5, 2024

> But I also think the log here is way less spammy than the often-seen C++ stack traces -- which print the entire CPython stack (a lot of "EvalFrame" calls).

Well, I think we disabled the C++ logs? Or maybe we only disabled them in some jobs, but it was causing people to complain. That's partly what made me think about this.

> My general thought:
> • letting people know is better than not letting people know.
> • letting people know online is better than letting them know offline.

Yeah, I agree. On the other hand, for internal jobs we will have 'online' analysis done via MAST tooling. And for OSS, jobs are more likely to be on the smaller side, but it is also more useful to have this feature enabled. Hence it might be nice to make enablement either size-based, or just flag-based so it can be controlled.

I guess we can just try it and see; if people complain we can add a flag later.
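A rough sketch of what such a gate could look like; the environment variable name and the world-size threshold below are purely illustrative assumptions, not existing PyTorch settings.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical gate: an explicit flag wins, otherwise fall back to a size-based
// default so very large jobs do not emit one traceback per rank.
bool shouldLogCallTimeStack(int worldSize) {
  const char* flag = std::getenv("HYPOTHETICAL_LOG_COLLECTIVE_STACK"); // assumed name
  if (flag != nullptr) {
    return std::string(flag) == "1";
  }
  return worldSize <= 128; // assumed threshold
}
```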

auto entryVal = entry.value();
// Get stack trace from FR entry, in string format
// Note: `getTraceback` call below invokes `torch::symbolize`.
// According to @fduwjj, `torch::symbolize` may need to acquire the GIL. In
Contributor:
it's not according to me, it's according to the code. Shall we not write a comment like this?

Contributor:
Maybe you can just say something like: "torch::symbolize would acquire the GIL, so we add a timeout to make it non-blocking"?

Collaborator Author:
Removed, sorry

auto future = std::async(
std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
// Wait for the future to complete or timeout
auto status = future.wait_for(std::chrono::seconds(5));
Contributor:
Can we use a promise (std::promise) so that in the worst case we wait for 5 secs but can finish early? Also, I think maybe we can give it 10-15 secs with a promise? Thanks

Collaborator Author:
Hmm, does std::promise provide a better wait mechanism? Here future.wait_for(5s) also allows early return, no?

std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
// Wait for the future to complete or timeout
auto status = future.wait_for(std::chrono::seconds(5));
if (status == std::future_status::ready) {
Contributor:
We have a function waitForFutureOrTimeout which you can reuse.

Collaborator Author:
Its code seems pretty heavy to understand. The added logging may be good for abort or the like; here we just have a simple case.

@fduwjj (Contributor) commented Nov 5, 2024:
Then can we at least use std::promise? Thanks

Edit: NVM, this future wait_for works.
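To illustrate the point being settled here, a small standalone example (not the PR code) showing that future.wait_for() returns as soon as the async task finishes, so the 5-second value is only an upper bound:

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <string>

int main() {
  // Stand-in for entryVal.getTraceback(); this one finishes almost immediately.
  auto future = std::async(std::launch::async, []() {
    return std::string("symbolized traceback");
  });

  // Waits at most 5 seconds, but returns early once the task is done.
  auto status = future.wait_for(std::chrono::seconds(5));
  if (status == std::future_status::ready) {
    std::cout << future.get() << std::endl;
  } else {
    std::cout << "symbolization timed out" << std::endl;
  }
  return 0;
}
```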

Comment on lines +394 to +401
/* Resulted format is like:
#0 all_reduce from pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /home/user/repro.py:15
#3 foo from /home/user/repro.py:24
#4 main from /home/user/repro.py:34
#5 <module> from /home/user/repro.py:40
*/
Contributor:

Also I am thinking, do we want to do some truncation on the trace? Like only showing the first 10-15 lines? Because in FR I noticed that the trace can be super long. I don't have a strong opinion on this.

Collaborator Author:
Hmm, I don't have enough knowledge to pick a preference. Let's see what users complain about :)
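If truncation were ever wanted, a simple sketch (not part of this PR) that keeps only the first maxFrames lines of the formatted traceback might look like this:

```cpp
#include <sstream>
#include <string>

// Keep at most maxFrames lines of a multi-line traceback string and mark the cut.
std::string truncateTraceback(const std::string& traceback, size_t maxFrames) {
  std::istringstream in(traceback);
  std::ostringstream out;
  std::string line;
  size_t kept = 0;
  while (kept < maxFrames && std::getline(in, line)) {
    out << line << '\n';
    ++kept;
  }
  if (std::getline(in, line)) {
    out << "... (truncated)\n";
  }
  return out.str();
}
```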

kwen2501 added a commit that referenced this pull request Nov 5, 2024
@shuqiangzhang (Contributor) left a comment:
Can we also print the traceback when there is any other error, e.g., in watchdogHandler?

// Get the stack trace of the work at call time
// First step we get the corresponding record entry from FR, based on work's
// trace_id_
std::optional<NCCLTraceBuffer::Entry> entry =
Contributor:
Could you place this piece of code into a helper function in WorkNCCL so that we can print the traceback whenever needed?

Collaborator Author:
Yep, I was thinking about that too. In a follow-up PR I will rearrange things. The current code is kept as-is for "review sanity".
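A hedged sketch of what that follow-up refactor might look like, reusing only the calls already visible in this diff (getEntry, getTraceback, the time-bounded std::async wait); the helper name, its return type, and its placement are assumptions, not the actual follow-up PR:

```cpp
// Sketch only: wrap the FR lookup plus time-bounded symbolization in one helper
// so other error paths (e.g. watchdogHandler) could print the traceback too.
std::string getWorkTraceback(std::optional<size_t> traceId) {
  auto entry = NCCLTraceBuffer::get()->getEntry(traceId);
  if (!entry.has_value()) {
    return ""; // Flight Recorder disabled or entry no longer available
  }
  auto entryVal = entry.value();
  auto future = std::async(
      std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
  if (future.wait_for(std::chrono::seconds(5)) == std::future_status::ready) {
    return future.get();
  }
  return ""; // symbolization did not finish in time; skip rather than block
}
```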

@kwen2501 requested a review from fduwjj November 5, 2024 19:00
@kwen2501 (Collaborator Author) commented Nov 5, 2024

@pytorchbot merge -f "CI was green before the last commit; the last commit just changes some comments in code"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


atalman pushed a commit to atalman/pytorch that referenced this pull request Nov 11, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@github-actions bot deleted the gh/kwen2501/91/head branch December 6, 2024 02:12

Labels

ciflow/trunk · keep-going · Merged · oncall: distributed · release notes: distributed (c10d)
