Conversation

@kwen2501 (Collaborator) commented Nov 4, 2024

Stack from ghstack (oldest at bottom):

Motivation

Today, the watchdog only reports that it found a collective timeout:

[rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out.

While this is nice, it is hard to associate the error with the user's program or library stack.

This PR

This PR gives the watchdog the ability to report the call-time stack of the collective, making it easier to trace the error back to the program's behavior.

The call-time stack is recorded by Flight Recorder with minimal overhead (for details, please read this doc written by @zdevito: https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158). In ProcessGroupNCCL, we track and report only the Python part, since that fits most PyTorch users.

Demo

stack_demo.py (https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09).

TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py

TORCH_NCCL_TRACE_BUFFER_SIZE turns on the Flight Recorder.

Output:

[rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /data/users/kw2501/sync_async/repro.py:15
#3 foo from /data/users/kw2501/sync_async/repro.py:24
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

[rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: 
#0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630
#1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83
#2 baz from /data/users/kw2501/sync_async/repro.py:20
#3 foo from /data/users/kw2501/sync_async/repro.py:26
#4 main from /data/users/kw2501/sync_async/repro.py:34
#5 <module> from /data/users/kw2501/sync_async/repro.py:40

From the log above, we can tell that bar() and baz() are the places where the two ranks diverge.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Nov 4, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139659

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit f35e885 with merge base 3a6f014:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Nov 4, 2024
kwen2501 added a commit that referenced this pull request Nov 4, 2024
@kwen2501 added the ciflow/trunk and keep-going labels Nov 4, 2024
// Returns the entry with the given id, if it exists. Otherwise, returns
// std::nullopt.
std::optional<NCCLTraceBuffer::Entry> NCCLTraceBuffer::getEntry(
std::optional<size_t> id) {
Contributor:
It's a little odd to let `id` be an optional param to `getEntry`, which then returns `nullopt` if `id` is `nullopt`.

Collaborator Author:
Yeah, I went back and forth on this too. Eventually I decided to stay in line with existing API signatures, like:
std::optional<size_t> NCCLTraceBuffer::record(...)
void NCCLTraceBuffer::retire_id(std::optional<size_t> id, ...)
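For illustration, here is a minimal, self-contained toy of that optional-in / optional-out convention; the types are assumed stand-ins, not the real NCCLTraceBuffer. A std::nullopt id simply yields std::nullopt, mirroring the record() / retire_id() signatures above.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for the real entry type; only the signature convention matters here.
struct Entry {
  std::string opName;
};

std::optional<Entry> getEntry(
    const std::vector<Entry>& buffer,
    std::optional<std::size_t> id) {
  // No id recorded (e.g. recording disabled) -> nothing to return.
  if (!id.has_value() || *id >= buffer.size()) {
    return std::nullopt;
  }
  return buffer[*id];
}
```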

@wconstab (Contributor) left a comment:
LGTM, I think.

One thing is that there can be very long tracebacks in real programs, and one per rank can be a bit spammy. Maybe that's OK, but it made me hesitate a bit.

One idea: would it make sense to disable this for world sizes larger than N (= 8? 128?)? Because at 1k or 10k ranks it is pretty unlikely people want to dig through the logs, and the raw FR dump plus the tooling that comes with it would likely be easier.

cc @c-p-i-o

@kwen2501 (Collaborator Author) commented Nov 5, 2024

@wconstab those are good concerns.
I am not sure I have a perfect solution here.
But I also think the log here is way less spammy than the often-seen C++ stack traces -- which print the entire CPython stack (a lot of "EvalFrame" calls).
And some users direct logs to one destination per host, which reduces the density related to scale too.

Edit:
My general thought:

  • letting people know is better than not letting people know.
  • letting people know online is better than letting them know offline.

kwen2501 added a commit that referenced this pull request Nov 5, 2024
@wconstab (Contributor) commented Nov 5, 2024

> But I also think the log here is way less spammy than the often-seen C++ stack traces -- which print the entire CPython stack (a lot of "EvalFrame" calls).

Well, I think we disabled the C++ logs? Or maybe we only disabled them in some jobs, but it was causing people to complain. That's partly what made me think about this.

> My general thought:
> • letting people know is better than not letting people know.
> • letting people know online is better than letting them know offline.

Yeah, I agree. On the other hand, for internal jobs we will have 'online' analysis done via MAST tooling. And for OSS, jobs are more likely to be on the smaller side, but it is also more useful to have this feature enabled. Hence it might be nice to make enablement either size-based, or just flag-based so it can be controlled.

I guess we can just try it and see; if people complain we can add a flag later.
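A rough sketch of what such a gate could look like; the environment variable name and the world-size threshold below are purely illustrative assumptions, not existing PyTorch settings.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical gate: an explicit flag wins, otherwise fall back to a size-based
// default so very large jobs do not emit one traceback per rank.
bool shouldLogCallTimeStack(int worldSize) {
  const char* flag = std::getenv("HYPOTHETICAL_LOG_COLLECTIVE_STACK"); // assumed name
  if (flag != nullptr) {
    return std::string(flag) == "1";
  }
  return worldSize <= 128; // assumed threshold
}
```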

auto entryVal = entry.value();
// Get stack trace from FR entry, in string format
// Note: `getTraceback` call below invokes `torch::symbolize`.
// According to @fduwjj, `torch::symbolize` may need to acquire the GIL. In
Contributor:
it's not according to me, it's according to the code. Shall we not write a comment like this?

Contributor:
Maybe you can just say something like: "torch::symbolize would acquire the GIL, so we add a timeout to make it non-blocking"?

Collaborator Author:
Removed, sorry

auto future = std::async(
std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
// Wait for the future to complete or timeout
auto status = future.wait_for(std::chrono::seconds(5));
Contributor:
Can we use a promise (std::promise) so that in the worst case we wait for 5 secs but can finish early? Also, I think maybe we can give it 10-15 secs with a promise? Thanks

Collaborator Author:
Hmm, does std::promise provide a better wait mechanism? Here future.wait_for(5s) also allows early return, no?

std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
// Wait for the future to complete or timeout
auto status = future.wait_for(std::chrono::seconds(5));
if (status == std::future_status::ready) {
Contributor:
We have a function waitForFutureOrTimeout which you can reuse.

Collaborator Author:
Its code seems pretty heavy to understand. The added logging may be good for abort or the like; here we just have a simple case.

@fduwjj (Contributor) commented Nov 5, 2024:
Then can we at least use std::promise? Thanks

Edit: NVM, this future wait_for works.
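To illustrate the point being settled here, a small standalone example (not the PR code) showing that future.wait_for() returns as soon as the async task finishes, so the 5-second value is only an upper bound:

```cpp
#include <chrono>
#include <future>
#include <iostream>
#include <string>

int main() {
  // Stand-in for entryVal.getTraceback(); this one finishes almost immediately.
  auto future = std::async(std::launch::async, []() {
    return std::string("symbolized traceback");
  });

  // Waits at most 5 seconds, but returns early once the task is done.
  auto status = future.wait_for(std::chrono::seconds(5));
  if (status == std::future_status::ready) {
    std::cout << future.get() << std::endl;
  } else {
    std::cout << "symbolization timed out" << std::endl;
  }
  return 0;
}
```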

Comment on lines +394 to +401
/* Resulted format is like:
#0 all_reduce from pytorch/torch/distributed/distributed_c10d.py:2696
#1 wrapper from pytorch/torch/distributed/c10d_logger.py:83
#2 bar from /home/user/repro.py:15
#3 foo from /home/user/repro.py:24
#4 main from /home/user/repro.py:34
#5 <module> from /home/user/repro.py:40
*/
Contributor:

Also I am thinking, do we want to do some truncation on the trace? Like only showing the first 10-15 lines? Because in FR I noticed that the trace can be super long. I don't have a strong opinion on this.

Collaborator Author:
Hmm, I don't have enough knowledge to pick a preference. Let's see what users complain about :)
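If truncation were ever wanted, a simple sketch (not part of this PR) that keeps only the first maxFrames lines of the formatted traceback might look like this:

```cpp
#include <sstream>
#include <string>

// Keep at most maxFrames lines of a multi-line traceback string and mark the cut.
std::string truncateTraceback(const std::string& traceback, size_t maxFrames) {
  std::istringstream in(traceback);
  std::ostringstream out;
  std::string line;
  size_t kept = 0;
  while (kept < maxFrames && std::getline(in, line)) {
    out << line << '\n';
    ++kept;
  }
  if (std::getline(in, line)) {
    out << "... (truncated)\n";
  }
  return out.str();
}
```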

kwen2501 added a commit that referenced this pull request Nov 5, 2024
@shuqiangzhang (Contributor) left a comment:
Can we also print the traceback when there is any other error, e.g., in watchdogHandler?

// Get the stack trace of the work at call time
// First step we get the corresponding record entry from FR, based on work's
// trace_id_
std::optional<NCCLTraceBuffer::Entry> entry =
Contributor:
Could you place this piece of code into a helper function in WorkNCCL so that we can print the traceback whenever needed?

Collaborator Author:
Yep, I was thinking about that too. In a follow-up PR I will rearrange things. The current code is kept as-is for "review sanity".
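A hedged sketch of what that follow-up refactor might look like, reusing only the calls already visible in this diff (getEntry, getTraceback, the time-bounded std::async wait); the helper name, its return type, and its placement are assumptions, not the actual follow-up PR:

```cpp
// Sketch only: wrap the FR lookup plus time-bounded symbolization in one helper
// so other error paths (e.g. watchdogHandler) could print the traceback too.
std::string getWorkTraceback(std::optional<size_t> traceId) {
  auto entry = NCCLTraceBuffer::get()->getEntry(traceId);
  if (!entry.has_value()) {
    return ""; // Flight Recorder disabled or entry no longer available
  }
  auto entryVal = entry.value();
  auto future = std::async(
      std::launch::async, [&entryVal]() { return entryVal.getTraceback(); });
  if (future.wait_for(std::chrono::seconds(5)) == std::future_status::ready) {
    return future.get();
  }
  return ""; // symbolization did not finish in time; skip rather than block
}
```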

@kwen2501 requested a review from fduwjj November 5, 2024 19:00
@kwen2501 (Collaborator Author) commented Nov 5, 2024

@pytorchbot merge -f "CI was green before the last commit; the last commit just changes some comments in code"

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team


atalman pushed a commit to atalman/pytorch that referenced this pull request Nov 11, 2024
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
@github-actions bot deleted the gh/kwen2501/91/head branch December 6, 2024 02:12

Labels

ciflow/trunk · keep-going · Merged · oncall: distributed · release notes: distributed (c10d)
