[NCCL] [For Test] In DDP's reducer merge work and future_work
#41840
Conversation
1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

[ghstack-poisoned]
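In short, the change makes each reducer bucket hold only the future returned by `getFuture()` instead of a separate `Work` handle. Below is a minimal, hypothetical sketch of that shape; the struct and helper names are made up for illustration and this is not the actual `Reducer` code in reducer.cpp.

```
// Hypothetical, simplified sketch -- not the actual Reducer implementation.
#include <c10d/ProcessGroup.hpp>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

struct Bucket {
  std::vector<at::Tensor> tensors;                      // flattened gradient contents
  c10::intrusive_ptr<c10::ivalue::Future> future_work;  // replaces the old `work` handle
};

// Launch the allreduce and keep only the future returned by getFuture().
void launch_allreduce(c10d::ProcessGroup& pg, Bucket& bucket) {
  bucket.future_work = pg.allreduce(bucket.tensors)->getFuture();
}

// At the end of the backward pass, block on the future instead of work->wait().
void finalize_bucket(Bucket& bucket) {
  bucket.future_work->wait();
}
```

The diff excerpts quoted in the review comments below show the corresponding call sites in reducer.cpp.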
💊 CI failures summary and remediations. As of commit 861496c (more details on the Dr. CI page):
🕵️ 6 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272):

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with the `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with a WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated; the alternative is `cudaLaunchHostFunc`, but that is only available in CUDA 10 and later, and according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr) `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available.

Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

#### Test plan:

Run the old `python test/distributed/test_c10d.py`, plus some additional tests:

`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case with no hook registered. Without the `then` callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.

`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies that a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:

```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```

`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840), [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

[ghstack-poisoned]
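To illustrate point 4 above: a rough sketch, under stated assumptions, of how a host callback queued with `cudaStreamAddCallback` could mark the work's future as completed once the collective's stream has drained. `StreamCallbackContext` and `markFutureCompleted` are made-up names; the actual `WorkNCCL` implementation handles errors, devices, and streams differently.

```
#include <cuda_runtime.h>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Context handed to the stream callback: the future to complete and the
// collective's output tensors (illustrative names, not the real WorkNCCL fields).
struct StreamCallbackContext {
  c10::intrusive_ptr<c10::ivalue::Future> future;
  std::vector<at::Tensor> outputs;
};

// Host callback: CUDA runs this after all previously enqueued work on the
// stream has finished, so the collective's results are ready at this point.
void CUDART_CB markFutureCompleted(cudaStream_t /*stream*/,
                                   cudaError_t status,
                                   void* userData) {
  auto* ctx = static_cast<StreamCallbackContext*>(userData);
  if (status == cudaSuccess) {
    ctx->future->markCompleted(c10::IValue(ctx->outputs));
  }
  // Error propagation to the future is omitted in this sketch.
  delete ctx;
}

// After enqueuing the NCCL collective on `ncclStream` (illustrative usage):
//   auto* ctx = new StreamCallbackContext{future, outputs};
//   cudaStreamAddCallback(ncclStream, markFutureCompleted, ctx, /*flags=*/0);
```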
Looking at the results, there seems to be a 1-1.5% regression in general, and in the case of world_size=2 there seems to be a significant regression at p90 and p95. I guess we can't land this PR without gloo support, so I'd suggest leaving the PR in the current state; we can have someone pick this work up once we start working on gloo support.
```
// #41266.
if (comm_hook_ == nullptr) {
  bucket.work = process_group_->allreduce(tensors);
  bucket.future_work = process_group_->allreduce(tensors)->getFuture();
```
Could we get rid of this if-else completely and register a hook in the DDP constructor that does allreduce by default? Then we should allow users to override the hook as many times as they wish (I think we currently only allow the hook to be registered once).
Included a new class named AllreduceHook and registered it as the default comm hook.
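Roughly, a default hook of that shape could look like the sketch below; `CommHookInterface` and `runHook` are placeholder names rather than the reducer's exact hook interface, so treat this as an outline of the idea only.

```
#include <c10d/ProcessGroup.hpp>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Placeholder hook interface for illustration; the real reducer API differs.
class CommHookInterface {
 public:
  virtual ~CommHookInterface() = default;
  // Takes the bucket's tensors and returns a future holding the reduced result.
  virtual c10::intrusive_ptr<c10::ivalue::Future> runHook(
      std::vector<at::Tensor>& tensors) = 0;
};

class AllreduceHook : public CommHookInterface {
 public:
  explicit AllreduceHook(c10d::ProcessGroup* process_group)
      : process_group_(process_group) {}

  // Default behaviour: a plain allreduce, returning the collective's future so
  // the reducer can treat hooked and un-hooked buckets uniformly.
  c10::intrusive_ptr<c10::ivalue::Future> runHook(
      std::vector<at::Tensor>& tensors) override {
    return process_group_->allreduce(tensors)->getFuture();
  }

 private:
  c10d::ProcessGroup* process_group_;  // not owned
};
```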
```
bucket.future_work->wait();
```
```
// See Note [DDP Communication Hook]
if (comm_hook_ != nullptr) {
```
Instead of checking comm_hook_ here, we can check whether future_result is the same tensor as replica.contents; if so, we can skip initializing bucket views, since we know it is not a new tensor. The is_same API can be used for this: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/TensorBody.h#L194
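For illustration, the suggestion amounts to something like the following sketch. `future_result` and `replica.contents` follow the names used in this thread, but the helper itself is hypothetical and not the reducer's actual code.

```
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Illustrative only: decide whether bucket views need re-initializing after a
// comm hook runs. Tensor::is_same checks whether two Tensors share the same
// TensorImpl, i.e. whether the hook handed back the bucket's own flat tensor.
bool needs_bucket_view_rebuild(const c10::IValue& future_result,
                               const at::Tensor& replica_contents) {
  std::vector<at::Tensor> output = future_result.toTensorVector();
  return output.empty() || !output[0].is_same(replica_contents);
}
```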
I think that would require future_value.toTensorVector first. Instead, I am doing something like this; please let me know if it isn't good:
```
if (!dynamic_cast<AllreduceHook*>(comm_hook_.get())) {
```
…e_work`"

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

#### Test plan:

1. Run `python test/distributed/test_c10d.py`.

```
ss..........ss.sss...s.ssssss.s.ss.......s...ssssssssssssssssssssssssssssssssssssssssssss........sssssss..........s.
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=70)
```

2. `flow-cli` performance tests using the NCCL backend.

* Example:

```
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 2, "use_fp16": false, "print_percentile": true, "backend": "nccl"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_nccl_getFutureD22660198 --run-as-secure-group pytorch_distributed
```

* **World size = 1, master:** f206781394, f206947522

```
sum: 8 GPUs: p25: 0.109 293/s p50: 0.109 293/s p75: 0.109 293/s p90: 0.109 292/s p95: 0.109 292/s
fwds: 8 GPUs: p25: 0.032 1002/s p50: 0.032 999/s p75: 0.032 995/s p90: 0.032 991/s p95: 0.032 989/s
bwds: 8 GPUs: p25: 0.074 431/s p50: 0.074 430/s p75: 0.075 429/s p90: 0.075 428/s p95: 0.075 427/s
opts: 8 GPUs: p25: 0.003 12007/s p50: 0.003 11565/s p75: 0.003 11377/s p90: 0.003 11325/s p95: 0.003 11299/s
```

```
sum: 8 GPUs: p25: 0.109 293/s p50: 0.109 293/s p75: 0.109 293/s p90: 0.109 292/s p95: 0.109 292/s
fwds: 8 GPUs: p25: 0.032 1005/s p50: 0.032 999/s p75: 0.032 993/s p90: 0.032 989/s p95: 0.032 987/s
bwds: 8 GPUs: p25: 0.074 431/s p50: 0.074 430/s p75: 0.075 428/s p90: 0.075 427/s p95: 0.075 426/s
opts: 8 GPUs: p25: 0.003 12085/s p50: 0.003 11663/s p75: 0.003 11451/s p90: 0.003 11364/s p95: 0.003 11326/s
```

* **World size = 1, getFuture API:** f206788430, f206945375

```
sum: 8 GPUs: p25: 0.120 266/s p50: 0.121 265/s p75: 0.121 263/s p90: 0.122 262/s p95: 0.123 260/s
fwds: 8 GPUs: p25: 0.032 989/s p50: 0.033 983/s p75: 0.033 976/s p90: 0.033 971/s p95: 0.033 968/s
bwds: 8 GPUs: p25: 0.079 404/s p50: 0.081 394/s p75: 0.083 384/s p90: 0.085 378/s p95: 0.085 374/s
opts: 8 GPUs: p25: 0.005 6285/s p50: 0.007 4482/s p75: 0.009 3535/s p90: 0.012 2710/s p95: 0.012 2580/s
```

```
sum: 8 GPUs: p25: 0.120 267/s p50: 0.120 266/s p75: 0.121 264/s p90: 0.121 263/s p95: 0.122 261/s
fwds: 8 GPUs: p25: 0.032 995/s p50: 0.032 989/s p75: 0.033 983/s p90: 0.033 978/s p95: 0.033 975/s
bwds: 8 GPUs: p25: 0.076 419/s p50: 0.078 408/s p75: 0.080 397/s p90: 0.082 390/s p95: 0.083 384/s
opts: 8 GPUs: p25: 0.008 4086/s p50: 0.010 3210/s p75: 0.011 2836/s p90: 0.012 2762/s p95: 0.012 2691/s
```

* **World size = 2, master:** f206922196

```
sum: 16 GPUs: p25: 0.121 263/s p50: 0.122 261/s p75: 0.125 256/s p90: 0.126 253/s p95: 0.127 252/s
fwds: 16 GPUs: p25: 0.032 1004/s p50: 0.032 998/s p75: 0.032 993/s p90: 0.032 989/s p95: 0.032 987/s
bwds: 16 GPUs: p25: 0.087 369/s p50: 0.088 365/s p75: 0.090 355/s p90: 0.091 350/s p95: 0.092 346/s
opts: 16 GPUs: p25: 0.003 12120/s p50: 0.003 11587/s p75: 0.003 11379/s p90: 0.003 11185/s p95: 0.003 11126/s
```

* **World size = 2, getFuture API:** f206921378

```
sum: 16 GPUs: p25: 0.126 253/s p50: 0.127 252/s p75: 0.127 251/s p90: 0.128 249/s p95: 0.130 247/s
fwds: 16 GPUs: p25: 0.032 997/s p50: 0.032 990/s p75: 0.033 983/s p90: 0.033 978/s p95: 0.033 976/s
bwds: 16 GPUs: p25: 0.091 353/s p50: 0.091 349/s p75: 0.092 347/s p90: 0.093 344/s p95: 0.094 338/s
opts: 16 GPUs: p25: 0.003 12083/s p50: 0.003 11569/s p75: 0.003 11300/s p90: 0.007 4846/s p95: 0.009 3543/s
```

* **Memory:**
* **World size = 2, master:** f206922196

```
CUDA Memory Summary After first iteration:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| Active memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 3490 MB | 3490 MB | 3490 MB | 0 B |
| from large pool | 3434 MB | 3434 MB | 3434 MB | 0 B |
| from small pool | 56 MB | 56 MB | 56 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 315332 KB | 343472 KB | 2295 MB | 1987 MB |
| from large pool | 311166 KB | 340158 KB | 2239 MB | 1935 MB |
| from small pool | 4166 KB | 4334 KB | 56 MB | 52 MB |
|---------------------------------------------------------------------------|
| Allocations | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| Active allocs | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 102 | 102 | 102 | 0 |
| from large pool | 74 | 74 | 74 | 0 |
| from small pool | 28 | 28 | 28 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 34 | 54 | 431 | 397 |
| from large pool | 15 | 48 | 208 | 193 |
| from small pool | 19 | 19 | 223 | 204 |
|===========================================================================|
```

* **World size = 2, getFuture API:** f206921378

```
CUDA Memory Summary After first iteration:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| Active memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 3490 MB | 3490 MB | 3490 MB | 0 B |
| from large pool | 3434 MB | 3434 MB | 3434 MB | 0 B |
| from small pool | 56 MB | 56 MB | 56 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 315332 KB | 343472 KB | 2295 MB | 1987 MB |
| from large pool | 311166 KB | 340158 KB | 2239 MB | 1935 MB |
| from small pool | 4166 KB | 4334 KB | 56 MB | 52 MB |
|---------------------------------------------------------------------------|
| Allocations | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| Active allocs | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 102 | 102 | 102 | 0 |
| from large pool | 74 | 74 | 74 | 0 |
| from small pool | 28 | 28 | 28 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 34 | 54 | 431 | 397 |
| from large pool | 15 | 48 | 208 | 193 |
| from small pool | 19 | 19 | 223 | 204 |
|===========================================================================|
```

[ghstack-poisoned]
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.
4. We register a default communication hook, AllreduceHook, with the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.

ghstack-source-id: 108410963
Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
Summary:
Pull Request resolved: #41596

We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272):

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with the `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with a WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated; the alternative is `cudaLaunchHostFunc`, but that is only available in CUDA 10 and later, and according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr) `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available.

ghstack-source-id: 108409748

Test Plan: Run the old `python test/distributed/test_c10d.py`, plus some additional tests:

`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case with no hook registered. Without the `then` callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.

`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies that a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:

```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```

`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840), [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
@pritamdamania87, @sinannasir, should this PR be closed?
@ngimel Sure, we can close this out; it was mostly kept as a reference for once we support all the backends. We can reuse this PR later, as and when we add support for all backends.
Stack from ghstack:
* #41840 [NCCL] [For Test] In DDP's reducer merge `work` and `future_work`

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue #41266 once the `get_future` API supports the gloo and mpi backends.

Differential Revision: D22660198
Test plan:
Run `python test/distributed/test_c10d.py`.