
Conversation

@sinannasir
Contributor

@sinannasir sinannasir commented Jul 22, 2020

Stack from ghstack:

  1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
  2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
  3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.

Differential Revision: D22660198

Test plan:

  1. Run `python test/distributed/test_c10d.py`.
```
ss..........ss.sss...s.ssssss.s.ss.......s...ssssssssssssssssssssssssssssssssssssssssssss........sssssss..........s.
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=70)
```
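For context, the core replacement this stack makes in the reducer can be sketched as follows (an illustrative outline pieced together from the review excerpts below, not the exact diff):

```cpp
// Sketch: the reducer holds a Future obtained from the new getFuture()
// API instead of a Work handle.
// Before: bucket.work = process_group_->allreduce(tensors);
bucket.future_work = process_group_->allreduce(tensors)->getFuture();

// ...and later, when finalizing the bucket:
bucket.future_work->wait();  // previously bucket.work->wait()
```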

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 22, 2020
1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

ghstack-source-id: 108226828
Pull Request resolved: #41840
@dr-ci

dr-ci bot commented Jul 22, 2020

💊 CI failures summary and remediations

As of commit 861496c (more details on the Dr. CI page):



🕵️ 6 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 01 16:12:30 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Sep 01 16:12:30 Traceback (most recent call last): 
Sep 01 16:12:30   File "test/run_test.py", line 734, in <module> 
Sep 01 16:12:30     main() 
Sep 01 16:12:30   File "test/run_test.py", line 717, in main 
Sep 01 16:12:30     raise RuntimeError(err_message) 
Sep 01 16:12:30 RuntimeError: distributed/test_distributed failed! 
Sep 01 16:12:30 + cleanup 
Sep 01 16:12:30 + retcode=1 
Sep 01 16:12:30 + set +x 
Sep 01 16:12:30 =================== sccache compilation log =================== 
Sep 01 16:12:30 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Sep 01 16:12:30  
Sep 01 16:12:30 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 01 16:12:30 Compile requests                 63 
Sep 01 16:12:30 Compile requests executed        35 
Sep 01 16:12:30 Cache hits                       27 
Sep 01 16:12:30 Cache misses                      7 
Sep 01 16:12:30 Cache timeouts                    0 
Sep 01 16:12:30 Cache read errors                 0 
Sep 01 16:12:30 Forced recaches                   0 
Sep 01 16:12:30 Cache write errors                0 

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_test2 (2/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 01 15:58:49 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in
Sep 01 15:58:49     #7 0x5624064f17eb in PyEval_EvalCode /tmp/build/80754af9/python_1588903631989/work/Python/ceval.c:731 
Sep 01 15:58:49     #8 0x562406571e73 in run_mod /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:1025 
Sep 01 15:58:49     #9 0x562406571f0c in PyRun_StringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:949 
Sep 01 15:58:49     #10 0x562406571f6e in PyRun_SimpleStringFlags /tmp/build/80754af9/python_1588903631989/work/Python/pythonrun.c:445 
Sep 01 15:58:49     #11 0x562406575d72 in run_command /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:301 
Sep 01 15:58:49     #12 0x562406575d72 in Py_Main /tmp/build/80754af9/python_1588903631989/work/Modules/main.c:749 
Sep 01 15:58:49     #13 0x56240643ff2d in main /tmp/build/80754af9/python_1588903631989/work/Programs/python.c:69 
Sep 01 15:58:49     #14 0x7f1da271d83f in __libc_start_main /build/glibc-e6zv40/glibc-2.23/csu/../csu/libc-start.c:291 
Sep 01 15:58:49     #15 0x56240651f27e in _start /home/rdonnelly/mc/conda-bld/compilers_linux-64_1534865402226/work/.build/src/glibc-2.12.2/csu/../sysdeps/x86_64/elf/start.S:103 
Sep 01 15:58:49  
Sep 01 15:58:49 SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/Utils.cpp:11:3 in  
Sep 01 15:58:49 + retcode=1 
Sep 01 15:58:49 + set -e 
Sep 01 15:58:49 + return 1 
Sep 01 15:58:49 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX-* ]] 
Sep 01 15:58:49 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 == *-NO_AVX2-* ]] 
Sep 01 15:58:49 + '[' -n https://github.com/pytorch/pytorch/pull/41840 ']' 
Sep 01 15:58:49 + [[ pytorch-linux-xenial-py3-clang5-asan-test2 != *coverage* ]] 
Sep 01 15:58:49 ++ mktemp 
Sep 01 15:58:49 + DETERMINE_FROM=/tmp/tmp.CqjxY6xKyj 
Sep 01 15:58:49 + file_diff_from_base /tmp/tmp.CqjxY6xKyj 

See CircleCI build pytorch_linux_bionic_py3_6_clang9_test (3/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 01 16:12:01 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Sep 01 16:12:01     raise RuntimeError(err_message) 
Sep 01 16:12:01 RuntimeError: distributed/test_distributed failed! 
Sep 01 16:12:01  
Sep 01 16:12:01 real	8m8.487s 
Sep 01 16:12:01 user	7m52.039s 
Sep 01 16:12:01 sys	1m31.678s 
Sep 01 16:12:01 + cleanup 
Sep 01 16:12:01 + retcode=1 
Sep 01 16:12:01 + set +x 
Sep 01 16:12:01 =================== sccache compilation log =================== 
Sep 01 16:12:01 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Sep 01 16:12:01  
Sep 01 16:12:01 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 01 16:12:01 Compile requests                 63 
Sep 01 16:12:01 Compile requests executed        35 
Sep 01 16:12:01 Cache hits                       27 
Sep 01 16:12:01 Cac

See CircleCI build pytorch_linux_bionic_py3_8_gcc9_coverage_test (4/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 01 16:12:47 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:22: error: expected ‘;’ before ‘}’ token\n 2 | int main() { return 0 }\n | ^~\n | ;\n" }
Sep 01 16:12:47     raise RuntimeError(err_message) 
Sep 01 16:12:47 RuntimeError: distributed/test_distributed failed! 
Sep 01 16:12:47  
Sep 01 16:12:47 real	6m5.221s 
Sep 01 16:12:47 user	6m13.355s 
Sep 01 16:12:47 sys	0m23.962s 
Sep 01 16:12:47 + cleanup 
Sep 01 16:12:47 + retcode=1 
Sep 01 16:12:47 + set +x 
Sep 01 16:12:47 =================== sccache compilation log =================== 
Sep 01 16:12:47 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function ‘int main()’:\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:22: error: expected ‘;’ before ‘}’ token\n    2 | int main() { return 0 }\n      |                      ^~\n      |                      ;\n" } 
Sep 01 16:12:47  
Sep 01 16:12:47 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 01 16:12:47 Compile requests                 63 
Sep 01 16:12:47 Compile requests executed        35 
Sep 01 16:12:47 Cache hits                       27 
Sep 01 16:12:47 Cache misses                      7 
Sep 01 16:12:47 Cache timeouts                    0 
Sep 01 16:12:47 Cache read errors                 0 
Sep 01 16:12:47 Forced recaches                   0 
Sep 01 16:12:47 Cache write errors                0 

See CircleCI build pytorch_macos_10_13_py3_test (5/6)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Sep 01 09:09:19 ERROR [0.023s]: test_sparse_all_reduce_sum_cuda (__main__.TestDistBackend)
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/multiprocessing/managers.py", line 847, in _incref 
Sep 01 09:09:19     conn = self._Client(self._token.address, authkey=self._authkey) 
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/multiprocessing/connection.py", line 492, in Client 
Sep 01 09:09:19     c = SocketClient(address) 
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/multiprocessing/connection.py", line 619, in SocketClient 
Sep 01 09:09:19     s.connect(address) 
Sep 01 09:09:19 ConnectionRefusedError: [Errno 61] Connection refused 
Sep 01 09:09:19 --------------------------------------------------------------------------- 
Sep 01 09:09:19  
Sep 01 09:09:19 ====================================================================== 
Sep 01 09:09:19 ERROR [0.023s]: test_sparse_all_reduce_sum_cuda (__main__.TestDistBackend) 
Sep 01 09:09:19 ---------------------------------------------------------------------- 
Sep 01 09:09:19 Traceback (most recent call last): 
Sep 01 09:09:19   File "distributed/test_distributed.py", line 3153, in setUp 
Sep 01 09:09:19     self._fork_processes() 
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 282, in _fork_processes 
Sep 01 09:09:19     self._start_processes(proc) 
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 268, in _start_processes 
Sep 01 09:09:19     test_skips.update(TEST_SKIPS) 
Sep 01 09:09:19   File "<string>", line 2, in update 
Sep 01 09:09:19   File "/Users/distiller/workspace/miniconda3/lib/python3.7/multiprocessing/managers.py", line 834, in _callmethod 

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (6/6)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 01 16:13:13 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n ^\n" }
Sep 01 16:13:13 Traceback (most recent call last): 
Sep 01 16:13:13   File "test/run_test.py", line 734, in <module> 
Sep 01 16:13:13     main() 
Sep 01 16:13:13   File "test/run_test.py", line 717, in main 
Sep 01 16:13:13     raise RuntimeError(err_message) 
Sep 01 16:13:13 RuntimeError: distributed/test_distributed failed! 
Sep 01 16:13:13 + cleanup 
Sep 01 16:13:13 + retcode=1 
Sep 01 16:13:13 + set +x 
Sep 01 16:13:13 =================== sccache compilation log =================== 
Sep 01 16:13:13 ERROR:sccache::server: Compilation failed: Output { status: ExitStatus(ExitStatus(256)), stdout: "", stderr: "/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp: In function \'int main()\':\n/var/lib/jenkins/.cache/torch_extensions/test_compilation_error_formatting/main.cpp:2:23: error: expected \';\' before \'}\' token\n int main() { return 0 }\n                       ^\n" } 
Sep 01 16:13:13  
Sep 01 16:13:13 =========== If your build fails, please take a look at the log above for possible reasons =========== 
Sep 01 16:13:13 Compile requests                 63 
Sep 01 16:13:13 Compile requests executed        35 
Sep 01 16:13:13 Cache hits                       27 
Sep 01 16:13:13 Cache misses                      7 
Sep 01 16:13:13 Cache timeouts                    0 
Sep 01 16:13:13 Cache read errors                 0 
Sep 01 16:13:13 Forced recaches                   0 
Sep 01 16:13:13 Cache write errors                0 

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun) ❄️

Sep 01 16:56:22 ConnectionResetError: [Errno 104] Connection reset by peer
Sep 01 16:56:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 455, in accept 
Sep 01 16:56:22     deliver_challenge(c, self._authkey) 
Sep 01 16:56:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 722, in deliver_challenge 
Sep 01 16:56:22     response = connection.recv_bytes(256)        # reject large message 
Sep 01 16:56:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes 
Sep 01 16:56:22     buf = self._recv_bytes(maxlength) 
Sep 01 16:56:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes 
Sep 01 16:56:22     buf = self._recv(4) 
Sep 01 16:56:22   File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 379, in _recv 
Sep 01 16:56:22     chunk = read(handle, remaining) 
Sep 01 16:56:22 ConnectionResetError: [Errno 104] Connection reset by peer 
Sep 01 16:56:22 /opt/conda/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown 
Sep 01 16:56:22   len(cache)) 
Sep 01 16:56:25 Process ErrorTrackingProcess-152: 
Sep 01 16:56:25 Traceback (most recent call last): 
Sep 01 16:56:25   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap 
Sep 01 16:56:25     self.run() 
Sep 01 16:56:25   File "/var/lib/jenkins/workspace/test/test_dataloader.py", line 361, in run 
Sep 01 16:56:25     super(ErrorTrackingProcess, self).run() 
Sep 01 16:56:25   File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run 
Sep 01 16:56:25     self._target(*self._args, **self._kwargs) 

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

sinannasir added a commit that referenced this pull request Jul 23, 2020
We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is only supported on CUDA 10 and later, and `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available, according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
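A minimal sketch of the API shape that items 1-4 describe, assuming PyTorch's internal c10/ATen headers are available; apart from the `getFuture()` signature quoted above, all names and details here are illustrative rather than the actual ProcessGroupNCCL code:

```cpp
#include <stdexcept>
#include <vector>

#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>
#include <c10/util/intrusive_ptr.h>
#include <cuda_runtime.h>

namespace c10d {

class Work {
 public:
  virtual ~Work() = default;
  // Item 3: the default implementation throws (the "UnsupportedOperation"
  // behavior); only NCCL overrides this in the first version.
  virtual c10::intrusive_ptr<c10::ivalue::Future> getFuture() {
    throw std::runtime_error("getFuture is not supported by this backend");
  }
};

class WorkNCCL : public Work {
 public:
  // Item 1: the Future is created together with the WorkNCCL object.
  WorkNCCL()
      : future_(c10::make_intrusive<c10::ivalue::Future>(
            c10::ListType::ofTensors())) {}

  // Item 2: hand out the stored Future.
  c10::intrusive_ptr<c10::ivalue::Future> getFuture() override {
    return future_;
  }

  // Item 4: a host callback enqueued on the NCCL stream, e.g. via
  // cudaStreamAddCallback(stream, &WorkNCCL::markCompleted, this, 0),
  // marks the Future complete once the collective's kernels finish.
  static void CUDART_CB markCompleted(
      cudaStream_t /*stream*/, cudaError_t /*status*/, void* userData) {
    auto* work = static_cast<WorkNCCL*>(userData);
    work->future_->markCompleted(c10::IValue(work->outputs_));
  }

 private:
  c10::intrusive_ptr<c10::ivalue::Future> future_;
  std::vector<at::Tensor> outputs_;  // illustrative stand-in for the results
};

} // namespace c10d
```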

Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

#### Test plan:
Run the existing tests: `python test/distributed/test_c10d.py`.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the `then` callback, the `future_value` in the reducer is no longer a PyObject, and this unit test verifies that `future_value` is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).


[ghstack-poisoned]
@pritamdamania87
Contributor

Looking at the results, there seems to be a 1-1.5% regression in general, although in the case of world_size=2, there seems to be a significant regression at p90 and p95 for opts (not sure what it refers to; the optimizer?).

I guess we can't land this PR without gloo support. So I'd suggest leaving the PR in the current state and then we can have someone pick this work up once we start working on gloo support.

```diff
  // #41266.
  if (comm_hook_ == nullptr) {
-   bucket.work = process_group_->allreduce(tensors);
+   bucket.future_work = process_group_->allreduce(tensors)->getFuture();
```
Contributor


We can get rid of this if-else completely and register a hook in the DDP constructor that does allreduce by default. Then we should allow users to override the hook as many times as they wish (I think we currently only allow the hook to be registered once).

Contributor Author


Included a new class named `AllreduceHook` and registered it as the default comm hook.
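Roughly, such a default hook could look like this (a hypothetical sketch; `CommHookInterface` and the `runHook` signature are assumptions for illustration, not the PR's actual code):

```cpp
// Hypothetical sketch: a default hook that reproduces the plain
// allreduce path through the new getFuture() API.
class AllreduceHook : public CommHookInterface {
 public:
  explicit AllreduceHook(ProcessGroup* process_group)
      : process_group_(process_group) {}

  c10::intrusive_ptr<c10::ivalue::Future> runHook(
      std::vector<at::Tensor>& tensors) override {
    // Same behavior as the old comm_hook_ == nullptr branch.
    return process_group_->allreduce(tensors)->getFuture();
  }

 private:
  ProcessGroup* process_group_;
};
```

With such a hook registered by default, the `comm_hook_ == nullptr` branch above goes away.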

```cpp
bucket.future_work->wait();

// See Note [DDP Communication Hook]
if (comm_hook_ != nullptr) {
```
Contributor


Instead of checking on `comm_hook_` here, we can check whether `future_result` is the same as `replica.contents`; if so, you can skip initializing bucket views since you know it is not a new tensor. You can use the `is_same` API for this: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/TensorBody.h#L194
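In code, the suggestion amounts to something like this (variable names follow the comment above; the helper is hypothetical):

```cpp
// Sketch: is_same checks whether two Tensors share the same TensorImpl,
// i.e. whether the hook returned the reducer's own storage rather than
// a new tensor.
if (!future_result.is_same(replica.contents)) {
  // Only a brand-new tensor requires re-pointing the bucket views.
  rebuild_bucket_views(replica, future_result);  // hypothetical helper
}
```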

Contributor Author


I think that would require `future_value.toTensorVector()` first. Instead, I am doing something like this; please let me know if it isn't good:

```cpp
if (!dynamic_cast<AllreduceHook*>(comm_hook_.get())) {
```
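In context, the type-based check might be used roughly like this (a sketch; the helper name is hypothetical), avoiding a `future_value.toTensorVector()` call made solely to compare tensor identity:

```cpp
// Sketch: the default AllreduceHook writes into the reducer's own
// tensors, so only custom hooks require rebuilding bucket views.
if (!dynamic_cast<AllreduceHook*>(comm_hook_.get())) {
  rebuild_bucket_views(replica, future_result);  // hypothetical helper
}
```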

sinannasir added a commit that referenced this pull request Jul 24, 2020
We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is only supported on CUDA 10 and later, and `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available, according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).

Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

#### Test plan:
Run the existing tests: `python test/distributed/test_c10d.py`.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the `then` callback, the `future_value` in the reducer is no longer a PyObject, and this unit test verifies that `future_value` is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).


[ghstack-poisoned]
…e_work`"


1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

#### Test plan:
1. Run `python test/distributed/test_c10d.py`.
```
ss..........ss.sss...s.ssssss.s.ss.......s...ssssssssssssssssssssssssssssssssssssssssssss........sssssss..........s.
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=70)
```
2. `flow-cli` performance tests using NCCL backend.
* Example: 
```
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 2, "use_fp16": false, "print_percentile": true, "backend": "nccl"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_nccl_getFutureD22660198 --run-as-secure-group pytorch_distributed
```
* **World size = 1, master:** f206781394, f206947522
```
sum: 
8 GPUs: p25:  0.109   293/s  p50:  0.109   293/s  p75:  0.109   293/s  p90:  0.109   292/s  p95:  0.109   292/s  
fwds: 
8 GPUs: p25:  0.032  1002/s  p50:  0.032   999/s  p75:  0.032   995/s  p90:  0.032   991/s  p95:  0.032   989/s  
bwds: 
8 GPUs: p25:  0.074   431/s  p50:  0.074   430/s  p75:  0.075   429/s  p90:  0.075   428/s  p95:  0.075   427/s  
opts: 
8 GPUs: p25:  0.003  12007/s  p50:  0.003  11565/s  p75:  0.003  11377/s  p90:  0.003  11325/s  p95:  0.003  11299/s
```
```
sum: 
8 GPUs: p25:  0.109   293/s  p50:  0.109   293/s  p75:  0.109   293/s  p90:  0.109   292/s  p95:  0.109   292/s  
fwds: 
8 GPUs: p25:  0.032  1005/s  p50:  0.032   999/s  p75:  0.032   993/s  p90:  0.032   989/s  p95:  0.032   987/s  
bwds: 
8 GPUs: p25:  0.074   431/s  p50:  0.074   430/s  p75:  0.075   428/s  p90:  0.075   427/s  p95:  0.075   426/s  
opts: 
8 GPUs: p25:  0.003  12085/s  p50:  0.003  11663/s  p75:  0.003  11451/s  p90:  0.003  11364/s  p95:  0.003  11326/s
```
* **World size = 1, getFuture API:** f206788430, f206945375
```
sum: 
8 GPUs: p25:  0.120   266/s  p50:  0.121   265/s  p75:  0.121   263/s  p90:  0.122   262/s  p95:  0.123   260/s  
fwds: 
8 GPUs: p25:  0.032   989/s  p50:  0.033   983/s  p75:  0.033   976/s  p90:  0.033   971/s  p95:  0.033   968/s  
bwds: 
8 GPUs: p25:  0.079   404/s  p50:  0.081   394/s  p75:  0.083   384/s  p90:  0.085   378/s  p95:  0.085   374/s  
opts: 
8 GPUs: p25:  0.005  6285/s  p50:  0.007  4482/s  p75:  0.009  3535/s  p90:  0.012  2710/s  p95:  0.012  2580/s
```
```
sum: 
8 GPUs: p25:  0.120   267/s  p50:  0.120   266/s  p75:  0.121   264/s  p90:  0.121   263/s  p95:  0.122   261/s  
fwds: 
8 GPUs: p25:  0.032   995/s  p50:  0.032   989/s  p75:  0.033   983/s  p90:  0.033   978/s  p95:  0.033   975/s  
bwds: 
8 GPUs: p25:  0.076   419/s  p50:  0.078   408/s  p75:  0.080   397/s  p90:  0.082   390/s  p95:  0.083   384/s  
opts: 
8 GPUs: p25:  0.008  4086/s  p50:  0.010  3210/s  p75:  0.011  2836/s  p90:  0.012  2762/s  p95:  0.012  2691/s
```

* **World size = 2, master:** f206922196
```
sum: 
16 GPUs: p25:  0.121   263/s  p50:  0.122   261/s  p75:  0.125   256/s  p90:  0.126   253/s  p95:  0.127   252/s  
fwds: 
16 GPUs: p25:  0.032  1004/s  p50:  0.032   998/s  p75:  0.032   993/s  p90:  0.032   989/s  p95:  0.032   987/s  
bwds: 
16 GPUs: p25:  0.087   369/s  p50:  0.088   365/s  p75:  0.090   355/s  p90:  0.091   350/s  p95:  0.092   346/s  
opts: 
16 GPUs: p25:  0.003  12120/s  p50:  0.003  11587/s  p75:  0.003  11379/s  p90:  0.003  11185/s  p95:  0.003  11126/s
```
* **World size = 2, getFuture API:** f206921378
```
sum: 
16 GPUs: p25:  0.126   253/s  p50:  0.127   252/s  p75:  0.127   251/s  p90:  0.128   249/s  p95:  0.130   247/s  
fwds: 
16 GPUs: p25:  0.032   997/s  p50:  0.032   990/s  p75:  0.033   983/s  p90:  0.033   978/s  p95:  0.033   976/s  
bwds: 
16 GPUs: p25:  0.091   353/s  p50:  0.091   349/s  p75:  0.092   347/s  p90:  0.093   344/s  p95:  0.094   338/s  
opts: 
16 GPUs: p25:  0.003  12083/s  p50:  0.003  11569/s  p75:  0.003  11300/s  p90:  0.007  4846/s  p95:  0.009  3543/s
```
* **Memory:**
* **World size = 2, master:** f206922196
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|
```
* **World size = 2, getFuture API:** f206921378
```
CUDA Memory Summary After                     first iteration: |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| Active memory         |  428091 KB |    2892 MB |    9825 MB |    9407 MB |
|       from large pool |  374913 KB |    2874 MB |    9752 MB |    9386 MB |
|       from small pool |   53178 KB |      52 MB |      73 MB |      21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    3490 MB |    3490 MB |    3490 MB |       0 B  |
|       from large pool |    3434 MB |    3434 MB |    3434 MB |       0 B  |
|       from small pool |      56 MB |      56 MB |      56 MB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |  315332 KB |  343472 KB |    2295 MB |    1987 MB |
|       from large pool |  311166 KB |  340158 KB |    2239 MB |    1935 MB |
|       from small pool |    4166 KB |    4334 KB |      56 MB |      52 MB |
|---------------------------------------------------------------------------|
| Allocations           |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| Active allocs         |     704    |     705    |    1390    |     686    |
|       from large pool |      60    |     131    |     395    |     335    |
|       from small pool |     644    |     645    |     995    |     351    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     102    |     102    |     102    |       0    |
|       from large pool |      74    |      74    |      74    |       0    |
|       from small pool |      28    |      28    |      28    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      34    |      54    |     431    |     397    |
|       from large pool |      15    |      48    |     208    |     193    |
|       from small pool |      19    |      19    |     223    |     204    |
|===========================================================================|

```

[ghstack-poisoned]
@sinannasir sinannasir requested a review from apaszke as a code owner July 24, 2020 02:58
sinannasir added a commit that referenced this pull request Jul 24, 2020
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.
4. We register a default communication hook, called AllreduceHook, to the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.
ghstack-source-id: 108410963

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
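Items 5 and 6 amount to letting re-registration overwrite the current hook (a minimal sketch, assuming a `Reducer::register_comm_hook` method; the real names and any remaining guards may differ):

```cpp
// Sketch: re-registration simply replaces the current hook. The removed
// test_ddp_comm_hook_register_just_once test covered the old
// register-once check that this change drops.
void Reducer::register_comm_hook(std::unique_ptr<CommHookInterface> hook) {
  comm_hook_ = std::move(hook);
}
```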
sinannasir added a commit that referenced this pull request Jul 24, 2020
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.
4. We register a default communication hook, called AllreduceHook, to the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.
ghstack-source-id: 108445379

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
facebook-github-bot pushed a commit that referenced this pull request Jul 24, 2020
Summary:
Pull Request resolved: #41596

We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272).

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.

`cudaStreamAddCallback` is marked as deprecated. An alternative is `cudaLaunchHostFunc`, but it is only supported on CUDA 10 and later, and `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available, according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
ghstack-source-id: 108409748

Test Plan:
Run the existing tests: `python test/distributed/test_c10d.py`.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies whether a DDP communication hook that just calls allreduce gives the same result as the case of no hook registered. Without the `then` callback, the `future_value` in the reducer is no longer a PyObject, and this unit test verifies that `future_value` is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```
`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
…e_work`"


1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

#### Test plan:
1. Run `python test/distributed/test_c10d.py`.
```
ss..........ss.sss...s.ssssss.s.ss.......s...ssssssssssssssssssssssssssssssssssssssssssss........sssssss..........s.
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=70)
```

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 7, 2020
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.
4. We register a default communication hook, called AllreduceHook, to the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.
ghstack-source-id: 109448918

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
sinannasir added a commit that referenced this pull request Sep 1, 2020
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so gloo tests are disabled.
2. I'll use this PR to test NCCL's `get_future` API's flow-cli performance to validate no regression on speed.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports gloo and mpi backends.
4. We register a default communication hook, called AllreduceHook, to the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.
ghstack-source-id: 111148361

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
@ngimel
Collaborator

ngimel commented Sep 15, 2020

@pritamdamania87, @sinannasir should this PR be closed?

@ngimel ngimel added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Sep 16, 2020
@pritamdamania87
Contributor

@ngimel Sure, we can close this out; it was mostly for reference once we support all the backends. We can reuse this PR later as and when we add support for all backends.

@facebook-github-bot facebook-github-bot deleted the gh/sinannasir/6/head branch October 17, 2020 14:15