DDP Communication hook: Fix the way we pass future result to buckets. #43307
Conversation
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`. It looks like passing the result of the future through re-`initialize_bucketviews` was causing the problem. One easy fix is to simply do a double `copy_` via `bucket.replicas[i].contents.copy_(future_result[i]);`.

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.

Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)

[ghstack-poisoned]
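To illustrate the failure mode and the fix described above, here is a minimal sketch in plain `torch` (tensor names and sizes are illustrative, not the reducer's actual data structures):

```python
import torch

# Flat per-bucket buffer and views into it, mimicking the reducer's layout
# (illustrative only, not the real reducer data structures).
contents = torch.zeros(8)
bucket_views = [contents[0:4], contents[4:8]]

# Result returned by the communication hook's future (illustrative values).
future_result = torch.arange(8, dtype=torch.float32)

# The fix described above: copy the result back into the existing buffer,
# so every view into `contents` still observes the reduced values.
contents.copy_(future_result)
assert torch.equal(bucket_views[0], future_result[0:4])

# The buggy pattern: rebinding the views to the result tensor severs their
# link to `contents`; later writes to `contents` no longer reach the views,
# which compounds over multiple iterations.
bucket_views = [future_result[0:4], future_result[4:8]]
contents.zero_()
assert not torch.equal(bucket_views[0], contents[0:4])
```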
💊 CI failures summary and remediations

As of commit 20ce201 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
test/distributed/test_c10d.py (Outdated)

```diff
 """
 int_devices = gpus_for_rank(self.world_size)[self.rank][:1]
-devices = list([torch.device('cuda:' + str(i)) for i in int_devices])
+devices = list([torch.device("cuda:" + str(i)) for i in int_devices])
```
nit: since you're already changing this line, might as well remove the call to list() which is not needed
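For illustration, the suggested form might look like this (a sketch of the reviewer's suggestion, not the committed change; `int_devices` is given a placeholder value here):

```python
import torch

int_devices = [0]  # illustrative; the test derives this from gpus_for_rank(...)
# The list() wrapper is redundant around a list comprehension:
devices = [torch.device("cuda:" + str(i)) for i in int_devices]
```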
Oh, I missed that. I was just running the linter on the parts that I was changing. I also made the corresponding changes across the whole test file.
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, DDP results are expected to be the same as the case where no hook was registered. Hence, this won't change the behavior of DDP, and a user can use it as a reference or modify it to log useful information or for other purposes, without affecting DDP behavior.
2. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.
3. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.
4. `quantization_pertensor_hook`: Does quantization per tensor, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.
5. `quantization_perchannel_hook`: Does quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor containing 6 million floats, dividing it into smaller channels (512-float chunks) and quantizing them independently may significantly increase the resolution and result in lower error.

**Test Plan:** python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing; please see #43307

Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
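As a rough illustration of what such a hook looks like and how it is registered, here is a sketch against the public `register_comm_hook` API; the `GradBucket` accessor name and exact future semantics may differ from the version discussed in this stack:

```python
import torch
import torch.distributed as dist

def allreduce_mean_hook(process_group, bucket):
    """Allreduce the bucket's flattened gradients, then divide by world size."""
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    # bucket.buffer() returns the flattened gradients of this bucket in the
    # current public GradBucket API; older versions used a different accessor.
    tensor = bucket.buffer()
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def take_mean(fut):
        # fut.value() is a list holding the single allreduced tensor.
        return fut.value()[0] / world_size

    return fut.then(take_mean)

# Registration on a DistributedDataParallel-wrapped model (sketch):
#   ddp_model.register_comm_hook(state=None, hook=allreduce_mean_hook)
```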
pritamdamania87
left a comment
As discussed offline, let's figure out why the NaN is occurring and whether it's possible to avoid this additional copy.
…to buckets."
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.
It looks like when we re-`initialize_bucketviews` with the value of `future_work`, `Reducer::finalize_bucket_dense` doing `grad.copy_(bucket_view)` wasn't copying the `grads` back to the contents: `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we run multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```c++
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
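To make the intended in/out data flow concrete, here is a minimal sketch in plain `torch` (the real logic lives in `reducer.cpp`; names here are illustrative):

```python
import torch

contents = torch.zeros(8)                          # flat bucket buffer
bucket_views_in = [contents[0:4], contents[4:8]]   # grads get copied in here
bucket_views_out = [contents[0:4], contents[4:8]]  # reduced grads get read out here

# Before reduction: each parameter's grad is written into the bucket,
# i.e. bucket_views_in[i].copy_(grad).
grads = [torch.ones(4), torch.full((4,), 2.0)]
for view, grad in zip(bucket_views_in, grads):
    view.copy_(grad)

# A communication hook would reduce `contents` here; with this fix its result
# is copied back into `contents`, so the out views keep aliasing that storage.

# After reduction: grads are refreshed from the bucket,
# i.e. grad.copy_(bucket_views_out[i]).
for view, grad in zip(bucket_views_out, grads):
    grad.copy_(view)
```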
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)
[ghstack-poisoned]
I see that there is a recent PR on bucket_views, #41954. I've resolved a lot of merge conflicts with that PR. Hope my changes won't cause any trouble.
…to buckets."
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.
It looks like when we re-`initialize_bucketviews` with the value of `future_work`, `Reducer::mark_variable_ready_dense` doing `bucket_view.copy_(grad)` wasn't copying the `grads` back to the contents: `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we run multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)
[ghstack-poisoned]
This pull request has been merged in 769b938.
```diff
 // the autograd engine (AccumulateGrad) will also create gradients
 // matching its layout.
-replica.bucket_views.push_back(
+replica.bucket_views_out.push_back(
```
Sorry that I didn't get a chance to look closely at this PR earlier, but I was wondering: in the regular case (no comm. hook), are bucket_views_out and bucket_views_in essentially the same thing? If so, would it work for bucket_views_out to just point to bucket_views_in when there is no comm. hook?
Yeah, probably that will look better. Please see #43734
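Sketched in plain Python, the aliasing idea agreed on here might look like this (illustrative names; the actual change is the C++ follow-up in #43734):

```python
import torch

contents = torch.zeros(8)
bucket_views_in = [contents[0:4], contents[4:8]]

# Illustrative flag; in the reducer this corresponds to whether a
# communication hook was registered.
comm_hook_registered = False

if comm_hook_registered:
    # The hook path may re-point the output views at the hook's result.
    bucket_views_out = [contents[0:4], contents[4:8]]
else:
    # No hook: the out views can simply alias the in views.
    bucket_views_out = bucket_views_in
```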
…buckets.
Following the additional GH comments on the original PR #43307.
Differential Revision: [D23380288](https://our.internmc.facebook.com/intern/diff/D23380288/)
[ghstack-poisoned]
…buckets. (#43734)
Summary: Pull Request resolved: #43734. Following the additional GH comments on the original PR #43307.
ghstack-source-id: 111327130
Test Plan: Run `python test/distributed/test_c10d.py`
Reviewed By: smessmer
Differential Revision: D23380288
fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
Summary: Pull Request resolved: #72348

**Overview**
#43307 changed `_test_accumulate_gradients_no_sync()` to add a `num_iters` argument. However, I think the change misconstrued the test logic slightly. https://github.com/pytorch/pytorch/blob/61ab04e1db77fd59c940ca4ba34dbfb6afcc6551/torch/testing/_internal/distributed/distributed_test.py#L4369-L4397
- `iteration % num_iters == 0` evaluates to `True` only for `iteration == 0`, since `iteration` comes from `for iteration in range(num_iters)`.
- IIUC, the intention is to alternate between accumulating gradients (using `no_sync()`) and synchronizing gradients normally. In the existing implementation, any iterations following the second one are non-productive since gradients are in sync, meaning it reduces to testing normal DDP.
- This PR changes the check back to `iteration % 2 == 0` to restore the alternating behavior.

Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34011559
Pulled By: awgu
fbshipit-source-id: 4ba771e45b28a343167a324462571e4b8e25ae72
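A minimal sketch of the alternating pattern the test is meant to exercise (hypothetical helper `run`; the real test is `_test_accumulate_gradients_no_sync()` in `distributed_test.py`, and process-group setup is assumed to have been done elsewhere):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int, num_iters: int = 6) -> None:
    # Assumes the process group is already initialized for this rank, e.g.
    # torch.distributed.init_process_group("gloo", rank=rank, world_size=world_size).
    ddp_model = DDP(torch.nn.Linear(10, 10))
    inp = torch.randn(4, 10)

    for iteration in range(num_iters):
        if iteration % 2 == 0:
            # Even iterations: accumulate gradients locally, no allreduce.
            with ddp_model.no_sync():
                ddp_model(inp).sum().backward()
        else:
            # Odd iterations: normal backward; accumulated grads are allreduced.
            ddp_model(inp).sum().backward()
```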
Stack from ghstack:
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.

Looks like when we re-`initialize_bucketviews` with the value of `future_work`, as `Reducer::mark_variable_ready_dense` does `bucket_view.copy_(grad)`, it wasn't copying the `grads` back to the contents, since `bucket_view` wouldn't have any relationship with `contents` after re-initializing it with something else. As we have multiple iterations, this was causing problems.

I solved this by adding two states for `bucket_view`.

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.)

Differential Revision: D23229309