
Conversation

Contributor

@sinannasir sinannasir commented Aug 19, 2020

Stack from ghstack:

I identified a bug in the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.

It looks like when we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` in `Reducer::mark_variable_ready_dense` was no longer copying the grads back into `contents`, since `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we have multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:

    // bucket_views_in[i].copy_(grad) and
    // grad.copy_(bucket_views_out[i])
    // provide convenient ways to move grad data in/out of contents.
    std::vector<at::Tensor> bucket_views_in;
    std::vector<at::Tensor> bucket_views_out;

I included two additional unit tests where we run multiple iterations for better test coverage:

  1. test_accumulate_gradients_no_sync_allreduce_hook
  2. test_accumulate_gradients_no_sync_allreduce_with_then_hook.

Those tests were failing with the old version. (If I ran them for only 2 iterations, they wouldn't fail.)
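
For context, the pattern those two tests exercise looks roughly like the sketch below. This is not the actual test code; the hook body, the helper names, and the `register_comm_hook` call are assumptions about the API rather than what the tests literally do.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def allreduce_mean_hook(process_group, bucket):
    # Allreduce the flattened bucket asynchronously; the `then` callback takes the mean.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    fut = dist.all_reduce(bucket.buffer(), group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0].div_(world_size))

def accumulate_then_sync(model, batches, loss_fn, num_iters):
    # Several iterations: gradients are accumulated locally under no_sync()
    # and only synchronized (through the hook) on the last step.
    # The bug only showed up after a few iterations, hence num_iters > 2.
    ddp = DDP(model)
    ddp.register_comm_hook(state=None, hook=allreduce_mean_hook)
    for step in range(num_iters):
        inputs, targets = batches[step % len(batches)]
        if step < num_iters - 1:
            with ddp.no_sync():
                loss_fn(ddp(inputs), targets).backward()
        else:
            loss_fn(ddp(inputs), targets).backward()
    return [p.grad for p in ddp.parameters()]
```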

Differential Revision: D23229309

sinannasir added a commit that referenced this pull request Aug 19, 2020
I identified a bug in the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.

It looks like re-`initialize_bucketviews` being passed the result of the future was causing the problem.

One easy fix is to simply do a double `copy_` via `bucket.replicas[i].contents.copy_(future_result[i]);`.

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.

Those tests were failing with the old version. (If I ran them for just 2 iterations, they wouldn't fail.) This change fixes the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.

Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)

ghstack-source-id: 110298379
Pull Request resolved: #43307
@sinannasir sinannasir requested a review from rohan-varma August 19, 2020 23:55
@dr-ci

dr-ci bot commented Aug 20, 2020

💊 CI failures summary and remediations

As of commit 20ce201 (more details on the Dr. CI page):


  • 4/4 failures introduced in this PR

🕵️ 4 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (1/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
Auto-merging aten/src/ATen/core/interned_strings.h 
CONFLICT (content): Merge conflict in aten/src/ATen/core/interned_strings.h 
Auto-merging aten/src/ATen/core/aten_interned_strings.h 
Auto-merging aten/src/ATen/BatchingRegistrations.cpp 
CONFLICT (content): Merge conflict in aten/src/ATen/BatchingRegistrations.cpp 
Auto-merging CONTRIBUTING.md 
CONFLICT (content): Merge conflict in CONTRIBUTING.md 
Auto-merging .jenkins/pytorch/test.sh 
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml 
Auto-merging .circleci/config.yml 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build pytorch_linux_xenial_cuda9_2_cudnn7_py3_gcc5_4_build (2/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
Auto-merging aten/src/ATen/core/interned_strings.h 
CONFLICT (content): Merge conflict in aten/src/ATen/core/interned_strings.h 
Auto-merging aten/src/ATen/core/aten_interned_strings.h 
Auto-merging aten/src/ATen/BatchingRegistrations.cpp 
CONFLICT (content): Merge conflict in aten/src/ATen/BatchingRegistrations.cpp 
Auto-merging CONTRIBUTING.md 
CONFLICT (content): Merge conflict in CONTRIBUTING.md 
Auto-merging .jenkins/pytorch/test.sh 
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml 
Auto-merging .circleci/config.yml 
Automatic merge failed; fix conflicts and then commit the result. 

See CircleCI build binary_windows_libtorch_3_7_cpu_release_build (3/4)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

FAILED: third_party/ideep/mkl-dnn/src/cpu/CMakeFiles/dnnl_cpu.dir/gemm_convolution_utils.cpp.obj
[536/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\x64\CMakeFiles\dnnl_cpu_x64.dir\gemm\f32\jit_avx2_kernel_sgemm_kern.cpp.obj 
[537/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\x64\CMakeFiles\dnnl_cpu_x64.dir\gemm\f32\jit_avx2_f32_copy_bt_kern_autogen.cpp.obj 
cl : Command line warning D9025 : overriding '/O2' with '/Od'
[538/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\x64\CMakeFiles\dnnl_cpu_x64.dir\gemm\f32\jit_avx2_f32_copy_bn_kern_autogen.cpp.obj 
cl : Command line warning D9025 : overriding '/O2' with '/Od'
[539/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\x64\CMakeFiles\dnnl_cpu_x64.dir\gemm\f32\jit_avx512_core_f32_copy_an_kern_autogen.cpp.obj 
cl : Command line warning D9025 : overriding '/O2' with '/Od'
[540/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\x64\CMakeFiles\dnnl_cpu_x64.dir\gemm\f32\jit_avx512_core_f32_copy_at_kern_autogen.cpp.obj 
cl : Command line warning D9025 : overriding '/O2' with '/Od'
[541/2279] Building CXX object third_party\ideep\mkl-dnn\src\cpu\CMakeFiles\dnnl_cpu.dir\gemm_convolution_utils.cpp.obj 
FAILED: third_party/ideep/mkl-dnn/src/cpu/CMakeFiles/dnnl_cpu.dir/gemm_convolution_utils.cpp.obj  
arty\pybind11\include /DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -openmp:experimental -DNDEBUG -openmp:experimental  /MP    /wd4800 /wd4068 /wd4305 /wd4551 /wd4244  /MD /O2 /Ob2 /DNDEBUG /w /bigobj -DNDEBUG   -DUSE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -std:c++14 /showIncludes /Fothird_party\ideep\mkl-dnn\src\cpu\CMakeFiles\dnnl_cpu.dir\gemm_convolution_utils.cpp.obj /Fdthird_party\ideep\mkl-dnn\src\cpu\CMakeFiles\dnnl_cpu.dir\ /FS -c ..\..\third_party\ideep\mkl-dnn\src\cpu\gemm_convolution_utils.cpp 
C:\w\b\windows\pytorch\third_party\ideep\mkl-dnn\src\cpu\gemm_convolution_utils.cpp(401) : fatal error C1001: Internal compiler error.
(compiler file 'd:\agent\_work\7\s\src\vctools\Compiler\Utc\src\p2\main.c', line 195)
 To work around this problem, try simplifying or changing the program near the locations listed above.
If possible please provide a repro here: https://developercommunity.visualstudio.com 
Please choose the Technical Support command on the Visual C++ 
 Help menu, or open the Technical Support help file for more information
  cl!RaiseException()+0x69
  cl!RaiseException()+0x69
  cl!CloseTypeServerPDB()+0x22e6b

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_build (4/4)

Step: "(Optional) Merge target branch" (full log | diagnosis details | 🔁 rerun)

Automatic merge failed; fix conflicts and then commit the result.
Auto-merging aten/src/ATen/core/interned_strings.h 
CONFLICT (content): Merge conflict in aten/src/ATen/core/interned_strings.h 
Auto-merging aten/src/ATen/core/aten_interned_strings.h 
Auto-merging aten/src/ATen/BatchingRegistrations.cpp 
CONFLICT (content): Merge conflict in aten/src/ATen/BatchingRegistrations.cpp 
Auto-merging CONTRIBUTING.md 
CONFLICT (content): Merge conflict in CONTRIBUTING.md 
Auto-merging .jenkins/pytorch/test.sh 
Auto-merging .circleci/verbatim-sources/job-specs/binary-job-specs.yml 
Auto-merging .circleci/config.yml 
Automatic merge failed; fix conflicts and then commit the result. 


"""
int_devices = gpus_for_rank(self.world_size)[self.rank][:1]
devices = list([torch.device('cuda:' + str(i)) for i in int_devices])
devices = list([torch.device("cuda:" + str(i)) for i in int_devices])
Contributor

nit: since you're already changing this line, might as well remove the call to list() which is not needed
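
For illustration, the suggested cleanup presumably amounts to dropping the redundant `list()` wrapper, along these lines (a sketch only; `int_devices` is stubbed out here):

```python
import torch

int_devices = [0]  # stub: e.g. the first GPU index assigned to this rank
# A list comprehension already builds a list, so the outer list() call is redundant.
devices = [torch.device("cuda:" + str(i)) for i in int_devices]
```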

Contributor Author

Oh, I missed that. I was just running the linter on the parts I was changing. I've also made the relevant changes across the whole test file.

sinannasir added a commit that referenced this pull request Aug 21, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` on the ``GradBucket`` tensors. Once the gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, the DDP results are expected to be the same as when no hook is registered. Hence, this hook does not change DDP behavior, and users can use it as a reference or modify it to log useful information or for other purposes without affecting DDP behavior. (A hedged sketch of this hook pattern appears after the list.)

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers the ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather, compared to O(log W) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that casts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean. (See the sketch after this list for a rough illustration.)

4\. `quantization_pertensor_hook` does per-tensor quantization, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send the scale and zero_point (two floats per rank) before the quantized tensors.

5\. `quantization_perchannel_hook` does per-channel quantization, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor containing 6 million floats, dividing it into smaller channels (512-float chunks) and quantizing each chunk independently may significantly increase the resolution and result in lower error.
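
To make the hook structure above concrete, here is a hedged sketch of the allreduce pattern in item 1 and the fp16 compression pattern in item 3. This is not the code in this diff; the `bucket.buffer()` accessor and the exact hook signature are assumptions that may differ from the `GradBucket` API at the time.

```python
import torch
import torch.distributed as dist

def allreduce_hook_sketch(process_group, bucket):
    # Allreduce the flattened bucket tensor asynchronously; the `then`
    # callback divides by world size so the result is the mean gradient.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    tensor = bucket.buffer()  # accessor name is an assumption
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()
    return fut.then(lambda f: f.value()[0].div_(world_size))

def fp16_compress_hook_sketch(process_group, bucket):
    # Cast float32 gradients to float16 before communication, allreduce the
    # compressed tensor, then decompress back to float32 and take the mean.
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    compressed = bucket.buffer().to(torch.float16)
    fut = dist.all_reduce(compressed, group=group, async_op=True).get_future()

    def decompress(f):
        return f.value()[0].to(torch.float32).div_(world_size)

    return fut.then(decompress)
```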

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing; please see #43307.
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
Contributor

@pritamdamania87 pritamdamania87 left a comment

As discussed offline, let's figure out why the NaN is occurring and whether it's possible to avoid this additional copy.

sinannasir added a commit that referenced this pull request Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 24, 2020
Pull Request resolved: #43307

I identified a bug in the DDP communication hook while running accuracy benchmarks: I was getting `loss=nan`.

It looks like when we re-`initialize_bucketviews` with the value of `future_work`, the `grad.copy_(bucket_view)` in `Reducer::finalize_bucket_dense` wasn't copying the grads back to `contents`, since `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we have multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```
    // bucket_views_in[i].copy_(grad) and
    // grad.copy_(bucket_views_out[i])
    // provide convenient ways to move grad data in/out of contents.
    std::vector<at::Tensor> bucket_views_in;
    std::vector<at::Tensor> bucket_views_out;
```

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.

Those tests were failing with the old version. (If I ran them for just 2 iterations, they wouldn't fail.) This change fixes the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
ghstack-source-id: 110537709

Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)
@sinannasir
Contributor Author

> As discussed offline, let's figure out why the NaN is occurring and whether it's possible to avoid this additional copy.

It looks like when we re-`initialize_bucketviews` with the value of `future_work`, the `bucket_view.copy_(grad)` in `Reducer::mark_variable_ready_dense` wasn't copying the grads back to `contents`, since `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we have multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:

    // bucket_views_in[i].copy_(grad) and
    // grad.copy_(bucket_views_out[i])
    // provide convenient ways to move grad data in/out of contents.
    std::vector<at::Tensor> bucket_views_in;
    std::vector<at::Tensor> bucket_views_out;

I see that there is a recent PR on bucket_views, #41954. I've resolved a lot of merge conflicts with that PR. Hope my changes won't cause any trouble.

sinannasir added a commit that referenced this pull request Aug 25, 2020
sinannasir added a commit that referenced this pull request Aug 25, 2020
sinannasir added a commit that referenced this pull request Aug 25, 2020
sinannasir added a commit that referenced this pull request Aug 25, 2020
sinannasir added a commit that referenced this pull request Aug 25, 2020
sinannasir added a commit that referenced this pull request Aug 26, 2020
@facebook-github-bot
Contributor

This pull request has been merged in 769b938.

    // the autograd engine (AccumulateGrad) will also create gradients
    // matching its layout.
    replica.bucket_views.push_back(
    replica.bucket_views_out.push_back(
Contributor

Sorry that I didn't get a chance to look closely at this PR earlier, but I was wondering: in the regular case (no comm. hook), are bucket_views_out and bucket_views_in essentially the same thing? If so, would it work for bucket_views_out to just point to bucket_views_in when there is no comm. hook?

Contributor Author

@sinannasir sinannasir Aug 27, 2020

Yeah, probably that will look better. Please see #43734

sinannasir added a commit that referenced this pull request Aug 27, 2020
sinannasir added a commit that referenced this pull request Aug 27, 2020
sinannasir added a commit that referenced this pull request Aug 27, 2020
sinannasir added a commit that referenced this pull request Aug 27, 2020
sinannasir added a commit that referenced this pull request Aug 27, 2020
sinannasir added a commit that referenced this pull request Aug 27, 2020
…buckets.

Following the additional GH comments on the original PR #43307.

Differential Revision: [D23380288](https://our.internmc.facebook.com/intern/diff/D23380288/)

ghstack-source-id: 110876508
Pull Request resolved: #43734
sinannasir added a commit that referenced this pull request Aug 28, 2020
sinannasir added a commit that referenced this pull request Aug 28, 2020
sinannasir added a commit that referenced this pull request Aug 28, 2020
sinannasir added a commit that referenced this pull request Aug 28, 2020
@facebook-github-bot facebook-github-bot deleted the gh/sinannasir/11/head branch August 30, 2020 14:17
sinannasir added a commit that referenced this pull request Sep 3, 2020
… result to buckets."

Following the additional GH comments on the original PR #43307.

Differential Revision: [D23380288](https://our.internmc.facebook.com/intern/diff/D23380288/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Sep 3, 2020
sinannasir added a commit that referenced this pull request Sep 3, 2020
…buckets.

Pull Request resolved: #43734

Following the additional GH comments on the original PR #43307.
ghstack-source-id: 111327130

Differential Revision: [D23380288](https://our.internmc.facebook.com/intern/diff/D23380288/)
facebook-github-bot pushed a commit that referenced this pull request Sep 3, 2020
…buckets. (#43734)

Summary:
Pull Request resolved: #43734

Following the additional GH comments on the original PR #43307.
ghstack-source-id: 111327130

Test Plan: Run `python test/distributed/test_c10d.py`

Reviewed By: smessmer

Differential Revision: D23380288

fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2022
Summary:
Pull Request resolved: #72348

**Overview**
#43307 changed `_test_accumulate_gradients_no_sync()` to add a `num_iters` argument. However, I think the change misconstrued the test logic slightly.

https://github.com/pytorch/pytorch/blob/61ab04e1db77fd59c940ca4ba34dbfb6afcc6551/torch/testing/_internal/distributed/distributed_test.py#L4369-L4397

- `iteration % num_iters == 0` evaluates to `True` only for `iteration == 0`, since `iteration` comes from `for iteration in range(num_iters)`.
- IIUC, the intention is to alternate between accumulating gradients (using `no_sync()`) and synchronizing gradients normally. In the existing implementation, every iteration after the second simply synchronizes gradients, so the test reduces to exercising normal DDP.
- This PR changes the check back to `iteration % 2 == 0` to restore the alternating behavior, as sketched below.
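
For illustration, a minimal sketch of the intended alternation is shown below; `ddp_model`, `inp`, `loss_fn`, and `num_iters` are hypothetical stand-ins rather than names taken from the actual test, and a DDP-wrapped model with an initialized process group is assumed.

```python
for iteration in range(num_iters):
    if iteration % 2 == 0:
        # Even iterations: accumulate gradients locally, skipping the allreduce.
        with ddp_model.no_sync():
            loss_fn(ddp_model(inp)).backward()
    else:
        # Odd iterations: run a normal backward pass, which reduces the
        # accumulated gradients across ranks.
        loss_fn(ddp_model(inp)).backward()
```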

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D34011559

Pulled By: awgu

fbshipit-source-id: 4ba771e45b28a343167a324462571e4b8e25ae72
pytorchmergebot pushed a commit that referenced this pull request Feb 7, 2022
(cherry picked from commit 8492a8b)