Conversation

sinannasir (Contributor) commented Jul 1, 2020

Stack from ghstack:

Sub-tasks 1 and 2 of [39272](#39272)

Summary:

  1. We create a Future-typed comm_hook in the reducer and register the DDP communication hook through the Python API. To avoid using py::object directly in reducer.cpp, PythonCommHook implements the abstract class CommHookInterface, and the hook is stored as std::unique_ptr<CommHookInterface> comm_hook_.
  2. If a hook is registered, the Reducer calls and waits on future_work instead of work, and then retrieves the result of future_work (a registration sketch follows this list).
  3. Documentation is included for all public-facing APIs.
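
For illustration, a minimal, hedged sketch of how items 1 and 2 fit together — this is not code from this PR. The `noop_hook` name and the `get_tensors()` accessor are assumptions for the sketch; the registration call follows the `register_comm_hook` API referenced by the tests below.

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def noop_hook(state, bucket):
    # Create a Future and immediately set the bucket's tensors as its value,
    # mirroring what the test_ddp_comm_hook_future_passing_* tests check.
    fut = torch.futures.Future()
    fut.set_result(bucket.get_tensors())  # get_tensors() is an assumed GradBucket accessor
    return fut

# Hypothetical usage (assumes the process group is already initialized):
# model = DDP(net, device_ids=[rank])
# model.register_comm_hook(state=None, hook=noop_hook)
```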

Differential Revision: D22328310

Test plan:

1. As of v18: `python test/distributed/test_c10d.py`

```
....................s.....s.....................................................s................................
----------------------------------------------------------------------
Ran 114 tests in 215.983s

OK (skipped=3)
```

2. Additional tests in `test/distributed/test_c10d.py`:

  • test_ddp_comm_hook_future_passing_cpu: This unit test verifies whether the Future object is passed properly. The callback function creates a Future object and sets a value on it.
  • _test_ddp_comm_hook_future_passing_gpu: A helper that verifies whether the Future object is passed properly on GPU; the callback function creates a Future object and sets a value on it.
  • test_ddp_comm_hook_future_passing_gpu_gloo: This unit test executes _test_ddp_comm_hook_future_passing_gpu using the Gloo backend.
  • test_ddp_comm_hook_future_passing_gpu_nccl: This unit test executes _test_ddp_comm_hook_future_passing_gpu using the NCCL backend.
  • test_ddp_invalid_comm_hook_init: This unit test makes sure that register_comm_hook properly checks the format of the user-defined hook. The Python hook must be callable. This test also checks whether the bucket annotation is validated if defined (a hypothetical validation sketch follows this list).
  • test_ddp_invalid_comm_hook_return_type: This test checks whether the return annotation is validated if defined. It also checks whether an internal error is thrown if the return type is incorrect and the user hasn't specified any return type annotation.
  • test_ddp_comm_hook_register_just_once: A DDP communication hook can only be registered once. This test validates that the proper error is thrown when register_comm_hook is called more than once.
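
For illustration only, a hypothetical sketch of the kind of checks test_ddp_invalid_comm_hook_init and test_ddp_invalid_comm_hook_return_type exercise; the function name and exact error messages are assumptions, not the implementation in this PR.

```
import inspect
import torch

def _check_comm_hook(hook, bucket_type, future_type=torch.futures.Future):
    # The hook must be callable.
    if not callable(hook):
        raise TypeError("Communication hook must be callable.")
    sig = inspect.signature(hook)
    params = list(sig.parameters.values())
    # If the bucket parameter is annotated, it must match the expected bucket type.
    if len(params) >= 2 and params[1].annotation not in (inspect.Parameter.empty, bucket_type):
        raise ValueError("Communication hook: bucket annotation must be the GradBucket type.")
    # If the return type is annotated, it must be a Future.
    if sig.return_annotation not in (inspect.Signature.empty, future_type):
        raise ValueError("Communication hook: return annotation must be torch.futures.Future.")
```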

sinannasir added a commit that referenced this pull request Jul 1, 2020
Draft for sub-tasks 1 and 2 of [39272](#39272)

Differential Revision: [D22328310](https://our.internmc.facebook.com/intern/diff/D22328310/)

ghstack-source-id: 106939916
Pull Request resolved: #40848
sinannasir marked this pull request as draft July 1, 2020 03:21
dr-ci bot commented Jul 1, 2020

💊 CI failures summary and remediations

As of commit 2d125f7 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed



sinannasir added a commit that referenced this pull request Jul 1, 2020
Pull Request resolved: #40848

Draft for sub-tasks 1 and 2 of [39272](#39272)
ghstack-source-id: 106990707

Differential Revision: [D22328310](https://our.internmc.facebook.com/intern/diff/D22328310/)
Draft for sub-tasks 1 and 2 of [39272](#39272)

Differential Revision: [D22328310](https://our.internmc.facebook.com/intern/diff/D22328310/)

#### Test plan:
As of v8: `python test/distributed/test_c10d.py`
```
................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 108 tests in 202.235s

OK (skipped=3)
```
WIP: 
1. some additional unit tests in `test_c10d.py`
2. benchmark tests to validate that the new code (the `if` conditions before `allreduce`) doesn't slow down today's DDP (a rough timing sketch follows).
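
A rough, hypothetical sketch of the kind of timing comparison item 2 refers to (the helper name and measurement loop are assumptions, not code from this PR); the idea is to run it once with no hook registered and once after registering a hook, then compare the averages.

```
import time
import torch

def average_step_time(model, batches, loss_fn, warmup=5, iters=20):
    # Warm up, then time backward passes (where the reducer and any registered
    # comm hook run) over a fixed number of iterations. Assumes CUDA tensors.
    for x, y in batches[:warmup]:
        loss_fn(model(x), y).backward()
    torch.cuda.synchronize()
    start = time.time()
    for x, y in batches[warmup:warmup + iters]:
        loss_fn(model(x), y).backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters
```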

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Jul 2, 2020
sinannasir added a commit that referenced this pull request Jul 3, 2020
sinannasir added a commit that referenced this pull request Jul 6, 2020
sinannasir added a commit that referenced this pull request Jul 6, 2020
sinannasir added a commit that referenced this pull request Aug 20, 2020
In this diff, we prepared some example DDP communication hooks for [#40848](#40848) (hedged sketches of the first and third hooks follow this list):

1\. `allreduce_hook`: This DDP communication hook simply calls ``allreduce`` on the ``GradBucket`` tensors. Once the gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If the user registers this hook, the DDP results are expected to be the same as in the case where no hook was registered. Hence, this hook doesn't change the behavior of DDP, and the user can use it as a reference or modify it to log useful information or for other purposes without affecting DDP behavior.

2\. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers the ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather, compared to O(log W) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that casts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point (``torch.float16``), and allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook`: does per-tensor quantization using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send the scale and zero_point (two floats per rank) before the quantized tensors.

5\. `quantization_perchannel_hook`: does per-channel quantization, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that, after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor containing 6 million floats, dividing it into smaller channels (chunks of 512 floats) and quantizing each independently may significantly increase the resolution and result in lower error.
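
For illustration, hedged sketches of `allreduce_hook` and `fp16_compress_hook` — not the code in this diff. Assumptions: the hook receives a process group (or None) and a GradBucket, the GradBucket exposes a `get_tensors()` accessor, and the hook returns a `torch.futures.Future` holding the reduced tensors.

```
import torch
import torch.distributed as dist

def allreduce_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    tensor = bucket.get_tensors()[0]              # assumed GradBucket accessor
    dist.all_reduce(tensor, group=group)          # sum gradients across workers
    fut = torch.futures.Future()
    fut.set_result([tensor / world_size])         # `then`-style averaging step
    return fut

def fp16_compress_hook(process_group, bucket):
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    compressed = bucket.get_tensors()[0].to(torch.float16)  # float32 -> float16
    dist.all_reduce(compressed, group=group)
    fut = torch.futures.Future()
    # "decompress": convert back to float32 and take the mean
    fut.set_result([compressed.to(torch.float32) / world_size])
    return fut
```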

Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 20, 2020
sinannasir added a commit that referenced this pull request Aug 21, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848), as described above.

**Test Plan:**
`python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py`
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing; please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 21, 2020
sinannasir added a commit that referenced this pull request Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 22, 2020
sinannasir added a commit that referenced this pull request Aug 24, 2020
…to buckets."


I identified a bug with the DDP communication hook #40848 while I was running accuracy benchmarks: I was getting `loss=nan`.

It looks like passing the result of the future into re-`initialize_bucketviews` was causing the problem.

One easy fix is to simply do a double `copy_` via `bucket.replicas[i].contents.copy_(future_result[i]);`.

I included two additional unit tests where we run multiple iterations for better test coverage (a sketch of the iteration pattern follows):
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.

Those tests were failing with the old version (when run for just 2 iterations, they wouldn't fail). This simple fix resolves the issue by doing the double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
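
For context, a hedged sketch (not the test code itself) of the iteration pattern these two tests exercise, assuming a DDP model with a comm hook already registered and the process group already initialized:

```
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulate_then_sync(model: DDP, batches, loss_fn, num_iters=4):
    # For all but the last iteration, skip gradient synchronization so that
    # gradients accumulate locally; the registered comm hook only fires on the
    # final backward pass, which reduces the accumulated gradients.
    for step, (x, y) in enumerate(batches[:num_iters]):
        if step < num_iters - 1:
            with model.no_sync():
                loss_fn(model(x), y).backward()
        else:
            loss_fn(model(x), y).backward()
```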

Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 25, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 25, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 25, 2020
Pull Request resolved: #43310

In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.
ghstack-source-id: 110620825

Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22937999/)!
sinannasir added a commit that referenced this pull request Aug 25, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 25, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 26, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 26, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing, please see #43307 
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
sinannasir added a commit that referenced this pull request Aug 26, 2020
Pull Request resolved: #43310

In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.
ghstack-source-id: 110706550

Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D22937999/)!
sinannasir added a commit that referenced this pull request Aug 27, 2020
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1\. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If user registers this hook DDP results is expected to be same as the case where no hook was registered. Hence, this won't change behavior of DDP and user can use this as a reference or modify this hook to log useful information or any other purposes while unaffecting DDP behavior.

2\. `allgather_then_aggregate_hook` Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors and its ``then`` callback aggregates the gathered gradient tensors and takes mean. Instead of ``allreduce`` this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook`` although both essentially do the same thing with the gradients.

3\. `fp16_compress_hook` This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors whose type is assumed to be ``torch.float32`` to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once compressed gradient tensors are allreduced, its then callback called ``decompress`` converts the aggregated result back to ``float32`` and takes the mean.

4\. `quantization_pertensor_hook` does quantization per tensor and uses the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html.  Note that we separately send scale and zero_point (two floats per rank) before quantized tensors.

5\. `quantization_perchannel_hook` does quantization per channel similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors such as a tensor that contains 6 million floats quantizing dividing it into smaller channels (512 float chunks) and quantizing independently may significantly increase the resolution and result with lower error.

**Test Plan:**
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py 
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing; please see #43307
Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
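
For completeness, here is a hedged sketch of how one of these hooks could be attached to a DDP-wrapped model via ``register_comm_hook``. It assumes the process group has already been initialized on each rank; the toy model and ``fp16_compress_hook_sketch`` refer to the illustrative sketch above, not to code in this diff.

```python
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes dist.init_process_group(...) has already been called on this rank.
model = DDP(nn.Linear(10, 10))

# `state` is forwarded to the hook as its first argument; None works for
# stateless hooks such as the fp16 sketch above.
model.register_comm_hook(state=None, hook=fp16_compress_hook_sketch)
```
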
facebook-github-bot pushed a commit that referenced this pull request Aug 29, 2020
Summary:
Pull Request resolved: #43310

In this diff, we prepared some example DDP communication hooks [#40848](#40848); the hook descriptions are the same as in the message above.

ghstack-source-id: 110923269

Test Plan:
python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
```
Couldn't download test skip set, leaving all tests enabled...
.....
----------------------------------------------------------------------
Ran 4 tests in 26.724s

OK
```

Internal testing:
```
buck run mode/dev-nosan //caffe2/test/distributed/algorithms/ddp_comm_hooks:test_ddp_hooks
```

Reviewed By: malfet

Differential Revision: D22937999

fbshipit-source-id: 274452e7932414570999cb978ae77a97eb3fb0ec