DDP Communication hook: Fix the way we pass future result to buckets. #43307
Conversation
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`. It looks like passing the result of the future through re-`initialize_bucketviews` was causing the problem. One easy fix is to simply do a double `copy_` via `bucket.replicas[i].contents.copy_(future_result[i]);`.

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.

Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)

[ghstack-poisoned]
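To illustrate the failure mode and the fix described above, here is a minimal sketch in plain `torch` (tensor names and sizes are illustrative, not the reducer's actual data structures):

```python
import torch

# Flat per-bucket buffer and views into it, mimicking the reducer's layout
# (illustrative only, not the real reducer data structures).
contents = torch.zeros(8)
bucket_views = [contents[0:4], contents[4:8]]

# Result returned by the communication hook's future (illustrative values).
future_result = torch.arange(8, dtype=torch.float32)

# The fix described above: copy the result back into the existing buffer,
# so every view into `contents` still observes the reduced values.
contents.copy_(future_result)
assert torch.equal(bucket_views[0], future_result[0:4])

# The buggy pattern: rebinding the views to the result tensor severs their
# link to `contents`; later writes to `contents` no longer reach the views,
# which compounds over multiple iterations.
bucket_views = [future_result[0:4], future_result[4:8]]
contents.zero_()
assert not torch.equal(bucket_views[0], contents[0:4])
```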
💊 CI failures summary and remediations

As of commit 20ce201 (more details on the Dr. CI page):

🕵️ 4 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
test/distributed/test_c10d.py (Outdated)

```diff
 """
 int_devices = gpus_for_rank(self.world_size)[self.rank][:1]
-devices = list([torch.device('cuda:' + str(i)) for i in int_devices])
+devices = list([torch.device("cuda:" + str(i)) for i in int_devices])
```
nit: since you're already changing this line, might as well remove the call to list() which is not needed
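For illustration, the suggested form might look like this (a sketch of the reviewer's suggestion, not the committed change; `int_devices` is given a placeholder value here):

```python
import torch

int_devices = [0]  # illustrative; the test derives this from gpus_for_rank(...)
# The list() wrapper is redundant around a list comprehension:
devices = [torch.device("cuda:" + str(i)) for i in int_devices]
```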
Oh, I missed that. I was just running the linter on the parts that I was changing. I also made the corresponding changes across the whole test file.
In this diff, we prepared some example DDP communication hooks [#40848](#40848):

1. `allreduce_hook`: This DDP communication hook just calls ``allreduce`` using ``GradBucket`` tensors. Once gradient tensors are aggregated across all workers, its ``then`` callback takes the mean and returns the result. If a user registers this hook, DDP results are expected to be the same as the case where no hook was registered. Hence, this won't change the behavior of DDP, and a user can use it as a reference or modify it to log useful information or for other purposes, without affecting DDP behavior.
2. `allgather_then_aggregate_hook`: Similar to ``allreduce_hook``, this hook first gathers ``GradBucket`` tensors, and its ``then`` callback aggregates the gathered gradient tensors and takes the mean. Instead of ``allreduce``, this hook uses ``allgather``. Note that with W workers, both the computation and communication time scale as O(W) for allgather compared to O(logW) for allreduce. Therefore, this hook is expected to be much slower than ``allreduce_hook``, although both essentially do the same thing with the gradients.
3. `fp16_compress_hook`: This DDP communication hook implements a simple gradient compression approach that converts ``GradBucket`` tensors, whose type is assumed to be ``torch.float32``, to half-precision floating point format (``torch.float16``). It allreduces those ``float16`` gradient tensors. Once the compressed gradient tensors are allreduced, its ``then`` callback, called ``decompress``, converts the aggregated result back to ``float32`` and takes the mean.
4. `quantization_pertensor_hook`: Does quantization per tensor, using the idea in https://pytorch.org/docs/master/generated/torch.quantize_per_tensor.html. Note that we separately send scale and zero_point (two floats per rank) before the quantized tensors.
5. `quantization_perchannel_hook`: Does quantization per channel, similar to https://pytorch.org/docs/master/generated/torch.quantize_per_channel.html. The main motivation is that after the initial QSGD study diff, we realized that for considerably large gradient tensors, such as a tensor containing 6 million floats, dividing it into smaller channels (512-float chunks) and quantizing them independently may significantly increase the resolution and result in lower error.

**Test Plan:** python torch/distributed/algorithms/ddp_comm_hooks/test_ddp_hooks.py
```
Couldn't download test skip set, leaving all tests enabled...
.....
Ran 5 tests in 26.724s
OK
```

**P.S.** Ignore the changes in `reducer.cpp` while reviewing; please see #43307

Differential Revision: [D22937999](https://our.internmc.facebook.com/intern/diff/D22937999/)

[ghstack-poisoned]
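As a rough illustration of what such a hook looks like and how it is registered, here is a sketch against the public `register_comm_hook` API; the `GradBucket` accessor name and exact future semantics may differ from the version discussed in this stack:

```python
import torch
import torch.distributed as dist

def allreduce_mean_hook(process_group, bucket):
    """Allreduce the bucket's flattened gradients, then divide by world size."""
    group = process_group if process_group is not None else dist.group.WORLD
    world_size = dist.get_world_size(group)
    # bucket.buffer() returns the flattened gradients of this bucket in the
    # current public GradBucket API; older versions used a different accessor.
    tensor = bucket.buffer()
    fut = dist.all_reduce(tensor, group=group, async_op=True).get_future()

    def take_mean(fut):
        # fut.value() is a list holding the single allreduced tensor.
        return fut.value()[0] / world_size

    return fut.then(take_mean)

# Registration on a DistributedDataParallel-wrapped model (sketch):
#   ddp_model.register_comm_hook(state=None, hook=allreduce_mean_hook)
```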
pritamdamania87
left a comment
As discussed offline, let's figure out why the NaN is occurring and whether it's possible to avoid this additional copy.
…to buckets."
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.
It looks like when we re-`initialize_bucketviews` with the value of `future_work`, `Reducer::finalize_bucket_dense` doing `grad.copy_(bucket_view)` wasn't copying the `grads` back to the contents: `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we run multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```c++
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
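To make the intended in/out data flow concrete, here is a minimal sketch in plain `torch` (the real logic lives in `reducer.cpp`; names here are illustrative):

```python
import torch

contents = torch.zeros(8)                          # flat bucket buffer
bucket_views_in = [contents[0:4], contents[4:8]]   # grads get copied in here
bucket_views_out = [contents[0:4], contents[4:8]]  # reduced grads get read out here

# Before reduction: each parameter's grad is written into the bucket,
# i.e. bucket_views_in[i].copy_(grad).
grads = [torch.ones(4), torch.full((4,), 2.0)]
for view, grad in zip(bucket_views_in, grads):
    view.copy_(grad)

# A communication hook would reduce `contents` here; with this fix its result
# is copied back into `contents`, so the out views keep aliasing that storage.

# After reduction: grads are refreshed from the bucket,
# i.e. grad.copy_(bucket_views_out[i]).
for view, grad in zip(bucket_views_out, grads):
    grad.copy_(view)
```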
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)
[ghstack-poisoned]
I see that there is a recent PR on bucket_views, #41954. I've resolved a lot of merge conflicts with that PR. Hope my changes won't cause any trouble.
…to buckets."
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.
It looks like when we re-`initialize_bucketviews` with the value of `future_work`, `Reducer::mark_variable_ready_dense` doing `bucket_view.copy_(grad)` wasn't copying the `grads` back to the contents: `bucket_view` no longer has any relationship with `contents` after being re-initialized with something else. As we run multiple iterations, this was causing problems.
I solved this by adding two states for `bucket_view`:
```
// bucket_views_in[i].copy_(grad) and
// grad.copy_(bucket_views_out[i])
// provide convenient ways to move grad data in/out of contents.
std::vector<at::Tensor> bucket_views_in;
std::vector<at::Tensor> bucket_views_out;
```
I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`.
Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.) This fix resolves the issue by doing a double `copy_`. From my recent observations, the performance regression from the double `copy_` is negligible. If a better solution is not obvious, we should probably land this quickly and open a separate issue to think more about the problem and a better solution.
Differential Revision: [D23229309](https://our.internmc.facebook.com/intern/diff/D23229309/)
[ghstack-poisoned]
This pull request has been merged in 769b938.
```diff
 // the autograd engine (AccumulateGrad) will also create gradients
 // matching its layout.
-replica.bucket_views.push_back(
+replica.bucket_views_out.push_back(
```
Sorry that I didn't get a chance to look closely at this PR earlier, but I was wondering: in the regular case (no comm. hook), are bucket_views_out and bucket_views_in essentially the same thing? If so, would it work for bucket_views_out to just point to bucket_views_in when there is no comm. hook?
Yeah, probably that will look better. Please see #43734
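Sketched in plain Python, the aliasing idea agreed on here might look like this (illustrative names; the actual change is the C++ follow-up in #43734):

```python
import torch

contents = torch.zeros(8)
bucket_views_in = [contents[0:4], contents[4:8]]

# Illustrative flag; in the reducer this corresponds to whether a
# communication hook was registered.
comm_hook_registered = False

if comm_hook_registered:
    # The hook path may re-point the output views at the hook's result.
    bucket_views_out = [contents[0:4], contents[4:8]]
else:
    # No hook: the out views can simply alias the in views.
    bucket_views_out = bucket_views_in
```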
…buckets.
Following the additional GH comments on the original PR #43307.
Differential Revision: [D23380288](https://our.internmc.facebook.com/intern/diff/D23380288/)
[ghstack-poisoned]
…buckets. (#43734)
Summary: Pull Request resolved: #43734. Following the additional GH comments on the original PR #43307.
ghstack-source-id: 111327130
Test Plan: Run `python test/distributed/test_c10d.py`
Reviewed By: smessmer
Differential Revision: D23380288
fbshipit-source-id: 4b8889341c57b3701f0efa4edbe1d7bbc2a82ced
Summary: Pull Request resolved: #72348

**Overview**
#43307 changed `_test_accumulate_gradients_no_sync()` to add a `num_iters` argument. However, I think the change misconstrued the test logic slightly. https://github.com/pytorch/pytorch/blob/61ab04e1db77fd59c940ca4ba34dbfb6afcc6551/torch/testing/_internal/distributed/distributed_test.py#L4369-L4397
- `iteration % num_iters == 0` evaluates to `True` only for `iteration == 0`, since `iteration` comes from `for iteration in range(num_iters)`.
- IIUC, the intention is to alternate between accumulating gradients (using `no_sync()`) and synchronizing gradients normally. In the existing implementation, any iterations following the second one are non-productive since gradients are in sync, meaning it reduces to testing normal DDP.
- This PR changes the check back to `iteration % 2 == 0` to restore the alternating behavior.

Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D34011559
Pulled By: awgu
fbshipit-source-id: 4ba771e45b28a343167a324462571e4b8e25ae72
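A minimal sketch of the alternating pattern the test is meant to exercise (hypothetical helper `run`; the real test is `_test_accumulate_gradients_no_sync()` in `distributed_test.py`, and process-group setup is assumed to have been done elsewhere):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int, num_iters: int = 6) -> None:
    # Assumes the process group is already initialized for this rank, e.g.
    # torch.distributed.init_process_group("gloo", rank=rank, world_size=world_size).
    ddp_model = DDP(torch.nn.Linear(10, 10))
    inp = torch.randn(4, 10)

    for iteration in range(num_iters):
        if iteration % 2 == 0:
            # Even iterations: accumulate gradients locally, no allreduce.
            with ddp_model.no_sync():
                ddp_model(inp).sum().backward()
        else:
            # Odd iterations: normal backward; accumulated grads are allreduced.
            ddp_model(inp).sum().backward()
```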
Stack from ghstack:
I identified a bug with the DDP communication hook while I was trying accuracy benchmarks: I was getting `loss=nan`.

Looks like when we re-`initialize_bucketviews` with the value of `future_work`, as `Reducer::mark_variable_ready_dense` does `bucket_view.copy_(grad)`, it wasn't copying the `grads` back to the contents, since `bucket_view` wouldn't have any relationship with `contents` after re-initializing it with something else. As we have multiple iterations, this was causing problems.

I solved this by adding two states for `bucket_view`.

I included two additional unit tests where we run multiple iterations for better test coverage:
1) `test_accumulate_gradients_no_sync_allreduce_hook`
2) `test_accumulate_gradients_no_sync_allreduce_with_then_hook`

Those tests were failing with the old version. (If I run it for just 2 iterations, it wouldn't fail.)

Differential Revision: D23229309