[NCCL] [For Test] In DDP's reducer merge work and future_work
#41840
Conversation
1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

[ghstack-poisoned]
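In short, the change makes each reducer bucket hold only the future returned by `getFuture()` instead of a separate `Work` handle. Below is a minimal, hypothetical sketch of that shape; the struct and helper names are made up for illustration and this is not the actual `Reducer` code in reducer.cpp.

```
// Hypothetical, simplified sketch -- not the actual Reducer implementation.
#include <c10d/ProcessGroup.hpp>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

struct Bucket {
  std::vector<at::Tensor> tensors;                      // flattened gradient contents
  c10::intrusive_ptr<c10::ivalue::Future> future_work;  // replaces the old `work` handle
};

// Launch the allreduce and keep only the future returned by getFuture().
void launch_allreduce(c10d::ProcessGroup& pg, Bucket& bucket) {
  bucket.future_work = pg.allreduce(bucket.tensors)->getFuture();
}

// At the end of the backward pass, block on the future instead of work->wait().
void finalize_bucket(Bucket& bucket) {
  bucket.future_work->wait();
}
```

The diff excerpts quoted in the review comments below show the corresponding call sites in reducer.cpp.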
💊 CI failures summary and remediations. As of commit 861496c (more details on the Dr. CI page):
🕵️ 6 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages:
We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272):

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with the `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with a WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated; the alternative is `cudaLaunchHostFunc`, but that is only available in CUDA 10 and later, and according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr) `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available.

Differential Revision: [D22583690](https://our.internmc.facebook.com/intern/diff/D22583690/)

#### Test plan:

Run the old `python test/distributed/test_c10d.py`, plus some additional tests:

`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case with no hook registered. Without the `then` callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.

`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies that a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:

```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```

`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840), [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

[ghstack-poisoned]
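To illustrate point 4 above: a rough sketch, under stated assumptions, of how a host callback queued with `cudaStreamAddCallback` could mark the work's future as completed once the collective's stream has drained. `StreamCallbackContext` and `markFutureCompleted` are made-up names; the actual `WorkNCCL` implementation handles errors, devices, and streams differently.

```
#include <cuda_runtime.h>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Context handed to the stream callback: the future to complete and the
// collective's output tensors (illustrative names, not the real WorkNCCL fields).
struct StreamCallbackContext {
  c10::intrusive_ptr<c10::ivalue::Future> future;
  std::vector<at::Tensor> outputs;
};

// Host callback: CUDA runs this after all previously enqueued work on the
// stream has finished, so the collective's results are ready at this point.
void CUDART_CB markFutureCompleted(cudaStream_t /*stream*/,
                                   cudaError_t status,
                                   void* userData) {
  auto* ctx = static_cast<StreamCallbackContext*>(userData);
  if (status == cudaSuccess) {
    ctx->future->markCompleted(c10::IValue(ctx->outputs));
  }
  // Error propagation to the future is omitted in this sketch.
  delete ctx;
}

// After enqueuing the NCCL collective on `ncclStream` (illustrative usage):
//   auto* ctx = new StreamCallbackContext{future, outputs};
//   cudaStreamAddCallback(ncclStream, markFutureCompleted, ctx, /*flags=*/0);
```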
Looking at the results, there seems to be a 1-1.5% regression in general, and in the case of world_size=2 there seems to be a significant regression at p90 and p95. I guess we can't land this PR without gloo support, so I'd suggest leaving the PR in the current state; we can have someone pick this work up once we start working on gloo support.
```
// #41266.
if (comm_hook_ == nullptr) {
  bucket.work = process_group_->allreduce(tensors);
  bucket.future_work = process_group_->allreduce(tensors)->getFuture();
```
Could we get rid of this if-else completely and register a hook in the DDP constructor that does allreduce by default? Then we should allow users to override the hook as many times as they wish (I think we currently only allow the hook to be registered once).
Included a new class named AllreduceHook and registered it as the default comm hook.
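Roughly, a default hook of that shape could look like the sketch below; `CommHookInterface` and `runHook` are placeholder names rather than the reducer's exact hook interface, so treat this as an outline of the idea only.

```
#include <c10d/ProcessGroup.hpp>
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Placeholder hook interface for illustration; the real reducer API differs.
class CommHookInterface {
 public:
  virtual ~CommHookInterface() = default;
  // Takes the bucket's tensors and returns a future holding the reduced result.
  virtual c10::intrusive_ptr<c10::ivalue::Future> runHook(
      std::vector<at::Tensor>& tensors) = 0;
};

class AllreduceHook : public CommHookInterface {
 public:
  explicit AllreduceHook(c10d::ProcessGroup* process_group)
      : process_group_(process_group) {}

  // Default behaviour: a plain allreduce, returning the collective's future so
  // the reducer can treat hooked and un-hooked buckets uniformly.
  c10::intrusive_ptr<c10::ivalue::Future> runHook(
      std::vector<at::Tensor>& tensors) override {
    return process_group_->allreduce(tensors)->getFuture();
  }

 private:
  c10d::ProcessGroup* process_group_;  // not owned
};
```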
```
bucket.future_work->wait();
```
```
// See Note [DDP Communication Hook]
if (comm_hook_ != nullptr) {
```
Instead of checking comm_hook_ here, we can check whether future_result is the same tensor as replica.contents; if so, we can skip initializing bucket views, since we know it is not a new tensor. The is_same API can be used for this: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/templates/TensorBody.h#L194
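For illustration, the suggestion amounts to something like the following sketch. `future_result` and `replica.contents` follow the names used in this thread, but the helper itself is hypothetical and not the reducer's actual code.

```
#include <ATen/ATen.h>
#include <ATen/core/ivalue.h>

// Illustrative only: decide whether bucket views need re-initializing after a
// comm hook runs. Tensor::is_same checks whether two Tensors share the same
// TensorImpl, i.e. whether the hook handed back the bucket's own flat tensor.
bool needs_bucket_view_rebuild(const c10::IValue& future_result,
                               const at::Tensor& replica_contents) {
  std::vector<at::Tensor> output = future_result.toTensorVector();
  return output.empty() || !output[0].is_same(replica_contents);
}
```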
I think that would require future_value.toTensorVector first. Instead, I am doing something like this; please let me know if it isn't good:
```
if (!dynamic_cast<AllreduceHook*>(comm_hook_.get())) {
```
…e_work`"

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.

Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)

#### Test plan:

1. Run `python test/distributed/test_c10d.py`.

```
ss..........ss.sss...s.ssssss.s.ss.......s...ssssssssssssssssssssssssssssssssssssssssssss........sssssss..........s.
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=70)
```

2. `flow-cli` performance tests using the NCCL backend.

* Example:

```
flow-cli canary pytorch.benchmark.main.workflow --parameters-json '{"model_arch": "resnet50", "batch_size": 32, "world_size": 2, "use_fp16": false, "print_percentile": true, "backend": "nccl"}' --entitlement pytorch_ftw_gpu --name test_torchelastic_nccl_getFutureD22660198 --run-as-secure-group pytorch_distributed
```

* **World size = 1, master:** f206781394, f206947522

```
sum: 8 GPUs: p25: 0.109 293/s p50: 0.109 293/s p75: 0.109 293/s p90: 0.109 292/s p95: 0.109 292/s
fwds: 8 GPUs: p25: 0.032 1002/s p50: 0.032 999/s p75: 0.032 995/s p90: 0.032 991/s p95: 0.032 989/s
bwds: 8 GPUs: p25: 0.074 431/s p50: 0.074 430/s p75: 0.075 429/s p90: 0.075 428/s p95: 0.075 427/s
opts: 8 GPUs: p25: 0.003 12007/s p50: 0.003 11565/s p75: 0.003 11377/s p90: 0.003 11325/s p95: 0.003 11299/s
```

```
sum: 8 GPUs: p25: 0.109 293/s p50: 0.109 293/s p75: 0.109 293/s p90: 0.109 292/s p95: 0.109 292/s
fwds: 8 GPUs: p25: 0.032 1005/s p50: 0.032 999/s p75: 0.032 993/s p90: 0.032 989/s p95: 0.032 987/s
bwds: 8 GPUs: p25: 0.074 431/s p50: 0.074 430/s p75: 0.075 428/s p90: 0.075 427/s p95: 0.075 426/s
opts: 8 GPUs: p25: 0.003 12085/s p50: 0.003 11663/s p75: 0.003 11451/s p90: 0.003 11364/s p95: 0.003 11326/s
```

* **World size = 1, getFuture API:** f206788430, f206945375

```
sum: 8 GPUs: p25: 0.120 266/s p50: 0.121 265/s p75: 0.121 263/s p90: 0.122 262/s p95: 0.123 260/s
fwds: 8 GPUs: p25: 0.032 989/s p50: 0.033 983/s p75: 0.033 976/s p90: 0.033 971/s p95: 0.033 968/s
bwds: 8 GPUs: p25: 0.079 404/s p50: 0.081 394/s p75: 0.083 384/s p90: 0.085 378/s p95: 0.085 374/s
opts: 8 GPUs: p25: 0.005 6285/s p50: 0.007 4482/s p75: 0.009 3535/s p90: 0.012 2710/s p95: 0.012 2580/s
```

```
sum: 8 GPUs: p25: 0.120 267/s p50: 0.120 266/s p75: 0.121 264/s p90: 0.121 263/s p95: 0.122 261/s
fwds: 8 GPUs: p25: 0.032 995/s p50: 0.032 989/s p75: 0.033 983/s p90: 0.033 978/s p95: 0.033 975/s
bwds: 8 GPUs: p25: 0.076 419/s p50: 0.078 408/s p75: 0.080 397/s p90: 0.082 390/s p95: 0.083 384/s
opts: 8 GPUs: p25: 0.008 4086/s p50: 0.010 3210/s p75: 0.011 2836/s p90: 0.012 2762/s p95: 0.012 2691/s
```

* **World size = 2, master:** f206922196

```
sum: 16 GPUs: p25: 0.121 263/s p50: 0.122 261/s p75: 0.125 256/s p90: 0.126 253/s p95: 0.127 252/s
fwds: 16 GPUs: p25: 0.032 1004/s p50: 0.032 998/s p75: 0.032 993/s p90: 0.032 989/s p95: 0.032 987/s
bwds: 16 GPUs: p25: 0.087 369/s p50: 0.088 365/s p75: 0.090 355/s p90: 0.091 350/s p95: 0.092 346/s
opts: 16 GPUs: p25: 0.003 12120/s p50: 0.003 11587/s p75: 0.003 11379/s p90: 0.003 11185/s p95: 0.003 11126/s
```

* **World size = 2, getFuture API:** f206921378

```
sum: 16 GPUs: p25: 0.126 253/s p50: 0.127 252/s p75: 0.127 251/s p90: 0.128 249/s p95: 0.130 247/s
fwds: 16 GPUs: p25: 0.032 997/s p50: 0.032 990/s p75: 0.033 983/s p90: 0.033 978/s p95: 0.033 976/s
bwds: 16 GPUs: p25: 0.091 353/s p50: 0.091 349/s p75: 0.092 347/s p90: 0.093 344/s p95: 0.094 338/s
opts: 16 GPUs: p25: 0.003 12083/s p50: 0.003 11569/s p75: 0.003 11300/s p90: 0.007 4846/s p95: 0.009 3543/s
```

* **Memory:**
* **World size = 2, master:** f206922196

```
CUDA Memory Summary After first iteration:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| Active memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 3490 MB | 3490 MB | 3490 MB | 0 B |
| from large pool | 3434 MB | 3434 MB | 3434 MB | 0 B |
| from small pool | 56 MB | 56 MB | 56 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 315332 KB | 343472 KB | 2295 MB | 1987 MB |
| from large pool | 311166 KB | 340158 KB | 2239 MB | 1935 MB |
| from small pool | 4166 KB | 4334 KB | 56 MB | 52 MB |
|---------------------------------------------------------------------------|
| Allocations | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| Active allocs | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 102 | 102 | 102 | 0 |
| from large pool | 74 | 74 | 74 | 0 |
| from small pool | 28 | 28 | 28 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 34 | 54 | 431 | 397 |
| from large pool | 15 | 48 | 208 | 193 |
| from small pool | 19 | 19 | 223 | 204 |
|===========================================================================|
```

* **World size = 2, getFuture API:** f206921378

```
CUDA Memory Summary After first iteration:
|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| Active memory | 428091 KB | 2892 MB | 9825 MB | 9407 MB |
| from large pool | 374913 KB | 2874 MB | 9752 MB | 9386 MB |
| from small pool | 53178 KB | 52 MB | 73 MB | 21 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 3490 MB | 3490 MB | 3490 MB | 0 B |
| from large pool | 3434 MB | 3434 MB | 3434 MB | 0 B |
| from small pool | 56 MB | 56 MB | 56 MB | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 315332 KB | 343472 KB | 2295 MB | 1987 MB |
| from large pool | 311166 KB | 340158 KB | 2239 MB | 1935 MB |
| from small pool | 4166 KB | 4334 KB | 56 MB | 52 MB |
|---------------------------------------------------------------------------|
| Allocations | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| Active allocs | 704 | 705 | 1390 | 686 |
| from large pool | 60 | 131 | 395 | 335 |
| from small pool | 644 | 645 | 995 | 351 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 102 | 102 | 102 | 0 |
| from large pool | 74 | 74 | 74 | 0 |
| from small pool | 28 | 28 | 28 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 34 | 54 | 431 | 397 |
| from large pool | 15 | 48 | 208 | 193 |
| from small pool | 19 | 19 | 223 | 204 |
|===========================================================================|
```

[ghstack-poisoned]
Pull Request resolved: #41840

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue [#41266](#41266) once the `get_future` API supports the gloo and mpi backends.
4. We register a default communication hook, AllreduceHook, with the reducer.
5. We now allow the DDP communication hook to be registered multiple times.
6. Removed the test_ddp_comm_hook_register_just_once unit test.

ghstack-source-id: 108410963
Differential Revision: [D22660198](https://our.internmc.facebook.com/intern/diff/D22660198/)
Summary:
Pull Request resolved: #41596

We've modified the previous design of the `convert_dist_work_to_future` API from GH Issue [#39272](#39272):

1. Whenever we create a `WorkNCCL` object, create a `Future` associated with the `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. This API will only be supported by NCCL in the first version; the default implementation will throw UnsupportedOperation.
4. To mark the future associated with a WorkNCCL completed, implement a `cudaStreamCallback` function. `cudaStreamAddCallback` is marked as deprecated; the alternative is `cudaLaunchHostFunc`, but that is only available in CUDA 10 and later, and according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr) `cudaStreamAddCallback` is unlikely to be removed until a reasonable alternative is available.

ghstack-source-id: 108409748

Test Plan: Run the old `python test/distributed/test_c10d.py`, plus some additional tests:

`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case with no hook registered. Without the `then` callback, the future_value in the reducer is no longer a PyObject, and this unit test verifies that future_value is properly checked.

`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies that a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.

As of v10:

```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests

OK (skipped=3)
```

`flow-cli` performance validation using a stacked diff where `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](#41840), [D22660198](https://www.internalfb.com/intern/diff/D22660198/).

Reviewed By: izdeby

Differential Revision: D22583690

fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a
@pritamdamania87, @sinannasir, should this PR be closed?
@ngimel Sure, we can close this out; it was mostly kept as a reference for once we support all the backends. We can reuse this PR later, as and when we add support for all backends.
Stack from ghstack:
* #41840 [NCCL] [For Test] In DDP's reducer merge `work` and `future_work`

1. Only NCCL's `get_future` API is supported, so the gloo tests are disabled.
2. I'll use this PR to test the flow-cli performance of NCCL's `get_future` API and validate that there is no speed regression.
3. This PR will also solve Issue #41266 once the `get_future` API supports the gloo and mpi backends.

Differential Revision: D22660198
Test plan:
Run `python test/distributed/test_c10d.py`.