all_gather supporting single tensor output #82639
kwen2501 wants to merge 5 commits into pytorch:master
Conversation
❌ 3 New Failures, 2 Pending as of commit 9dad87f (more details on the Dr. CI page)
🕵️ 3 new failures recognized by patterns. The following CI failures do not appear to be due to upstream breakages.
```python
        [tensor([1.+1.j, 2.+2.j]), tensor([3.+3.j, 4.+4.j])] # Rank 1
    """
    # `tensor_list` would be understood as "tensor or list."
```
Maybe it is best to name it more verbosely, `tensor_or_list`, to avoid it being understood as "list of tensors"?
Thanks for the suggestion. I changed the name to `output` to be generic (and also to indicate it is an output).
In the documentation, I added that `output` can be a tensor or a list of tensors.
```python
        [tensor([1.+1.j, 2.+2.j]), tensor([3.+3.j, 4.+4.j])] # Rank 1
    """
    # `tensor_list` would be understood as "tensor or list."
```
Should the doc be updated? Also, I'm not sure how much of a fan I am of an argument possibly having multiple types, with the behavior changing based on the type -- it seems that it can be surprising / confusing to users. But if we think this is the best way to consolidate the API, then I am open to it.
Thanks for the suggestion. I updated the document. Can you please take a look? Thank you!
Also filling in a bit of context:

- In a recent discussion with @ptrblck @crcrpar regarding use of the base feature in Megatron, it was identified that unifying the two APIs was preferred to productizing `_all_gather_base`. Also per the initial idea to unify them, as mentioned by @zhaojuanmao when `_all_gather_base` was first implemented: [torch distributed] Implementing all_gather_base (#56304) #56315 (review)
- I also discussed with @ptrblck and @crcrpar ways of unifying the two APIs.
  Method 1: have `all_gather` accept a single tensor.
  Method 2: have `all_gather` accept a list containing a single tensor.
  It appears to me that Method 1 is cleaner in semantics -- when the user passes in a single tensor, it is explicit that they are asking to merge the input tensors into a (larger) output tensor. Also, since this is something the user passes in rather than a return value of the function, there is no need for the user to figure out what the return is (a list or a tensor).
- Regarding why a single function name:
  I want to give users an experience where they only need to think about what output format they want, without additionally having to think about which function to call (especially when there are already several `all_gather_*` APIs). In retrospect, it may also seem that the list implementation "hijacked" the `all_gather` name, whereas the single-tensor implementation can be more natural, smoother in application use flow, and more performant. Hence my thought of giving the tensor output the "first-class" name of `all_gather`.

I am happy to hear comments on the above thoughts.
```python
    # `tensor_list` would be understood as "tensor or list."
    # If it is a single tensor, we would route it to the _all_gather_base implementation.
    if isinstance(tensor_list, torch.Tensor):
        return _all_gather_base(tensor_list, tensor, group, async_op)
```
How about a new flag for dist.all_gather, such as inplace=True? When it is provided, we document that the first arg should be an appropriately sized tensor, and we fill the result into that.
This PR is tangential to in-place vs out-of-place. It is more about in which format the user wants to receive the all-gather result.
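As a communication-free sketch of the routing discussed above (not the actual PyTorch implementation -- `_fake_all_gather_base` and `_fake_all_gather_list` are hypothetical stand-ins for the real collectives, and every "rank" contributes the same local tensor for illustration):

```python
import torch

def _fake_all_gather_base(output, tensor, world_size):
    # Hypothetical stand-in for _all_gather_base: write each rank's
    # shard contiguously into one pre-allocated output tensor.
    n = tensor.numel()
    for rank in range(world_size):
        output[rank * n:(rank + 1) * n] = tensor
    return output

def _fake_all_gather_list(output, tensor):
    # Hypothetical stand-in for the list-based all_gather: copy the
    # local tensor into each rank's slot of the output list.
    for shard in output:
        shard.copy_(tensor)
    return output

def all_gather_sketch(output, tensor, world_size=2):
    # The routing from the diff above: a bare tensor selects the
    # single-tensor path; a list selects the classic list path.
    if isinstance(output, torch.Tensor):
        return _fake_all_gather_base(output, tensor, world_size)
    return _fake_all_gather_list(output, tensor)
```

Either way the caller decides the output format purely by the type they pass in, which is the point of the unification.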
crcrpar left a comment:
Thank you for the PR. I dropped some newbie questions :)
```diff
-def all_gather(tensor_list, tensor, group=None, async_op=False):
+def all_gather(output, tensor, group=None, async_op=False):
```
IMHO this looks backward-compatibility breaking, and I'm wondering if we could just make _all_gather_base public.
Hi @crcrpar, I started writing my reply to @rohan-varma on why the unification and why such signature before seeing your comment. Do you mind checking if that responds to the second part of your question? Please feel free to let me know if you have any comment about that thought process.
Regarding backward compatibility, I will think more about it.
```python
    # `output` can be a single tensor or a list of tensors.
    # If it is a single tensor, we route it to the _all_gather_base implementation.
    if isinstance(output, torch.Tensor):
        return _all_gather_base(output, tensor, group, async_op)
```
To my knowledge, _all_gather_base is not a custom op in terms of TorchScript, while all_gather is, according to #79669.
Could this mean that the one function would behave slightly differently depending on the output type?
We can make _all_gather_base a custom op in the backend too. That only concerns the internal implementation.
### Description

- This PR renames `_all_gather_base` to `all_gather_into_tensor` so that it is clearer in meaning.
- The `all_gather_into_tensor` API differs from the `all_gather` API in the output it accepts -- a single, large tensor instead of a list of tensors.
- This PR also adds a deprecation warning to `_all_gather_base`.

### Issue

`_all_gather_base` was implemented in #33924 to avoid unnecessary flattening. There was a previous effort (#82639) to merge `_all_gather_base` with the existing `all_gather` API by detecting the parameter type passed in for the output. There are, however, two "blockers" that make the merge difficult:
(i) The merge leads to a backward compatibility break. We would need to change the parameter name `tensor_list` in `all_gather` to a general name `output` that can cover both a tensor and a tensor list.
(ii) Recently, the `all_gather` API added uneven tensor support, utilizing the tensor boundaries implied by the list. We are, however, not sure about adding such support to the `_all_gather_base` function, because that would require users to pass in additional tensor boundary information.
In view of the above, we decided to productize `_all_gather_base` as a separate function, but with a clearer name.

### Testing

Added tests:
- `test_all_gather_into_cat_tensor_cuda` -- output form as with `torch.cat`. For example:
```
>>> tensor_in
tensor([1, 2], device='cuda:0') # Rank 0
tensor([3, 4], device='cuda:1') # Rank 1
>>> tensor_out
tensor([1, 2, 3, 4], device='cuda:0') # Rank 0
tensor([1, 2, 3, 4], device='cuda:1') # Rank 1
```
- `test_all_gather_into_stack_tensor_cuda` -- output form as with `torch.stack`. For example:
```
>>> tensor_out2
tensor([[1, 2],
        [3, 4]], device='cuda:0') # Rank 0
tensor([[1, 2],
        [3, 4]], device='cuda:1') # Rank 1
```
The output form is determined by the shape of the output tensor passed by the user; no flag is used.

Cc @rohan-varma @mrshenli @crcrpar @ptrblck @H-Huang
Pull Request resolved: #85686
Approved by: https://github.com/rohan-varma, https://github.com/crcrpar
Closing this PR since an alternative solution (#85686) has landed.
Description
This PR unifies the `all_gather` API and the `_all_gather_base` API by having the former support both a single tensor and a list of tensors as output.

Issue
Following the request to avoid unnecessary flattening in #33924, `_all_gather_base` was implemented. That API, however, has never been productized, due to the concern of creating multiple all-gather variants. Thus, instead of productizing `_all_gather_base`, this PR uses `all_gather` to cover both, because in the traditional sense (as in MPI or NCCL), `all_gather` can indeed stand for gathering multiple small buffers into a single big buffer.

Testing
Added tests:
`test_all_gather_single_output_tensor_cuda`

Context:
In a recent discussion with @ptrblck @crcrpar regarding use of the base feature in Megatron, it was identified that unifying the two APIs was preferred to productizing `_all_gather_base`. Also per the initial idea to unify them, as mentioned by @zhaojuanmao when `_all_gather_base` was first implemented: #56315 (review)
There are two ways of unifying the two APIs:
Method 1: have `all_gather` accept a single tensor.
Method 2: have `all_gather` accept a list containing a single tensor.
It appears to me that Method 1 is cleaner in semantics -- when the user passes in a single tensor, it is explicit that they are asking to merge the input tensors into a (larger) output tensor. Also, since this is something the user passes in rather than a return value of the function, there is no need for the user to figure out what the return is (a list or a tensor).
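As a shape-only illustration of that "merging" semantics (no communication happens here, and the shard values are made up), Method 1's single-tensor output is exactly the `torch.cat` of the list a list-based all_gather would return:

```python
import torch

# Per-rank shards as a list-based all_gather (Method 2 style) would
# deliver them: one tensor per rank.
shards = [torch.tensor([1, 2]), torch.tensor([3, 4])]

# Method 1: the caller hands in one larger tensor and receives the
# shards merged contiguously -- the same result as torch.cat(shards).
merged = torch.cat(shards)
```

So the single-tensor form carries the same information without requiring the caller to concatenate afterward.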
Regarding why a single function name:
I want to give users an experience where they only need to think about what output format they want, without additionally having to think about which function to call (especially when there are already several all_gather_* APIs). In retrospect, it may also seem that the list implementation "hijacked" the all_gather name, whereas the single-tensor implementation can be more natural, smoother in application use flow, and more performant. Hence my thought of giving the tensor output the "first-class" name of all_gather.