More idist allreduce/gather test fixes#3302

Closed
vfdev-5 wants to merge 4 commits into master from more-gpu-test-fixes

Conversation

@vfdev-5
Collaborator

@vfdev-5 vfdev-5 commented Nov 7, 2024

Description:

  • fixed all_reduce with group implementation
  • fixed and rewrote tests: _test_distrib_all_reduce_group, _test_distrib_all_gather_group, etc.

Based on #3301

@github-actions github-actions bot added module: metrics Metrics module module: distributed Distributed module labels Nov 7, 2024
@vfdev-5 vfdev-5 mentioned this pull request Feb 22, 2025
@sadra-barikbin
Collaborator

sadra-barikbin commented Feb 24, 2025

@vfdev-5, it seems that the HVD test test_idist_all_gather_hvd fails because new_group() does not return dist.GroupMember.NON_GROUP_MEMBER in the HVD model when the rank does not belong to the group, although dist.new_group does.

if isinstance(group, list) and all(isinstance(item, int) for item in group):
    group = _model.new_group(group)

def _do_new_group(self, ranks: List[int], **kwargs: Any) -> Any:
    return hvd.ProcessSet(ranks)

By the way, doesn't HVD actually support all_gather with process groups?

https://github.com/horovod/horovod/blob/80780e10d97271472fc6f08cab630ad78d9c5aeb/horovod/torch/mpi_ops.py#L652
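The mismatch described above can be sketched with stand-in types (hypothetical, not the real torch.distributed / horovod APIs): the native backend returns a NON_GROUP_MEMBER sentinel to ranks outside the group, while the HVD model hands every rank a ProcessSet-like object, member or not.

```python
# Hypothetical stand-ins illustrating the behavioural difference.
NON_GROUP_MEMBER = object()  # stands in for dist.GroupMember.NON_GROUP_MEMBER


def native_new_group(ranks, my_rank):
    # torch.distributed behaviour: non-member ranks get the sentinel
    if my_rank not in ranks:
        return NON_GROUP_MEMBER
    return tuple(ranks)


class ProcessSet:
    # stands in for hvd.ProcessSet
    def __init__(self, ranks):
        self.ranks = list(ranks)


def hvd_new_group(ranks, my_rank):
    # HVD model behaviour: every rank gets a ProcessSet, member or not
    return ProcessSet(ranks)


# Rank 3 is outside the group [0, 1]:
assert native_new_group([0, 1], my_rank=3) is NON_GROUP_MEMBER
assert isinstance(hvd_new_group([0, 1], my_rank=3), ProcessSet)
```

So any caller that checks only for the sentinel will never take the "not a member" branch under the HVD model.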

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 24, 2025

By the way, doesn't HVD actually support all_gather with process groups?

We raise an error for now if a group is passed as an arg:

if group is not None:
    raise NotImplementedError("all_reduce with group for horovod is not implemented")

One error in the CI is:

[0]<stderr>:  File "/work/tests/ignite/distributed/utils/__init__.py", line 347, in _test_idist_all_gather_tensors_with_shapes_group
[0]<stderr>:    tensors = all_gather_tensors_with_shapes(rank_tensor, [[r + 2, r + 3, r + 4] for r in ranks], ranks)
[0]<stderr>:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:  File "/work/ignite/distributed/utils.py", line 395, in all_gather_tensors_with_shapes
[0]<stderr>:    padded_tensor = torch.nn.functional.pad(
[0]<stderr>:                    ^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:  File "/opt/conda/lib/python3.11/site-packages/torch/nn/functional.py", line 5096, in pad
[0]<stderr>:    return torch._C._nn.pad(input, pad, mode, value)
[0]<stderr>:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[0]<stderr>:RuntimeError: Padding length should be less than or equal to two times the input dimension but got padding length 6 and input of dimension 1

and we check it with

            with pytest.raises(NotImplementedError, match=r"all_gather with group for horovod is not implemented"):
                tensors = all_gather_tensors_with_shapes(rank_tensor, [[r + 2, r + 3, r + 4] for r in ranks], ranks)

Maybe something else is going on?

@sadra-barikbin
Collaborator

In all_gather_tensors_with_shapes, a group is first created if the arg is given, and it equals dist.GroupMember.NON_GROUP_MEMBER for non-participating processes. For HVD, however, the resulting group is a ProcessSet regardless. When padding is done, which takes place before allgather, it tries to pad the 1-dim input tensor, which is torch.tensor([NON_PARTICIPATING_RANK]), with a padding-size argument of 3 dimensions (3 * 2, for left and right padding), and that raises the error.
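The padding failure can be reproduced without torch: F.pad accepts at most two pad values per input dimension, so a 1-dim tensor cannot take a 6-element pad spec. A minimal sketch of that invariant (pure Python; the helper name is hypothetical):

```python
# Minimal sketch of the invariant torch.nn.functional.pad enforces.
# check_pad is a hypothetical helper, not a torch API.
def check_pad(input_ndim: int, pad: list) -> None:
    # F.pad raises when the pad spec covers more dims than the input has:
    # len(pad) must be <= 2 * input_ndim (a left and a right value per dim).
    if len(pad) > 2 * input_ndim:
        raise RuntimeError(
            f"Padding length should be less than or equal to two times the "
            f"input dimension but got padding length {len(pad)} "
            f"and input of dimension {input_ndim}"
        )


# Participating rank: 3-dim tensor, 6-element pad spec -> fine.
check_pad(input_ndim=3, pad=[0, 1, 0, 2, 0, 3])

# Non-participating rank under HVD: the tensor stays 1-dim
# (torch.tensor([rank])), but the pad spec still has 6 elements -> error.
try:
    check_pad(input_ndim=1, pad=[0, 1, 0, 2, 0, 3])
except RuntimeError as e:
    print(e)
```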

@sadra-barikbin
Collaborator

sadra-barikbin commented Feb 24, 2025

I mean, the if statement below is not enough for Horovod, as the resulting group is always a ProcessSet:

if isinstance(_model, _SerialModel) or group == dist.GroupMember.NON_GROUP_MEMBER:
    return [tensor]

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 24, 2025

I mean, the if statement below is not enough for Horovod, as the resulting group is always a ProcessSet:

if isinstance(_model, _SerialModel) or group == dist.GroupMember.NON_GROUP_MEMBER:
    return [tensor]

Actually, it is weird that we use the backend-specific dist.GroupMember.NON_GROUP_MEMBER in the generic all_gather_tensors_with_shapes method...

@sadra-barikbin
Collaborator

So maybe we could replace it with something like this:

# In idist:

def is_in_group(group):
    if group is dist.GroupMember.NON_GROUP_MEMBER:
        return False
    if isinstance(group, hvd.ProcessSet):
        return group.included()
    return True

@vfdev-5
Collaborator Author

vfdev-5 commented Feb 26, 2025

So we could replace it with something like this maybe:

# In idist:

def is_in_group(group):
    if group is dist.GroupMember.NON_GROUP_MEMBER:
        return False
    if isinstance(group, hvd.ProcessSet):
        return group.included()
    return True

Yes, this makes sense. I would just make it private and name it like in PyTorch: _rank_not_in_group.
Sadra, can you please send a PR with this update?

@vfdev-5 vfdev-5 force-pushed the more-gpu-test-fixes branch from ed934ef to 06836ba Compare March 19, 2025 09:01
@github-actions github-actions bot added the ci CI label Mar 19, 2025
@vfdev-5
Collaborator Author

vfdev-5 commented Mar 27, 2025

I think we can close this PR.

@vfdev-5 vfdev-5 closed this Mar 27, 2025
@vfdev-5 vfdev-5 deleted the more-gpu-test-fixes branch March 27, 2025 11:37