The result of gloo all_gather is wrong #20421

@qijianan777

Description

I use gloo for model parallelism. When I call all_gather, the result is wrong.

There are two processes. I expect the all_gather result to be [Tensor1, Tensor2], but it is actually [Tensor1, Tensor1].
Tensor2 looks like this:
[screenshot: contents of Tensor2]
The gathered result looks like this; the second tensor in the result should equal Tensor2, but it does not:
[screenshot: contents of the gathered result]
However, when I instead create the two tensors with torch.reshape(torch.tensor(range(xxxx), dtype=torch.float32), [16, 16, 16]) and all_gather those, the result is correct.
The code is:

    gather_tensor = []
    for _ in range(stage.get_devices_num()):
        gather_tensor.append(torch.zeros_like(in_slice))
    dist.all_gather(gather_tensor, in_slice.contiguous(), group=group)
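For reference, here is a minimal, self-contained sketch of the setup (a sketch only: the port, the 4096-element shape, and the mp.spawn harness are illustrative assumptions, not taken from the real code, where stage, in_slice, and group come from the model-parallel framework):

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        # Illustrative rendezvous; any init_method would do.
        dist.init_process_group(backend="gloo",
                                init_method="tcp://127.0.0.1:29500",
                                rank=rank, world_size=world_size)
        full = torch.reshape(torch.tensor(range(4096), dtype=torch.float32),
                             [16, 16, 16])
        # Each rank keeps one chunk along dim 0. Chunks are views into `full`
        # with a nonzero storage offset, and are already contiguous.
        in_slice = torch.chunk(full, world_size, dim=0)[rank]
        gather_tensor = [torch.zeros_like(in_slice) for _ in range(world_size)]
        # Passing in_slice.clone() instead of in_slice.contiguous() forces a
        # fresh tensor and may sidestep the problem, since contiguous() is a
        # no-op on an already-contiguous view.
        dist.all_gather(gather_tensor, in_slice.contiguous())
        expected = torch.chunk(full, world_size, dim=0)
        for i in range(world_size):
            print(rank, i, torch.equal(gather_tensor[i], expected[i]))
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)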

Environment:
macOS
PyTorch 1.0.1
pytorch-cpu 1.1.0
NumPy 1.16.2

PS: We use torch.chunk to split the tensor along dim 0 and then all_gather the chunked tensors with gloo; that is when the all_gather result is wrong. I think that although I call contiguous() to make the memory contiguous, it has no effect after chunking the tensor at dim 0.
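A small sketch of why contiguous() has no effect here (shapes are illustrative): a chunk taken along dim 0 is already contiguous, so contiguous() returns the same view without copying; the view still points into the parent's storage at a nonzero offset, whereas clone() always produces a fresh tensor.

    import torch

    full = torch.arange(8, dtype=torch.float32).reshape(4, 2)
    second = torch.chunk(full, 2, dim=0)[1]   # view into `full`

    print(second.is_contiguous())             # True: dim-0 chunks stay contiguous
    print(second.contiguous() is second)      # True: contiguous() is a no-op here
    print(second.storage_offset())            # 4: nonzero offset into parent storage
    print(second.clone().storage_offset())    # 0: clone() copies into fresh storage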

Metadata

Labels

oncall: distributed (add this issue/PR to the distributed oncall triage queue)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
