Description
I use gloo for model parallelism, and when I use all_gather the result is wrong.
There are two processes. I expect the all_gather result to be [Tensor1, Tensor2], but the result is actually [Tensor1, Tensor1].
Tensor2 looks like this (screenshot omitted):

The result looks like this (screenshot omitted); the second tensor in the result should be equal to Tensor2.
However, when I instead create the two tensors directly with torch.reshape(torch.tensor(range(xxxx), dtype=torch.float32), [16, 16, 16]) and all_gather them, the result is correct.
The code is:

import torch
import torch.distributed as dist

gather_tensor = []
for _ in range(stage.get_devices_num()):
    gather_tensor.append(torch.zeros_like(in_slice))
dist.all_gather(gather_tensor, in_slice.contiguous(), group=group)

Environment:
macOS
pytorch 1.0.1
pytorch-cpu 1.1.0
numpy 1.16.2
PS: We use torch.chunk to split the tensor along dim 0 and then all_gather the chunked tensor with gloo, and the all_gather result is wrong. I think that although I call contiguous() to make the memory contiguous, it is not effective after chunking the tensor at dim 0.
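
For reference, here is a minimal standalone sketch of what I believe reproduces the situation described above: two processes on the gloo backend, each chunking the same tensor along dim 0 and gathering its own slice. The process-group setup (MASTER_ADDR/MASTER_PORT, mp.spawn with world_size=2) and the [16, 16, 16] shape are assumptions made to keep the snippet self-contained; the original stage/in_slice/group objects are replaced by local variables.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size=2):
    # Single-machine rendezvous; the address and port are arbitrary choices for this sketch.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Build the same full tensor on every rank, then keep this rank's chunk along dim 0.
    full = torch.reshape(torch.tensor(range(16 * 16 * 16), dtype=torch.float32),
                         [16, 16, 16])
    in_slice = torch.chunk(full, world_size, dim=0)[rank]

    # Gather every rank's slice; with correct behaviour, gather_tensor[i] should
    # equal rank i's chunk of `full`.
    gather_tensor = [torch.zeros_like(in_slice) for _ in range(world_size)]
    dist.all_gather(gather_tensor, in_slice.contiguous())

    for i in range(world_size):
        expected = torch.chunk(full, world_size, dim=0)[i]
        print("rank", rank, "slot", i, "matches:", torch.equal(gather_tensor[i], expected))

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, nprocs=2)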