The result of gloo all_gather is wrong #20421

@qijianan777

Description

I use gloo for model parallelism. When I call all_gather, the result is wrong.

There are two processes. I expect the all_gather result to be [Tensor1, Tensor2], but it is actually [Tensor1, Tensor1].
Tensor2 looks like this:
[screenshot: contents of Tensor2]
The gathered result looks like this; the second tensor in the result should equal Tensor2, but it does not:
[screenshot: contents of the gathered result]
However, when I instead create the two tensors with torch.reshape(torch.tensor(range(xxxx), dtype=torch.float32), [16, 16, 16]) and all_gather those, the result is correct.
The code is:

    gather_tensor = []
    for _ in range(stage.get_devices_num()):
        gather_tensor.append(torch.zeros_like(in_slice))
    dist.all_gather(gather_tensor, in_slice.contiguous(), group=group)
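For reference, here is a minimal, self-contained sketch of the setup (a sketch only: the port, the 4096-element shape, and the mp.spawn harness are illustrative assumptions, not taken from the real code, where stage, in_slice, and group come from the model-parallel framework):

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size):
        # Illustrative rendezvous; any init_method would do.
        dist.init_process_group(backend="gloo",
                                init_method="tcp://127.0.0.1:29500",
                                rank=rank, world_size=world_size)
        full = torch.reshape(torch.tensor(range(4096), dtype=torch.float32),
                             [16, 16, 16])
        # Each rank keeps one chunk along dim 0. Chunks are views into `full`
        # with a nonzero storage offset, and are already contiguous.
        in_slice = torch.chunk(full, world_size, dim=0)[rank]
        gather_tensor = [torch.zeros_like(in_slice) for _ in range(world_size)]
        # Passing in_slice.clone() instead of in_slice.contiguous() forces a
        # fresh tensor and may sidestep the problem, since contiguous() is a
        # no-op on an already-contiguous view.
        dist.all_gather(gather_tensor, in_slice.contiguous())
        expected = torch.chunk(full, world_size, dim=0)
        for i in range(world_size):
            print(rank, i, torch.equal(gather_tensor[i], expected[i]))
        dist.destroy_process_group()

    if __name__ == "__main__":
        mp.spawn(run, args=(2,), nprocs=2)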

Environment:
macOS
PyTorch 1.0.1
pytorch-cpu 1.1.0
NumPy 1.16.2

PS: We use torch.chunk to split the tensor along dim 0 and then all_gather the chunked tensors with gloo; that is when the all_gather result is wrong. I think that although I call contiguous() to make the memory contiguous, it has no effect after chunking the tensor at dim 0.
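A small sketch of why contiguous() has no effect here (shapes are illustrative): a chunk taken along dim 0 is already contiguous, so contiguous() returns the same view without copying; the view still points into the parent's storage at a nonzero offset, whereas clone() always produces a fresh tensor.

    import torch

    full = torch.arange(8, dtype=torch.float32).reshape(4, 2)
    second = torch.chunk(full, 2, dim=0)[1]   # view into `full`

    print(second.is_contiguous())             # True: dim-0 chunks stay contiguous
    print(second.contiguous() is second)      # True: contiguous() is a no-op here
    print(second.storage_offset())            # 4: nonzero offset into parent storage
    print(second.clone().storage_offset())    # 0: clone() copies into fresh storage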

Metadata

Labels

oncall: distributed (add this issue/PR to the distributed oncall triage queue)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
