
Distributed data parallel: gloo backend works, but nccl deadlocks #14870

@htlchh

Description


I have encountered a problem when using DistributedDataParallel with multiple processes.
I use two machines, a Docker container A and a physical machine B:
A is the master, with init_method = 'tcp://ip_A:free_port', and B can reach the address 'tcp://ip_A:free_port' via telnet.
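Not part of the original report, but the telnet reachability check above can also be done with a small socket probe from machine B before launching (the function name and arguments here are illustrative):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. run can_connect("ip_A", free_port) on machine B; if this is False,
# init_process_group on rank > 0 will hang regardless of backend.
```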

With the gloo backend it works well, but with the nccl backend it gets stuck.
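Not in the original report, but since the hang only appears with the nccl backend, a common first diagnostic is to turn on NCCL's own logging, and, because node A runs in Docker, to pin the network interface NCCL should use. A sketch; the interface name "eth0" is a placeholder, not from the report:

```python
import os
from typing import Optional

def enable_nccl_debug(interface: Optional[str] = None) -> None:
    # NCCL_DEBUG=INFO makes each rank print NCCL's transport and
    # topology decisions, which usually shows where the hang occurs.
    os.environ["NCCL_DEBUG"] = "INFO"
    if interface is not None:
        # Force NCCL onto a specific interface; inside Docker, NCCL can
        # otherwise pick a virtual interface the peer node cannot reach.
        os.environ["NCCL_SOCKET_IFNAME"] = interface

# Call before dist.init_process_group(...), e.g.:
# enable_nccl_debug("eth0")
```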
Runtime environment:
- Python 3.6 + pytorch-1.0-dev
- DistributedDataParallel with multiprocessing
- node A: P40 GPUs, node B: K40 GPUs
- 2 nodes, 4 processes per node, each process running on a single GPU
- world-size 2 before multiplying by GPUs per node (8 after), ranks 0-7
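For reference, the rank and world-size arithmetic implied by this setup (2 nodes, 4 GPUs each) can be sketched as plain functions; the names are illustrative, but the formulas mirror the code below:

```python
def global_world_size(nnodes: int, ngpus_per_node: int) -> int:
    # mirrors: args.world_size = ngpus_per_node * args.world_size
    return nnodes * ngpus_per_node

def global_rank(node_rank: int, ngpus_per_node: int, local_gpu: int) -> int:
    # mirrors: args.rank = args.rank * ngpus_per_node + gpu
    return node_rank * ngpus_per_node + local_gpu

# 2 nodes with 4 GPUs each -> world size 8, ranks 0..7
ranks = [global_rank(n, 4, g) for n in range(2) for g in range(4)]
```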

Configuration and code:

    # in main():
    if args.multiprocessing_distributed:
        args.world_size = ngpus_per_node * args.world_size
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        main_worker(args.gpu, ngpus_per_node, args)

    # in main_worker(gpu, ngpus_per_node, args):
    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    torch.cuda.set_device(args.gpu)
    model.cuda(args.gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
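One detail worth noting in the snippet above: the batch size is divided across the processes on a node, so each GPU sees an equal share of the per-node batch. A minimal sketch of that arithmetic (the function name is mine, not from the report):

```python
def per_process_batch_size(batch_size: int, ngpus_per_node: int) -> int:
    # mirrors: args.batch_size = int(args.batch_size / ngpus_per_node)
    return int(batch_size / ngpus_per_node)

# e.g. a batch size of 256 on a 4-GPU node gives 64 samples per process
```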

Labels: oncall: distributed