
Distributed data parallel: gloo backend works, but nccl deadlocks #14870

@htlchh

Description


I have encountered a problem when using DistributedDataParallel with multiple processes.
I use two machines, a Docker container A and a physical machine B:
A is the master, with init_method = 'tcp://ip_A:free_port', and B can reach the address 'tcp://ip_A:free_port' via telnet.
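Not part of the original report, but the telnet reachability check above can also be done with a small socket probe from machine B before launching (the function name and arguments here are illustrative):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. run can_connect("ip_A", free_port) on machine B; if this is False,
# init_process_group on rank > 0 will hang regardless of backend.
```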

With the gloo backend it works well, but with the nccl backend it gets stuck.
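Not in the original report, but since the hang only appears with the nccl backend, a common first diagnostic is to turn on NCCL's own logging, and, because node A runs in Docker, to pin the network interface NCCL should use. A sketch; the interface name "eth0" is a placeholder, not from the report:

```python
import os
from typing import Optional

def enable_nccl_debug(interface: Optional[str] = None) -> None:
    # NCCL_DEBUG=INFO makes each rank print NCCL's transport and
    # topology decisions, which usually shows where the hang occurs.
    os.environ["NCCL_DEBUG"] = "INFO"
    if interface is not None:
        # Force NCCL onto a specific interface; inside Docker, NCCL can
        # otherwise pick a virtual interface the peer node cannot reach.
        os.environ["NCCL_SOCKET_IFNAME"] = interface

# Call before dist.init_process_group(...), e.g.:
# enable_nccl_debug("eth0")
```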
Runtime environment:
- Python 3.6 + pytorch-1.0-dev
- DistributedDataParallel with multiprocessing
- node A: P40 GPUs, node B: K40 GPUs
- 2 nodes, 4 processes per node, each process running on a single GPU
- world-size 2 before multiplying by GPUs per node (8 after), ranks 0-7
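For reference, the rank and world-size arithmetic implied by this setup (2 nodes, 4 GPUs each) can be sketched as plain functions; the names are illustrative, but the formulas mirror the code below:

```python
def global_world_size(nnodes: int, ngpus_per_node: int) -> int:
    # mirrors: args.world_size = ngpus_per_node * args.world_size
    return nnodes * ngpus_per_node

def global_rank(node_rank: int, ngpus_per_node: int, local_gpu: int) -> int:
    # mirrors: args.rank = args.rank * ngpus_per_node + gpu
    return node_rank * ngpus_per_node + local_gpu

# 2 nodes with 4 GPUs each -> world size 8, ranks 0..7
ranks = [global_rank(n, 4, g) for n in range(2) for g in range(4)]
```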

Configuration and code:

    # in main():
    if args.multiprocessing_distributed:
        args.world_size = ngpus_per_node * args.world_size
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        main_worker(args.gpu, ngpus_per_node, args)

    # in main_worker(gpu, ngpus_per_node, args):
    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    torch.cuda.set_device(args.gpu)
    model.cuda(args.gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
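One detail worth noting in the snippet above: the batch size is divided across the processes on a node, so each GPU sees an equal share of the per-node batch. A minimal sketch of that arithmetic (the function name is mine, not from the report):

```python
def per_process_batch_size(batch_size: int, ngpus_per_node: int) -> int:
    # mirrors: args.batch_size = int(args.batch_size / ngpus_per_node)
    return int(batch_size / ngpus_per_node)

# e.g. a batch size of 256 on a 4-GPU node gives 64 samples per process
```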

Labels: oncall: distributed