Description
I have encountered a problem when using DistributedDataParallel with multiple processes.
I use two machines: a Docker machine A and a physical machine B.
A is the master with init_method = 'tcp://ip_A:free_port', and B can telnet to the address 'ip_A:free_port'.
With the gloo backend everything works well, but with the nccl backend it gets stuck.
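One thing worth checking when nccl hangs between a Docker container and a physical host is which network interface NCCL selects. A minimal sketch of forcing the interface and turning on NCCL logging before init_process_group is called (the interface name eth0 is an assumption and must match the NIC that actually routes between A and B):

    import os

    # assumption: replace "eth0" with the interface that connects node A and node B
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
    # print NCCL setup/transport logs to stdout
    os.environ.setdefault("NCCL_DEBUG", "INFO")
    # optionally rule out InfiniBand transport issues
    os.environ.setdefault("NCCL_IB_DISABLE", "1")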
Runtime environment:
python 3.6 + pytorch-1.0-dev
DistributedDataParallel with multiprocessing
node A: P40 GPUs, node B: K40 GPUs
2 nodes, 4 processes per node, each process running on a single GPU
world-size argument 2 (one per node), scaled to 8 processes with ranks 0-7
Configuration and code:
    # launcher: spawn one worker process per GPU on this node
    if args.multiprocessing_distributed:
        # scale the world size from "number of nodes" to "total number of processes"
        args.world_size = ngpus_per_node * args.world_size
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        main_worker(args.gpu, ngpus_per_node, args)

    # worker process (gpu = local GPU index passed by mp.spawn)
    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            # global rank = node rank * GPUs per node + local GPU index
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    torch.cuda.set_device(args.gpu)
    model.cuda(args.gpu)
    args.batch_size = int(args.batch_size / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
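To isolate the hang from the training script, a minimal standalone check of the rendezvous plus a single collective looks roughly like this (a sketch only: ip_A and free_port are placeholders from the setup above, and the script would be launched once per GPU on each node with the matching global rank):

    # minimal sketch: test nccl rendezvous + one all_reduce, independent of the model code
    import argparse
    import torch
    import torch.distributed as dist

    parser = argparse.ArgumentParser()
    parser.add_argument("--rank", type=int, required=True)        # global rank, 0-7 here
    parser.add_argument("--world-size", type=int, default=8)
    parser.add_argument("--local-gpu", type=int, required=True)   # GPU index on this node
    args = parser.parse_args()

    torch.cuda.set_device(args.local_gpu)
    dist.init_process_group(backend="nccl",
                            init_method="tcp://ip_A:free_port",   # placeholder address
                            world_size=args.world_size, rank=args.rank)

    # if this all_reduce completes on every rank, the nccl transport itself is fine
    t = torch.ones(1, device=args.local_gpu) * args.rank
    dist.all_reduce(t)
    print(f"rank {args.rank}: all_reduce result = {t.item()}")

If this small test also hangs with nccl but passes with gloo, the problem is in the nccl transport between the two machines rather than in the DistributedDataParallel setup.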