🐛 Bug
Possible root cause for #45435. CC @walterddr
Thanks to @jaglinux for the following triage information.
For the barrier call, the allreduce uses a tensor on a CUDA device chosen by the following formula:
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp, line 1413 (at a49367e):

```cpp
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
```
But when the tests are called, they use a different formula for the rank-to-GPU selection:
pytorch/torch/testing/_internal/distributed/distributed_test.py, lines 367 to 374 (at a49367e):

```python
nGPUs_per_process = nGPUs // world_size
rank_to_GPU = {
    i: list(
        visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
    )
    for i in range(world_size)
}
return rank_to_GPU
```
Hence, for ranks 0, 1, 2, barrier uses tensors on cuda:0, cuda:1, cuda:2, while the tests use tensors on cuda:0, cuda:2, cuda:4.
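To make the mismatch concrete, here is a small self-contained sketch (world_size = 3 and 6 visible GPUs are assumed values, picked only for illustration) that evaluates both mappings side by side:

```python
world_size = 3
num_gpus = 6
visible_devices = list(range(num_gpus))

# Device index picked by the barrier's allreduce in ProcessGroupNCCL.cpp:
# rank_ % numGPUs.
barrier_device = {rank: rank % num_gpus for rank in range(world_size)}

# Devices picked by the tests via rank_to_GPU in distributed_test.py:
# a contiguous per-rank slice of size nGPUs // world_size.
n_gpus_per_process = num_gpus // world_size
rank_to_gpu = {
    i: visible_devices[i * n_gpus_per_process:(i + 1) * n_gpus_per_process]
    for i in range(world_size)
}

for rank in range(world_size):
    print(f"rank {rank}: barrier uses cuda:{barrier_device[rank]}, "
          f"tests use cuda:{rank_to_gpu[rank][0]}")
# rank 0: barrier uses cuda:0, tests use cuda:0
# rank 1: barrier uses cuda:1, tests use cuda:2
# rank 2: barrier uses cuda:2, tests use cuda:4
```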
If we change the rank_to_GPU data structure in distributed_test.py so that each rank maps to rank % numGPUs, all the tests pass.
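A minimal sketch of what that change could look like (the helper name rank_to_gpu_map is hypothetical, and the layout of the remaining GPUs in each rank's list is my assumption; only the first device per rank is dictated by the rank % numGPUs requirement):

```python
def rank_to_gpu_map(world_size, nGPUs):
    # Hedged sketch, not the actual patch: the first GPU in each rank's list is
    # rank % nGPUs, matching the device the barrier's allreduce selects; the
    # remaining devices are interleaved, which is an assumption for illustration.
    nGPUs_per_process = nGPUs // world_size
    return {
        i: [(i + j * world_size) % nGPUs for j in range(nGPUs_per_process)]
        for i in range(world_size)
    }

# With world_size=3 and 6 GPUs: {0: [0, 3], 1: [1, 4], 2: [2, 5]}
print(rank_to_gpu_map(3, 6))
```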
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd