
DDP mismatch in rank to GPU selection #47629

@jeffdaily

🐛 Bug

Possible root cause for #45435. CC @walterddr
Thanks to @jaglinux for the following triage information.

For the barrier call, the all-reduce uses a CUDA tensor whose device is selected with the following formula:

int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
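
As a minimal illustration (assuming world_size = 3 and 6 visible GPUs; these numbers are only for this example), the barrier formula maps ranks to devices as follows:

# Sketch of the device index the barrier picks per rank (rank_ % numGPUs),
# assuming world_size = 3 and numGPUs = 6 for illustration.
num_gpus = 6
world_size = 3
barrier_device = {rank: rank % num_gpus for rank in range(world_size)}
print(barrier_device)  # {0: 0, 1: 1, 2: 2} -> cuda:0, cuda:1, cuda:2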

But when the tests run, the tensors use a different formula to map ranks to GPUs:

nGPUs_per_process = nGPUs // world_size
rank_to_GPU = {
    i: list(
        visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
    )
    for i in range(world_size)
}
return rank_to_GPU

Hence, for ranks 0, 1, 2, barrier uses tensors on cuda:0, cuda:1, cuda:2, while the tests use tensors on cuda:0, cuda:2, cuda:4.
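
To make the mismatch concrete, here is a small sketch (again assuming 6 visible GPUs and world_size = 3) of the mapping the test helper produces:

# Sketch of the test helper's mapping, assuming 6 visible GPUs and world_size = 3.
visible_devices = list(range(6))
world_size = 3
nGPUs_per_process = len(visible_devices) // world_size  # 2

# Contiguous slices of GPUs per rank.
rank_to_GPU = {
    i: visible_devices[i * nGPUs_per_process:(i + 1) * nGPUs_per_process]
    for i in range(world_size)
}
# rank_to_GPU == {0: [0, 1], 1: [2, 3], 2: [4, 5]}
# The tests place tensors on the first device of each list: cuda:0, cuda:2, cuda:4.
# Compare with the barrier mapping above: cuda:0, cuda:1, cuda:2.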

If we change the rank_to_GPU data structure in distributed_test.py to select devices with rank % numGPUs, all the tests pass.
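
A minimal sketch of what such a change could look like, keeping the dict-of-lists shape the tests expect while making each rank's first device equal to rank % numGPUs (the exact patch to distributed_test.py may differ; the helper name below is hypothetical):

def rank_to_gpu_map(world_size, visible_devices):
    # Hypothetical helper, not the actual patch: rotate visible_devices so that
    # rank i's first device is visible_devices[i % numGPUs], matching the
    # rank_ % numGPUs selection used by barrier().
    num_gpus = len(visible_devices)
    return {
        i: visible_devices[i % num_gpus:] + visible_devices[:i % num_gpus]
        for i in range(world_size)
    }

# Example: rank_to_gpu_map(3, list(range(6)))
# -> {0: [0, 1, 2, 3, 4, 5], 1: [1, 2, 3, 4, 5, 0], 2: [2, 3, 4, 5, 0, 1]}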

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd

Labels: module: ddp, oncall: distributed, triaged
