Don't run NCCL/gloo distributed test without GPUs #150764
Conversation
If there aren't any GPUs, the WORLD_SIZE would be zero, which does not work. So skip those backends completely in that case.
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150764
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1c75299 with merge base 7cae790. This comment was automatically generated by Dr. CI and updates every 15 minutes.
test/run_test.py (Outdated)
-    if dist.is_gloo_available():
+    if dist.is_gloo_available() and num_gpus > 0:
         DISTRIBUTED_TESTS_CONFIG["gloo"] = {
             # TODO: retire testing gloo with CUDA
-            "WORLD_SIZE": f"{torch.cuda.device_count()}",
+            "WORLD_SIZE": f"{num_gpus}",
As you mentioned, maybe we can remove the num_gpus > 0 condition for Gloo, and set WORLD_SIZE to min(4, num_gpus)?
I'm not sure about the "4". And min would still yield zero for missing GPUs, which was the reason for this PR.
The previous world size was 3 and still is for MPI. The current intention (before this PR) is to use 1 rank per GPU.
So I'd use cuda_world_size = torch.cuda.device_count() or 3, or even max(3, torch.cuda.device_count()). That might skip some tests on single- and dual-GPU systems but likely matches the previous and intended behavior.
What do you think?
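To make the trade-off concrete, here is a small hypothetical comparison of the two formulas; neither function is code from this PR, just an illustration of the arithmetic:

```python
# Hypothetical comparison of the two proposed world-size formulas.
def world_size_or_fallback(num_gpus: int) -> int:
    # "torch.cuda.device_count() or 3": fall back to 3 only when there are no GPUs.
    return num_gpus or 3

def world_size_at_least_three(num_gpus: int) -> int:
    # "max(3, torch.cuda.device_count())": never below 3, even on 1- or 2-GPU hosts,
    # so tests that need world_size <= GPU count could be skipped there.
    return max(3, num_gpus)

for gpus in (0, 1, 2, 4):
    print(gpus, world_size_or_fallback(gpus), world_size_at_least_three(gpus))
# 0 -> 3 3, 1 -> 1 3, 2 -> 2 3, 4 -> 4 4
```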
I just want to remove the and num_gpus > 0 condition for Gloo, so that we would still run CPU tests on it.
I see; especially with the TODO above it makes sense. I did the minimal change and used 3 processes when there are no GPUs, else the number of GPUs, i.e. the first suggestion.
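A minimal sketch of the behavior described here, assuming hypothetical surrounding code (only the Gloo branch is shown; the fallback of 3 matches the MPI world size mentioned earlier):

```python
# Hypothetical sketch: Gloo tests keep running on CPU-only machines with
# 3 ranks, and use one rank per GPU when GPUs are available.
import torch
import torch.distributed as dist

DISTRIBUTED_TESTS_CONFIG: dict[str, dict[str, str]] = {}
num_gpus = torch.cuda.device_count()

if dist.is_gloo_available():
    DISTRIBUTED_TESTS_CONFIG["gloo"] = {
        # 3 processes when there are no GPUs, otherwise one per GPU.
        "WORLD_SIZE": f"{num_gpus if num_gpus else 3}",
    }
```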
Force-pushed from 715de14 to 91cf9b6.
kwen2501 left a comment:
Thanks for the change. Looks good to me.
@pytorchbot merge
Merge started: Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started: Your change will be merged while ignoring the following 1 check: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge -i
Merge started: Your change will be merged while ignoring the following 0 checks. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
If there aren't any GPUs, the WORLD_SIZE would be zero, which does not work.
So skip those backends completely in that case.
Fix after #137161
It might make sense to still run the (CPU-) part of the tests by using something like `world_size = max(3, gpu_count)` or `num_gpus if num_gpus else 3` instead of skipping them all.