
Conversation

@Flamefire (Collaborator) commented on Apr 7, 2025

If there aren't any GPUs, the WORLD_SIZE would be zero, which does not work.
So skip those backends completely in that case.

Fix after #137161

It might make sense to still run the (CPU) part of the tests by using something like world_size = max(3, gpu_count), or num_gpus if num_gpus else 3, instead of skipping them all.
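
A minimal sketch of what that fallback could look like (hypothetical; num_gpus mirrors what run_test.py derives from torch.cuda.device_count()):

```python
import torch

# Number of visible GPUs; 0 on a CPU-only machine.
num_gpus = torch.cuda.device_count()

# Option 1: one rank per GPU, falling back to 3 ranks when there are none.
world_size = num_gpus if num_gpus else 3  # equivalently: num_gpus or 3

# Option 2: never fewer than 3 ranks, even on single- or dual-GPU machines.
# world_size = max(3, num_gpus)
```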

@Flamefire requested review from kwen2501 and seemethere on April 7, 2025 at 11:43
@Flamefire requested a review from a team as a code owner on April 7, 2025 at 11:43
@pytorch-bot (bot) commented on Apr 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150764

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1c75299 with merge base 7cae790:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added the "topic: not user facing" (topic category) label on Apr 7, 2025
test/run_test.py (outdated), review comment on lines 351 to 355:

-    if dist.is_gloo_available():
+    if dist.is_gloo_available() and num_gpus > 0:
         DISTRIBUTED_TESTS_CONFIG["gloo"] = {
             # TODO: retire testing gloo with CUDA
-            "WORLD_SIZE": f"{torch.cuda.device_count()}",
+            "WORLD_SIZE": f"{num_gpus}",
Collaborator


As you mentioned, maybe we can remove the num_gpus > 0 condition for Gloo, and set WORLD_SIZE to min(4, num_gpus)?

Collaborator Author


I'm not sure about the "4". And min would still yield zero when there are no GPUs, which was the reason for this PR.

The previous world size was 3 and still is for MPI. The current intention (before this PR) is to use one rank per GPU.

So I'd use: cuda_world_size = torch.cuda.device_count() or 3. Or even max(3, torch.cuda.device_count()). That might skip some tests on single and dual-GPU systems but likely matches the previous and intended behavior.
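
For illustration only (not part of the diff), a quick sketch of how the two expressions differ:

```python
import torch

num_gpus = torch.cuda.device_count()

# "device_count() or 3": falls back to 3 ranks only when there are no GPUs,
# e.g. 0 GPUs -> 3, 1 GPU -> 1, 2 GPUs -> 2, 4 GPUs -> 4.
option_a = num_gpus or 3

# "max(3, device_count())": never fewer than 3 ranks, even with 1 or 2 GPUs,
# e.g. 0 GPUs -> 3, 1 GPU -> 3, 2 GPUs -> 3, 4 GPUs -> 4.
option_b = max(3, num_gpus)
```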

What do you think?

Collaborator


I just want to remove the "and num_gpus > 0" condition for Gloo, so that we would still run the CPU tests on it.

Collaborator Author


I see; especially with the TODO above it makes sense. I made the minimal change: use 3 processes when there are no GPUs, otherwise the number of GPUs, i.e. the first suggestion.
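
Roughly, the resulting Gloo entry then looks like the sketch below (the name gloo_world_size is illustrative; the exact code in run_test.py may differ):

```python
import torch
import torch.distributed as dist

DISTRIBUTED_TESTS_CONFIG = {}  # defined earlier in run_test.py

num_gpus = torch.cuda.device_count()

if dist.is_gloo_available():
    # Run the CPU part of the Gloo tests even without GPUs: fall back to 3 ranks.
    gloo_world_size = num_gpus if num_gpus else 3
    DISTRIBUTED_TESTS_CONFIG["gloo"] = {
        # TODO: retire testing gloo with CUDA
        "WORLD_SIZE": f"{gloo_world_size}",
    }
```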

@bdhirsh added the "triaged" label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Apr 8, 2025
@kwen2501 (Collaborator) left a comment

Thanks for the change. Looks good to me.

@kwen2501 (Collaborator)

@pytorchbot merge

@pytorch-bot added the "ciflow/trunk" (Trigger trunk jobs on your pull request) label on Apr 28, 2025
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


@pytorchmergebot

Merge failed

Reason: 1 job has failed: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)


@kwen2501 (Collaborator)

@pytorchbot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 1 check: trunk / linux-focal-rocm-py3.10 / test (default, 2, 2, linux.rocm.gpu.2)


@pytorchmergebot

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information, see the pytorch-bot wiki.

@cyyever (Collaborator) commented on Apr 29, 2025

@pytorchbot merge -i

@pytorchmergebot

Merge started

Your change will be merged while ignoring the following 0 checks:


@Flamefire deleted the distr-test-gpu branch on April 29, 2025 at 06:45

Labels

ciflow/trunk · Merged · open source · topic: not user facing · triaged


6 participants