
Conversation

@dvrogozh (Contributor) commented Dec 6, 2024

@pytorch-bot bot commented Dec 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142184

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fce2f63 with merge base 2bfc600:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the labels "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) on Dec 6, 2024
kwen2501 requested a review from albanD on December 6, 2024 00:52
@albanD (Collaborator) left a comment

Thanks for the fix!
That sounds good to me.

FYI @wconstab

@albanD (Collaborator) commented Dec 6, 2024

@pytorchbot merge

pytorch-bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on Dec 6, 2024
@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Dec 6, 2024
This commit adds support for non-CUDA PyTorch backend devices
to vision models. The commit was verified on the
Llama3.2-11B-Vision-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Note that this commit requires a fix on the PyTorch side for the
gloo torch.distributed backend to restore TLS on gloo worker threads.

Requires: pytorch/pytorch#142184
Signed-off-by: Dmitry Rogozhkin <[email protected]>
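The TLS problem the required PyTorch fix addresses can be illustrated with a pure-Python analogue: state tracked in thread-local storage on the caller's thread (as a mode like inference_mode is, conceptually) is invisible on a worker thread unless it is explicitly captured and restored there. The sketch below is illustrative only; all names are hypothetical, and it does not use PyTorch's actual ThreadLocalState machinery.

```python
import threading

# Thread-local flag standing in for a per-thread mode such as
# torch.inference_mode (illustrative analogue, not PyTorch code).
_tls = threading.local()

def set_mode(enabled):
    _tls.mode = enabled

def get_mode():
    return getattr(_tls, "mode", False)

def run_on_worker_broken(fn):
    """Run fn on a worker thread WITHOUT restoring the caller's TLS."""
    result = {}
    def worker():
        result["mode_seen"] = get_mode()  # caller's mode is lost here
        result["value"] = fn()
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result

def run_on_worker_fixed(fn):
    """Capture the caller's TLS and restore it on the worker thread."""
    captured = get_mode()  # snapshot caller-side state before dispatch
    result = {}
    def worker():
        set_mode(captured)  # restore TLS before running the task
        result["mode_seen"] = get_mode()
        result["value"] = fn()
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result

set_mode(True)  # caller enables the mode on its own thread
print(run_on_worker_broken(lambda: 1)["mode_seen"])  # False: state lost
print(run_on_worker_fixed(lambda: 1)["mode_seen"])   # True: state restored
```

The "broken" variant mirrors the pre-fix behavior described above: work dispatched to gloo worker threads did not see the caller's thread-local state, so collective ops misbehaved under inference_mode; the "fixed" variant restores the captured state before running the task.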
dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Jan 14, 2025 (same commit message as above)
dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Jan 14, 2025 (same commit message as above)
ashwinb pushed a commit to meta-llama/llama-models that referenced this pull request Jan 28, 2025
* feat: support non-cuda devices for text models

This commit adds support for non-CUDA PyTorch backend devices
to text models. It extends the existing test to run on an
externally specified device (cuda is the default). The commit was
verified on the Llama3.2-3B-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Co-authored-by: anordin95 <[email protected]>
Signed-off-by: Dmitry Rogozhkin <[email protected]>

* feat: support non-cuda devices for vision models

This commit adds support for non-CUDA PyTorch backend devices
to vision models. The commit was verified on the
Llama3.2-11B-Vision-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Note that this commit requires a fix on the PyTorch side for the
gloo torch.distributed backend to restore TLS on gloo worker threads.

Requires: pytorch/pytorch#142184
Signed-off-by: Dmitry Rogozhkin <[email protected]>

* tests: test cpu and on-device inference

This change modifies the reference-inference test so that CPU
inference is always tested, and on-device inference is tested when
a device is available (checking for cuda and then xpu, in that
order) or when the user explicitly specifies a device to test via
the DEVICE environment variable.

Signed-off-by: Dmitry Rogozhkin <[email protected]>

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: anordin95 <[email protected]>
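The device-selection order the test commit describes (an explicit DEVICE environment variable wins, otherwise probe cuda and then xpu, falling back to cpu) can be sketched as follows. This is a hedged illustration: the function and parameter names are hypothetical and not part of llama-models, and the availability probes are injected as plain booleans so the logic can be shown without requiring PyTorch.

```python
import os

def select_device(available, env=None):
    """Pick a device for the test run.

    An explicit DEVICE env var takes priority; otherwise probe
    "cuda" and then "xpu", in that order, and fall back to "cpu".

    `available` maps device type -> bool (injected probe results);
    in real code these would come from calls such as
    torch.cuda.is_available(). Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    explicit = env.get("DEVICE")
    if explicit:
        return explicit
    for dev in ("cuda", "xpu"):  # probe order from the commit message
        if available.get(dev):
            return dev
    return "cpu"

print(select_device({"cuda": False, "xpu": True}, env={}))              # xpu
print(select_device({"cuda": False, "xpu": False}, env={}))             # cpu
print(select_device({"cuda": True}, env={"DEVICE": "xpu"}))             # xpu
```

Injecting the probes and the environment keeps the selection logic itself unit-testable on any machine, which matches the commit's goal of always exercising the CPU path while opportunistically covering accelerators.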

Labels

* ciflow/trunk: Trigger trunk jobs on your pull request
* Merged
* oncall: distributed: Add this issue/PR to distributed oncall triage queue
* open source
* release notes: distributed (c10d): release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Distributed collective ops fail in inference_mode for CPU-only

4 participants