
Conversation

@dvrogozh (Contributor) commented Dec 6, 2024

@pytorch-bot bot commented Dec 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142184

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fce2f63 with merge base 2bfc600:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the labels "oncall: distributed" (Add this issue/PR to distributed oncall triage queue) and "release notes: distributed (c10d)" (release notes category) on Dec 6, 2024
kwen2501 requested a review from albanD on December 6, 2024 00:52
@albanD (Collaborator) left a comment

Thanks for the fix!
That sounds good to me.

FYI @wconstab

@albanD (Collaborator) commented Dec 6, 2024

@pytorchbot merge

pytorch-bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) on Dec 6, 2024
@pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Dec 6, 2024
This commit adds support for non-CUDA PyTorch backend devices
to vision models. The commit was verified on the
Llama3.2-11B-Vision-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Note that this commit requires a fix on the PyTorch side for the
gloo torch.distributed backend to restore TLS on gloo worker threads.

Requires: pytorch/pytorch#142184
Signed-off-by: Dmitry Rogozhkin <[email protected]>
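The TLS problem the required PyTorch fix addresses can be illustrated with a pure-Python analogue: state tracked in thread-local storage on the caller's thread (as a mode like inference_mode is, conceptually) is invisible on a worker thread unless it is explicitly captured and restored there. The sketch below is illustrative only; all names are hypothetical, and it does not use PyTorch's actual ThreadLocalState machinery.

```python
import threading

# Thread-local flag standing in for a per-thread mode such as
# torch.inference_mode (illustrative analogue, not PyTorch code).
_tls = threading.local()

def set_mode(enabled):
    _tls.mode = enabled

def get_mode():
    return getattr(_tls, "mode", False)

def run_on_worker_broken(fn):
    """Run fn on a worker thread WITHOUT restoring the caller's TLS."""
    result = {}
    def worker():
        result["mode_seen"] = get_mode()  # caller's mode is lost here
        result["value"] = fn()
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result

def run_on_worker_fixed(fn):
    """Capture the caller's TLS and restore it on the worker thread."""
    captured = get_mode()  # snapshot caller-side state before dispatch
    result = {}
    def worker():
        set_mode(captured)  # restore TLS before running the task
        result["mode_seen"] = get_mode()
        result["value"] = fn()
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result

set_mode(True)  # caller enables the mode on its own thread
print(run_on_worker_broken(lambda: 1)["mode_seen"])  # False: state lost
print(run_on_worker_fixed(lambda: 1)["mode_seen"])   # True: state restored
```

The "broken" variant mirrors the pre-fix behavior described above: work dispatched to gloo worker threads did not see the caller's thread-local state, so collective ops misbehaved under inference_mode; the "fixed" variant restores the captured state before running the task.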
dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Jan 14, 2025 (same commit message as above)
dvrogozh added a commit to dvrogozh/llama-models that referenced this pull request Jan 14, 2025 (same commit message as above)
ashwinb pushed a commit to meta-llama/llama-models that referenced this pull request Jan 28, 2025
* feat: support non-cuda devices for text models

This commit adds support for non-CUDA PyTorch backend devices
to text models. It extends the existing test to run on an
externally specified device (cuda is the default). The commit was
verified on the Llama3.2-3B-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Co-authored-by: anordin95 <[email protected]>
Signed-off-by: Dmitry Rogozhkin <[email protected]>

* feat: support non-cuda devices for vision models

This commit adds support for non-CUDA PyTorch backend devices
to vision models. The commit was verified on the
Llama3.2-11B-Vision-Instruct model for:
* "cuda" device type on an NVidia A10 GPU
* "cpu" device type
* "xpu" device type on an Intel Data Center Max Series GPU (PVC)

Note that this commit requires a fix on the PyTorch side for the
gloo torch.distributed backend to restore TLS on gloo worker threads.

Requires: pytorch/pytorch#142184
Signed-off-by: Dmitry Rogozhkin <[email protected]>

* tests: test cpu and on-device inference

This change modifies the reference-inference test so that CPU
inference is always tested, and on-device inference is tested when
a device is available (checking for cuda and then xpu, in that
order) or when the user explicitly specifies a device to test via
the DEVICE environment variable.

Signed-off-by: Dmitry Rogozhkin <[email protected]>

---------

Signed-off-by: Dmitry Rogozhkin <[email protected]>
Co-authored-by: anordin95 <[email protected]>
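The device-selection order the test commit describes (an explicit DEVICE environment variable wins, otherwise probe cuda and then xpu, falling back to cpu) can be sketched as follows. This is a hedged illustration: the function and parameter names are hypothetical and not part of llama-models, and the availability probes are injected as plain booleans so the logic can be shown without requiring PyTorch.

```python
import os

def select_device(available, env=None):
    """Pick a device for the test run.

    An explicit DEVICE env var takes priority; otherwise probe
    "cuda" and then "xpu", in that order, and fall back to "cpu".

    `available` maps device type -> bool (injected probe results);
    in real code these would come from calls such as
    torch.cuda.is_available(). Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    explicit = env.get("DEVICE")
    if explicit:
        return explicit
    for dev in ("cuda", "xpu"):  # probe order from the commit message
        if available.get(dev):
            return dev
    return "cpu"

print(select_device({"cuda": False, "xpu": True}, env={}))              # xpu
print(select_device({"cuda": False, "xpu": False}, env={}))             # cpu
print(select_device({"cuda": True}, env={"DEVICE": "xpu"}))             # xpu
```

Injecting the probes and the environment keeps the selection logic itself unit-testable on any machine, which matches the commit's goal of always exercising the CPU path while opportunistically covering accelerators.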

Labels

* ciflow/trunk: Trigger trunk jobs on your pull request
* Merged
* oncall: distributed: Add this issue/PR to distributed oncall triage queue
* open source
* release notes: distributed (c10d): release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Distributed collective ops fail in inference_mode for CPU-only

4 participants