
Conversation

@frost-intel (Collaborator) commented Jul 17, 2025

Adds support for FlightRecorder in ProcessGroupXCCL.

See intel/torch-xpu-ops#1867 for the XCCL implementation and more details.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey
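For context, a minimal sketch of exercising the new recorder (not part of this PR's diff): it assumes ProcessGroupXCCL reuses the NCCL-named Flight Recorder toggles and the private torch._C._distributed_c10d._dump_nccl_trace() entry point, which despite its legacy name returns the recorder state for the active backend.

import os
import pickle

import torch
import torch.distributed as dist

# Assumption: the NCCL-named env var also sizes the XCCL ring buffer.
# It must be set before the process group is created.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")

dist.init_process_group()  # picks the "xccl" backend on XPU builds
device = torch.device("xpu", dist.get_rank() % torch.xpu.device_count())

t = torch.ones(8, device=device)
dist.all_reduce(t)  # recorded as one entry in the flight recorder buffer

# Legacy-named dump hook; returns a pickled dict with "entries" etc.
trace = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
for entry in trace["entries"]:
    print(entry["profiling_name"])  # e.g. "xccl:all_reduce"

dist.destroy_process_group()

Run under torchrun with two or more ranks; each rank keeps its own buffer, so dumps are per process.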

@pytorch-bot (bot) commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158568

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 51cd2a0 with merge base a9fabeb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 17, 2025
@guangyey guangyey added the module: xpu Intel XPU related issues label Jul 24, 2025
    self.profiling_name = event["profiling_name"]
    nccl, name = self.profiling_name.split(":")
-   assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
+   assert nccl in ["nccl", "xccl"], (
Contributor commented:

nccl is a bit ambiguous; please consider using a clearer term.
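For context, the excerpt above checks the "<backend>:<op name>" prefix of each trace entry's profiling_name. A standalone sketch of the parse as it ends up later in this thread (the ccl_backend variable name comes from the revised diff; the maxsplit argument is an addition not in the PR):

def split_profiling_name(profiling_name: str) -> tuple[str, str]:
    # profiling_name looks like "nccl:all_reduce" or "xccl:all_reduce";
    # maxsplit=1 keeps any further colons inside the op name.
    ccl_backend, name = profiling_name.split(":", 1)
    assert ccl_backend in ["nccl", "xccl"], (
        f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
    )
    return ccl_backend, name

print(split_profiling_name("xccl:all_reduce"))  # ('xccl', 'all_reduce')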

@guangyey (Collaborator) commented Aug 5, 2025

@pytorchbot rebase

@guangyey (Collaborator) commented Aug 5, 2025

Sorry @frost-intel, I made a mistake when I helped resolve the conflicts in xpu.txt. Could you update xpu.txt again within this PR?

@guangyey (Collaborator) commented Aug 5, 2025

> Sorry @frost-intel, I made a mistake when I helped resolve the conflicts in xpu.txt. Could you update xpu.txt again within this PR?

GitHub is back. Done.

const std::tuple<std::string, std::string>& pg_name,
std::vector<uint64_t> ranks);

void record_accelerator_version(const std::string nccl_version);
Collaborator commented:

I think keeping only one parameter here is enough, e.g. void record_accelerator_version(const std::string ccl_version);. Then define a ccl_version_key_str next to

DEFINE_CONSTANT(nccl_version_key, "nccl_version")

and assign ccl_version to both nccl_version_str and ccl_version_str (keeping nccl_version_str for BC only). What do you think?
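In Python terms (the actual change is C++), a sketch of the suggested shape: a single recorded value surfaced under both dump keys, with the legacy key kept only for backward compatibility. Key names follow the proposal above.

def build_version_fields(ccl_version: str) -> dict[str, str]:
    # One parameter in, two keys out: "nccl_version" stays for BC only,
    # "ccl_version" is the backend-neutral key.
    return {
        "nccl_version": ccl_version,  # BC alias
        "ccl_version": ccl_version,
    }

print(build_version_fields("2021.13"))  # hypothetical oneCCL version string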

@frost-intel (Collaborator, Author) replied:

I think this is a good idea; I've fixed the code as you suggested.

Contributor replied:

Why not directly call it accelerator? Why ccl?

@guangyey (Collaborator) left a comment:

Just a nit, otherwise LGTM. I'll leave it to @zhangxiaoli73 to make the decision.

@guangyey guangyey marked this pull request as ready for review August 7, 2025 02:10
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@guangyey guangyey moved this to Review Required in PyTorch Intel Aug 7, 2025
@guangyey guangyey requested review from d4l3k and zhangxiaoli73 and removed request for zhangxiaoli73 August 7, 2025 02:43
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@frost-intel frost-intel requested a review from kwen2501 as a code owner August 8, 2025 20:07
@d4l3k d4l3k requested review from fduwjj and removed request for zhangxiaoli73 August 11, 2025 16:47
@fduwjj (Contributor) left a comment:

The only thing I have a concern about is the CCL part; everything else looks good to me.

-   assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
+   ccl_backend, name = self.profiling_name.split(":")
+   assert ccl_backend in ["nccl", "xccl"], (
+       f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
Contributor commented:

Shall we call it accelerator instead of ccl?

    DEFINE_CONSTANT(entries_key, "entries")
    DEFINE_CONSTANT(nccl_comm_key, "nccl_comm_state")
    DEFINE_CONSTANT(nccl_version_key, "nccl_version")
+   DEFINE_CONSTANT(ccl_version_key, "ccl_version")
Contributor commented:

I really think we should name it accelerator_version. nccl_version is something we should change, tbh, and I am OK with a BC-breaking change to accelerator_version so the code is generic enough. ccl is not ideal.

@frost-intel (Collaborator, Author) replied:

"accelerator" is a fairly overloaded term already. I was using "ccl" in the sense of "collective communications library" as is used for NCCL, oneCCL (xccl), gloo, etc. Here the version is the version of the library, not of the accelerator hardware itself. In my opinion accelerator seems more of a hardware-focused term, and I'm using "ccl" to refer to the comm software.

However, if you feel strongly that we should change it all to accelerator, that's fine.

Contributor replied:

I agree accelerator might be more related to HW, like GPU/TPU, etc., but I would rather we do this more explicitly and call it comm_lib_version. The reason I don't like ccl is that not all comm libraries have a name like that, right? For example, Gloo. But comm_lib is more explicit. And again, nccl is definitely not a good name, we admit that; we will clean it up in the future. Does this make sense?

@frost-intel (Collaborator, Author) replied:

I've renamed them to comm_lib. It seems the only breakage this causes is in test/distributed/test_c10d_nccl.py, which I also changed.
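For downstream consumers of dumped traces, a sketch of reading the version across old and new dumps (the new key name follows this thread; the exact spelling that landed is in the merged diff):

from typing import Optional

def get_comm_lib_version(trace: dict) -> Optional[str]:
    # Dumps written after this rename carry "comm_lib_version"; older
    # NCCL-only dumps wrote "nccl_version". Fall back for old traces.
    return trace.get("comm_lib_version", trace.get("nccl_version"))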

Contributor replied:

Thanks!


@frost-intel (Collaborator, Author) commented:

@pytorchbot rebase

@pytorchmergebot commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot commented:

Successfully rebased xpu_flightRecorder onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xpu_flightRecorder && git pull --rebase)

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Aug 20, 2025
@frost-intel (Collaborator, Author) commented:

@pytorchbot label ciflow/xpu ciflow/trunk

@pytorch-bot (bot) commented Aug 20, 2025

To add these label(s) (ciflow/xpu, ciflow/trunk) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@frost-intel (Collaborator, Author) commented:

@guangyey Could you relabel the PR with ciflow/trunk, ciflow/xpu? One test, test_transformers.py::TestSDPAXpuOnlyXPU::test_scaled_dot_product_fused_attention_mask_vs_math_fused_kernel1_float16_batch_size_4_n_head_32_q_size_32_kv_size_32_head_dim_128_mask_type_causal_train_False_xpu_float16, failed, but it also seems to be failing on other up-to-date XPU PRs, such as #161050.

@frost-intel frost-intel marked this pull request as draft August 21, 2025 12:31
@frost-intel frost-intel added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Aug 21, 2025
@pytorch-bot (bot) commented Aug 21, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) commented Aug 21, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/xpu Run XPU CI tasks ciflow/trunk Trigger trunk jobs on your pull request labels Aug 21, 2025
@frost-intel frost-intel marked this pull request as ready for review August 21, 2025 12:35
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 22, 2025
@guangyey (Collaborator) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2025
@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- ciflow/xpu (Run XPU CI tasks)
- Merged
- module: xpu (Intel XPU related issues)
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- release notes: distributed (c10d) (release notes category)

Projects

Status: Done

6 participants