
Conversation

@frost-intel (Collaborator) commented Jul 17, 2025

Adds support for FlightRecorder in ProcessGroupXCCL.

See intel/torch-xpu-ops#1867 for the XCCL implementation and more details.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey
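For context, a minimal sketch of exercising the new recorder (not part of this PR's diff): it assumes ProcessGroupXCCL reuses the NCCL-named Flight Recorder toggles and the private torch._C._distributed_c10d._dump_nccl_trace() entry point, which despite its legacy name returns the recorder state for the active backend.

import os
import pickle

import torch
import torch.distributed as dist

# Assumption: the NCCL-named env var also sizes the XCCL ring buffer.
# It must be set before the process group is created.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")

dist.init_process_group()  # picks the "xccl" backend on XPU builds
device = torch.device("xpu", dist.get_rank() % torch.xpu.device_count())

t = torch.ones(8, device=device)
dist.all_reduce(t)  # recorded as one entry in the flight recorder buffer

# Legacy-named dump hook; returns a pickled dict with "entries" etc.
trace = pickle.loads(torch._C._distributed_c10d._dump_nccl_trace())
for entry in trace["entries"]:
    print(entry["profiling_name"])  # e.g. "xccl:all_reduce"

dist.destroy_process_group()

Run under torchrun with two or more ranks; each rank keeps its own buffer, so dumps are per process.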

@pytorch-bot (bot) commented Jul 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158568

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 51cd2a0 with merge base a9fabeb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Jul 17, 2025
@guangyey guangyey added the module: xpu Intel XPU related issues label Jul 24, 2025
    self.profiling_name = event["profiling_name"]
    nccl, name = self.profiling_name.split(":")
-   assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
+   assert nccl in ["nccl", "xccl"], (
Contributor commented:

nccl is a bit ambiguous; please consider using a clearer term.
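For context, the excerpt above checks the "<backend>:<op name>" prefix of each trace entry's profiling_name. A standalone sketch of the parse as it ends up later in this thread (the ccl_backend variable name comes from the revised diff; the maxsplit argument is an addition not in the PR):

def split_profiling_name(profiling_name: str) -> tuple[str, str]:
    # profiling_name looks like "nccl:all_reduce" or "xccl:all_reduce";
    # maxsplit=1 keeps any further colons inside the op name.
    ccl_backend, name = profiling_name.split(":", 1)
    assert ccl_backend in ["nccl", "xccl"], (
        f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
    )
    return ccl_backend, name

print(split_profiling_name("xccl:all_reduce"))  # ('xccl', 'all_reduce')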

@guangyey (Collaborator) commented Aug 5, 2025

@pytorchbot rebase

@guangyey (Collaborator) commented Aug 5, 2025

Sorry @frost-intel, I made a mistake when I helped resolve the conflicts in xpu.txt. Could you update xpu.txt again within this PR?

@guangyey (Collaborator) commented Aug 5, 2025

> Sorry @frost-intel, I made a mistake when I helped resolve the conflicts in xpu.txt. Could you update xpu.txt again within this PR?

GitHub is back. Done.

const std::tuple<std::string, std::string>& pg_name,
std::vector<uint64_t> ranks);

void record_accelerator_version(const std::string nccl_version);
Collaborator commented:

I think keeping only one parameter here is enough, e.g. void record_accelerator_version(const std::string ccl_version);. Then define a ccl_version_key_str next to

DEFINE_CONSTANT(nccl_version_key, "nccl_version")

and assign ccl_version to both nccl_version_str and ccl_version_str (keeping nccl_version_str for BC only). What do you think?
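In Python terms (the actual change is C++), a sketch of the suggested shape: a single recorded value surfaced under both dump keys, with the legacy key kept only for backward compatibility. Key names follow the proposal above.

def build_version_fields(ccl_version: str) -> dict[str, str]:
    # One parameter in, two keys out: "nccl_version" stays for BC only,
    # "ccl_version" is the backend-neutral key.
    return {
        "nccl_version": ccl_version,  # BC alias
        "ccl_version": ccl_version,
    }

print(build_version_fields("2021.13"))  # hypothetical oneCCL version string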

@frost-intel (Collaborator, Author) replied:

I think this is a good idea; I've fixed the code as you suggested.

Contributor replied:

Why not directly call it accelerator? Why ccl?

@guangyey (Collaborator) left a comment:

Just a nit, otherwise LGTM. I'll leave it to @zhangxiaoli73 to make the decision.

@guangyey guangyey marked this pull request as ready for review August 7, 2025 02:10
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@guangyey guangyey moved this to Review Required in PyTorch Intel Aug 7, 2025
@guangyey guangyey requested review from d4l3k and zhangxiaoli73 and removed request for zhangxiaoli73 August 7, 2025 02:43
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Aug 7, 2025
@frost-intel frost-intel requested a review from kwen2501 as a code owner August 8, 2025 20:07
@d4l3k d4l3k requested review from fduwjj and removed request for zhangxiaoli73 August 11, 2025 16:47
@fduwjj (Contributor) left a comment:

The only thing I have a concern about is the CCL part; everything else looks good to me.

-   assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
+   ccl_backend, name = self.profiling_name.split(":")
+   assert ccl_backend in ["nccl", "xccl"], (
+       f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
Contributor commented:

Shall we call it accelerator instead of ccl?

    DEFINE_CONSTANT(entries_key, "entries")
    DEFINE_CONSTANT(nccl_comm_key, "nccl_comm_state")
    DEFINE_CONSTANT(nccl_version_key, "nccl_version")
+   DEFINE_CONSTANT(ccl_version_key, "ccl_version")
Contributor commented:

I really think we should name it accelerator_version. nccl_version is something we should change, tbh, and I am OK with a BC-breaking change to accelerator_version so the code is generic enough. ccl is not ideal.

@frost-intel (Collaborator, Author) replied:

"accelerator" is a fairly overloaded term already. I was using "ccl" in the sense of "collective communications library" as is used for NCCL, oneCCL (xccl), gloo, etc. Here the version is the version of the library, not of the accelerator hardware itself. In my opinion accelerator seems more of a hardware-focused term, and I'm using "ccl" to refer to the comm software.

However, if you feel strongly that we should change it all to accelerator, that's fine.

Contributor replied:

I agree accelerator might be more related to HW, like GPU/TPU, etc., but I would rather we do this more explicitly and call it comm_lib_version. The reason I don't like ccl is that not all comm libraries have a name like that, right? For example, Gloo. But comm_lib is more explicit. And again, nccl is definitely not a good name, we admit that; we will clean it up in the future. Does this make sense?

@frost-intel (Collaborator, Author) replied:

I've renamed them to comm_lib. It seems the only breakage this causes is in test/distributed/test_c10d_nccl.py, which I also changed.
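For downstream consumers of dumped traces, a sketch of reading the version across old and new dumps (the new key name follows this thread; the exact spelling that landed is in the merged diff):

from typing import Optional

def get_comm_lib_version(trace: dict) -> Optional[str]:
    # Dumps written after this rename carry "comm_lib_version"; older
    # NCCL-only dumps wrote "nccl_version". Fall back for old traces.
    return trace.get("comm_lib_version", trace.get("nccl_version"))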

Contributor replied:

Thanks!


@frost-intel (Collaborator, Author) commented:

@pytorchbot rebase

@pytorchmergebot commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot commented:

Successfully rebased xpu_flightRecorder onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xpu_flightRecorder && git pull --rebase)

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Aug 20, 2025
@frost-intel (Collaborator, Author) commented:

@pytorchbot label ciflow/xpu ciflow/trunk

@pytorch-bot (bot) commented Aug 20, 2025

To add these label(s) (ciflow/xpu, ciflow/trunk) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@frost-intel (Collaborator, Author) commented:

@guangyey Could you relabel the PR with ciflow/trunk, ciflow/xpu? One test, test_transformers.py::TestSDPAXpuOnlyXPU::test_scaled_dot_product_fused_attention_mask_vs_math_fused_kernel1_float16_batch_size_4_n_head_32_q_size_32_kv_size_32_head_dim_128_mask_type_causal_train_False_xpu_float16, failed, but it also seems to be failing on other up-to-date XPU PRs, such as #161050.

@frost-intel frost-intel marked this pull request as draft August 21, 2025 12:31
@frost-intel frost-intel added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Aug 21, 2025
@pytorch-bot (bot) commented Aug 21, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot (bot) commented Aug 21, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/xpu Run XPU CI tasks ciflow/trunk Trigger trunk jobs on your pull request labels Aug 21, 2025
@frost-intel frost-intel marked this pull request as ready for review August 21, 2025 12:35
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 22, 2025
@guangyey (Collaborator) commented:

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2025
@pytorchmergebot commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.


Labels

- ciflow/trunk (Trigger trunk jobs on your pull request)
- ciflow/xpu (Run XPU CI tasks)
- Merged
- module: xpu (Intel XPU related issues)
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- open source
- release notes: distributed (c10d) (release notes category)

Projects

Status: Done

6 participants