[fr] [xpu] Add FlightRecorder support for ProcessGroupXCCL #158568
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158568
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 51cd2a0 with merge base a9fabeb.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
        self.profiling_name = event["profiling_name"]
        nccl, name = self.profiling_name.split(":")
        assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
        assert nccl in ["nccl", "xccl"], (
nccl is a bit ambiguous; please consider using a clearer term.
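For context, the check under discussion parses the backend prefix out of the profiling name. A minimal Python sketch of the generalized validation (the event dict shape mirrors the trace entries in the diff above; function and variable names here are illustrative, not the actual PyTorch code):

```python
def parse_backend(event: dict) -> tuple[str, str]:
    # profiling_name looks like "nccl:all_reduce" or "xccl:all_reduce",
    # i.e. "<backend>:<collective name>".
    profiling_name = event["profiling_name"]
    backend, name = profiling_name.split(":")
    # Accept both the CUDA (nccl) and XPU (xccl) backends.
    assert backend in ("nccl", "xccl"), (
        f"name formatting error? {backend} != 'nccl' or 'xccl'"
    )
    return backend, name
```

With this shape, adding another collective backend later is a one-line change to the accepted set rather than a new hard-coded equality check.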
@pytorchbot rebase
Sorry @frost-intel, I made a mistake while helping resolve the conflicts in xpu.txt. Could you update xpu.txt again within this PR?
GitHub is back. Done.
        const std::tuple<std::string, std::string>& pg_name,
        std::vector<uint64_t> ranks);

    void record_accelerator_version(const std::string nccl_version);
I think keeping only one parameter here is enough, e.g. void record_accelerator_version(const std::string ccl_version); and defining a ccl_version_key_str next to
    DEFINE_CONSTANT(nccl_version_key, "nccl_version")
then recording ccl_version under both nccl_version_str and ccl_version_str (keeping nccl_version_str for BC only). What do you think?
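The backward-compatible dual-key idea suggested here could be sketched as follows (key names are taken from this comment; this is an illustration of the proposal, not the actual C10D implementation):

```python
# Hypothetical dump-key constants mirroring the DEFINE_CONSTANT pair
# proposed above.
NCCL_VERSION_KEY = "nccl_version"  # legacy key, kept for BC only
CCL_VERSION_KEY = "ccl_version"    # proposed generic key

def record_ccl_version(dump: dict, ccl_version: str) -> None:
    # Write the same version string under both keys so existing
    # consumers of the flight-recorder dump keep working while new
    # consumers can read the backend-neutral key.
    dump[NCCL_VERSION_KEY] = ccl_version
    dump[CCL_VERSION_KEY] = ccl_version
```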
I think this is a good idea; I've fixed the code as you suggested.
why not directly call it accelerator? why ccl?
guangyey left a comment:
Just a nit, otherwise LGTM. Leave @zhangxiaoli73 to make the decision.
fduwjj left a comment:
The only thing I have a concern about is the CCL part; everything else looks good to me.
        assert nccl == "nccl", f"name formatting error? {nccl} != 'nccl'"
        ccl_backend, name = self.profiling_name.split(":")
        assert ccl_backend in ["nccl", "xccl"], (
            f"name formatting error? {ccl_backend} != 'nccl' or 'xccl'"
shall we call it accelerator instead of ccl?
    DEFINE_CONSTANT(entries_key, "entries")
    DEFINE_CONSTANT(nccl_comm_key, "nccl_comm_state")
    DEFINE_CONSTANT(nccl_version_key, "nccl_version")
    DEFINE_CONSTANT(ccl_version_key, "ccl_version")
I really think we should name it accelerator version. The nccl_version key is something we should change, tbh, and I am OK with a BC-breaking change to accelerator_version so the code is generic enough. ccl is not ideal, tbh.
"accelerator" is a fairly overloaded term already. I was using "ccl" in the sense of "collective communications library" as is used for NCCL, oneCCL (xccl), gloo, etc. Here the version is the version of the library, not of the accelerator hardware itself. In my opinion accelerator seems more of a hardware-focused term, and I'm using "ccl" to refer to the comm software.
However, if you feel strongly we should change it all to ccl, that's fine.
I agree accelerator is more of a hardware term, like GPU/TPU, etc., but I would rather we do this more explicitly and call it comm_lib_version. The reason I don't like ccl is that not all comm libraries have a name like that, right? For example, gloo. But comm_lib is more explicit. And again, nccl is definitely not a good name; we admit that, and we will clean it up in the future. Does that sound reasonable?
I've renamed them to comm_lib. It seems like the only break this causes is in test/distributed/test_c10d_nccl.py, which I also changed.
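After this rename, a downstream consumer of the flight-recorder dump would read the new key, e.g. (hypothetical sketch; the exact key strings come from the DEFINE_CONSTANT table above, and the fallback key is assumed for dumps produced before the rename):

```python
def get_comm_lib_version(dump: dict) -> str:
    # Prefer the new backend-neutral key; fall back to the legacy
    # nccl_version key for older dumps, else report "unknown".
    return dump.get("comm_lib_version", dump.get("nccl_version", "unknown"))
```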
Thanks!
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 003fdb4 to 51cd2a0.
@pytorchbot label ciflow/xpu ciflow/trunk
To add these labels (ciflow/xpu, ciflow/trunk) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page). This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@guangyey Could you relabel the issue with ciflow/trunk, ciflow/xpu? There was a test
To add the ciflow label, please first approve the workflows that are awaiting approval. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…58568)
Adds support for FlightRecorder in ProcessGroupXCCL.
See intel/torch-xpu-ops#1867 for XCCL implementation and more details.
Pull Request resolved: pytorch#158568
Approved by: https://github.com/guangyey, https://github.com/fduwjj
Adds support for FlightRecorder in ProcessGroupXCCL.
See intel/torch-xpu-ops#1867 for XCCL implementation and more details.
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @gujinghui @EikanWang @fengyuan14 @guangyey