Add a new distributed backend (XCCL) for Intel GPUs #136343
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136343
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit 385c218 with merge base fe0e9fb. This comment was automatically generated by Dr. CI and updates every 15 minutes.
malfet left a comment:
It looks like this PR introduces a lot of code duplication by renaming NCCL to XCCL.
If collective communication principles on XPUs are very different from those on GPUs, then perhaps it does indeed require a different backend; but if they are the same, perhaps the code could be refactored to share more of the infrastructure with NCCL.
Also, I would like to challenge the user story here a bit.
It would be great if users did not have to modify their programs for every new backend, but could rather use with torch.device('xpu'): and keep the rest of the code unchanged.
I asked a question in the original RFC about the detailed product line supported. Would appreciate comments :)
Hi @malfet, thank you for your comments. We are currently working on refactoring the code to improve XCCL's ability to reuse the NCCL backend; meanwhile, we will still need to maintain some differences for XCCL. For backend selection, PyTorch has implemented a feature that performs automatic backend selection: NCCL on CUDA devices and XCCL on XPU devices, correspondingly. Therefore, when users don't want to specify a backend explicitly, this automatic selection takes effect without any modification to their programs. I think that is what you want; is my understanding right? Let me show you a common example: users can initialize the global default process group with init_process_group without naming a backend; when they then call distributed collectives such as all_reduce on XPU tensors, the calls are dispatched to the XCCL backend; and high-level distributed components such as DDP and FSDP work on top of this without backend-specific changes.
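As an illustration, a minimal sketch of that flow could look like the following (assuming an XPU build where the xccl backend is registered, and a torchrun-style launcher that sets the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables; the exact usage in this PR's tests may differ):

```python
import os

import torch
import torch.distributed as dist

# Launched e.g. via: torchrun --nproc-per-node=2 example.py
local_rank = int(os.environ["LOCAL_RANK"])

# No backend string: the backend is selected automatically for the device in use
# (e.g. NCCL for CUDA tensors, XCCL for XPU tensors) when it is available.
dist.init_process_group()

device = torch.device("xpu", local_rank)
t = torch.ones(4, device=device)

# The collective dispatches to the backend registered for XPU tensors.
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: {t.cpu()}")

dist.destroy_process_group()
```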
Hi @kwen2501, I responded to the product line question in the RFC and would like to reference it here: "For the product line, we are developing on Intel® Data Center GPU Max Series based on Intel® Tiber™ Developer Cloud and will continue to enable next gen of Data Center GPUs afterwards." Let me know if you have other questions. Thanks.
That's clear, thanks @zhangxiaoli73 |
Can you add a short description to the build logic in this PR?
Such as the "if sth exists / sth enabled; else ..." decisions.
It can help us keep a record.
(Would be even better if you could include it in some of the cmake files -- to make them more readable!)
Thanks!
cmake/Modules/FindXCCL.cmake (outdated)
set(XCCL_ROOT "")
if(DEFINED ENV{CCL_ROOT})
  set(XCCL_ROOT $ENV{CCL_ROOT})
endif()
What is the relationship between XCCL_ROOT and CCL_ROOT?
It looks like CCL_ROOT is the user-facing env?
Yes, CCL_ROOT is a user-facing environment variable that is automatically set by Intel oneCCL.
caffe2/CMakeLists.txt (outdated)
target_compile_definitions(torch_xpu PUBLIC USE_C10D_XCCL)
set_source_files_properties(
  ${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupXCCL.cpp
  PROPERTIES COMPILE_DEFINITIONS "CCL_ENABLE_ZE;CCL_ENABLE_SYCL")
for my education, what do "CCL_ENABLE_ZE;CCL_ENABLE_SYCL" stand for?
Those flags are needed by oneCCL (a communication library from Intel) for the gcc build, to let oneCCL know that we are building host code with gcc in the framework and need the device kernels at runtime.
@Chao1Han , please add some comments to elaborate on the motivation.
Sure.
> let oneCCL know that we are building host code by gcc in framework and need device kernel in runtime.
@Chao1Han CCL is a library, while CCL_ENABLE_ZE;CCL_ENABLE_SYCL are compilation flags. How is the Intel CCL library made aware of these compilation flags?
Those flags are needed by oneCCL to find the correct headers. We will define those flags in the XCCL backend file instead of passing them to the gcc compiler.
Two more general questions:
if(USE_XPU)
  if(USE_XCCL)
    append_filelist("libtorch_xpu_distributed_extra_sources" Caffe2_XPU_SRCS)
Will libtorch_xpu_distributed_extra_sources be used in different places? It seems like libtorch_xpu_distributed_extra_sources is an unused variable right now.
The purpose of the append_filelist function is to append the files listed in the first variable to the second variable. libtorch_xpu_distributed_extra_sources is defined in build_variables.bzl. The reason ProcessGroupXCCL.cpp is added to this list instead of being directly appended to the Caffe2_XPU_SRCS variable is that we plan to add more utility files to the list later.
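For reference, a sketch of how such a list is typically declared in build_variables.bzl (the exact path and final contents below are assumptions, not a quote of the PR):

```python
# build_variables.bzl (Python/Starlark syntax sketch)
libtorch_xpu_distributed_extra_sources = [
    "torch/csrc/distributed/c10d/ProcessGroupXCCL.cpp",
    # further XCCL utility files are expected to be appended here later
]
```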
if (torch.cuda.is_available() and torch.cuda.device_count() >= x) or \
    (torch.xpu.is_available() and torch.xpu.device_count() >= x):
Should we provide comprehensive support in the tests? Why does at_least_x_gpu not need to support Intel GPU? Is it because we do not test it?
Yes, we will provide sufficient tests. This modification is limited to this particular section because the test cases for this PR currently only involve this part. Changes to unit tests will be coordinated with the frontend PRs (e.g., DDP, FSDP, etc.).
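For illustration, a device-agnostic version of such a check could be wrapped in a helper along these lines (the helper name and exact shape are assumptions, not what this PR ships):

```python
import torch

def at_least_x_accelerators(x: int) -> bool:
    """Return True if at least x CUDA or XPU devices are visible."""
    if torch.cuda.is_available() and torch.cuda.device_count() >= x:
        return True
    if torch.xpu.is_available() and torch.xpu.device_count() >= x:
        return True
    return False
```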
try:
    from torch._C._distributed_c10d import ProcessGroupXCCL

    ProcessGroupXCCL.__module__ = "torch.distributed.distributed_c10d"
    __all__ += ["ProcessGroupXCCL"]
except ImportError:
    _XCCL_AVAILABLE = False
Is it possible to refine the logic for different backends a little bit? It seems like we just copy-paste the code.
Yes, this follows the logic of UCC, NCCL, and Gloo. We can consider refining this logic later.
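One possible shape of that later refinement, sketched here only to make the idea concrete (the loop and names are assumptions, not the PR's actual code):

```python
from torch._C import _distributed_c10d

__all__: list = []  # in the real module this list already exists

# Bind whichever optional ProcessGroup implementations this build provides,
# instead of repeating one try/except block per backend.
for _name in ("ProcessGroupNCCL", "ProcessGroupGloo", "ProcessGroupUCC", "ProcessGroupXCCL"):
    _pg = getattr(_distributed_c10d, _name, None)
    if _pg is not None:
        _pg.__module__ = "torch.distributed.distributed_c10d"
        __all__.append(_name)
```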
Hi @kwen2501, sure. The build logic for the XCCL backend is: USE_XCCL is turned ON only when the Intel SYCL runtime and the oneCCL runtime library are found and both USE_DISTRIBUTED and USE_XPU are ON; otherwise the XCCL sources are not built.
Hi @malfet,
Makes sense. We will follow up with an abstraction design to make the code more device-agnostic, and review it with you.
According to @zhangxiaoli73's comments, this feature should already be implemented in PyTorch.
Any comments on the above option? Thanks.
Yeah, fine with me.
setXCCLEnvVar("CCL_PROCESS_LAUNCHER", "none");
setXCCLEnvVar("CCL_LOCAL_RANK", local_rank);
setXCCLEnvVar("CCL_LOCAL_SIZE", local_world_size);
nit: this kind of passing may not work well when there are n-dimensional groups, each with different local rank and local world_size.
Would the oneCCL library consider accepting those as API arguments?
Yes, we understand the concern. We have asked the oneCCL library to detect that information automatically instead of it being passed manually, so this code has been removed.
- if device_id is not None and (device_id.index is None or device_id.type != "cuda"):
+ if device_id is not None and (
+     device_id.index is None
+     or (device_id.type != "cuda" and device_id.type != "xpu")
+ ):
      raise ValueError(
          "init_process_group device_id parameter must be a cuda device with an "
-         "id, e.g. cuda:0, not just cuda or cpu"
+         "id, e.g. cuda:0, xpu, not just cuda or xpu or cpu"
nit: maybe we can just simplify the logic:
if device_id is not None and device_id.index is None:
    raise ValueError(
        "init_process_group device_id parameter must be a device with an index"
    )
Thanks, changes done.
- nGPUs = torch.cuda.device_count()
+ nGPUs = torch.xpu.device_count() if torch.xpu.is_available() else torch.cuda.device_count()
This change introduces an implicit device priority.
Can we determine which device count to use based on the backend: str that is passed in?
The parent function API:
def init_multigpu_helper(world_size: int, backend: str):
Agreed. We will rework those if/else branches in this PR to make the code generic.
malfet left a comment:
Can we split this PR into 3:
- First one just introduces some common refactors
- 2nd one adds XCCL bindings
- Last one adds testing (which requires some new HW to be added to the infra, doesn't it?)
if cmake_cache_vars["USE_XCCL"]:
    report("-- Building XCCL library")
else:
    report("-- Not using XCCL")
This list keeps growing, and I'm not sure anyone is reading it. Can you elaborate on why this is needed, other than to mimic the NCCL behavior?
The results of this list will appear in the final part of the successful build output, but the information may repeat some of the flags printed at the beginning of the build. I think its purpose is to provide some flag information in the log after an incremental build, since small changes to the code do not trigger the message printing from the CMakeLists.txt.
test/distributed/test_c10d_common.py (outdated)
device_count = (
    torch.xpu.device_count()
    if torch.xpu.is_available()
    else torch.cuda.device_count()
)
It would be good to move this into torch/testing/_internal as, say, get_device_count() or something like that, because you are making a similar change in 3-4 places in the code.
Makes sense. We will generalize this code for the tests as well.
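A sketch of such a helper, keyed on the backend string as suggested above (the name comes from the review comment; the exact signature and mapping are assumptions):

```python
import torch

def get_device_count(backend: str) -> int:
    """Return the number of devices relevant to the given distributed backend."""
    if backend == "xccl":
        return torch.xpu.device_count()
    if backend == "nccl":
        return torch.cuda.device_count()
    # Fallback: prefer whichever accelerator is present.
    if torch.xpu.is_available():
        return torch.xpu.device_count()
    return torch.cuda.device_count()
```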
UCC = 3,
MPI = 4,
CUSTOM = 5,
XCCL = 6,
Should we keep CUSTOM as the last option?
Right. Changes done.
if (backendType == ProcessGroup::BackendType::GLOO ||
    backendType == ProcessGroup::BackendType::NCCL ||
    backendType == ProcessGroup::BackendType::XCCL ||
    backendType == ProcessGroup::BackendType::UCC) {
Should we invert the conditional? Or perhaps add inline bool backendSupportsSequenceNumbers() and call it here and on line 533 as well?
Yes, it makes sense to put the condition check into a function instead of a long if/else.
int getXCCLEnvVar(std::string envVarName) {
  char* stringValue = std::getenv(envVarName.c_str());
  if (stringValue != nullptr) {
    try {
      int val = std::stoi(stringValue);
      return val;
    } catch (std::exception& e) {
      TORCH_CHECK(
          false,
          "Invalid value for environment variable: " + std::string(envVarName));
    }
  } else {
    return -1;
  }
}
This function does not look specific to XCCL, does it? Can it go into some sort of a generic header?
Yes, this function is not specific to XCCL, but it is no longer needed: due to a code logic change in XCCL, there is no need to parse these environment variables anymore.
@malfet Makes sense. This PR is too large to review, so we plan to split it into several smaller ones, starting from the first point: common code refactoring and generalization. We're a bit unsure whether to continue refactoring in this PR or to create a new one. Your comments are welcome.
Perhaps it would be better to start by writing an RFC issue or a Google doc with the proposed refactoring steps; we can chat on Slack and discuss until it looks good, then start implementing them? cc @kwen2501
@wconstab, thanks. It is helpful to ensure we are on the same page first. @Chao1Han, @zhangxiaoli73, let's do that first.
Just thinking out loud here: as discussed in our meeting, PyTorch 2.5 can auto-load/import a backend's module if it is detected in the installed Python package environment. This would eliminate the need for users to do something like an explicit backend import. Following that capability, it seems the … @Chao1Han, what do you think?
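For context, this is roughly how a separately installed backend package can hook into torch.distributed today via the third-party backend API (the creator function below is a hypothetical placeholder, not part of this PR):

```python
import torch.distributed as dist

def _create_xccl_backend(store, rank, world_size, timeout):
    # Hypothetical creator: would construct and return a ProcessGroupXCCL
    # instance bound to the given store, rank, and world size.
    raise NotImplementedError

# Register "xccl" so that init_process_group(backend="xccl") resolves to it;
# devices=["xpu"] lets the device-to-backend mapping pick it for XPU tensors.
dist.Backend.register_backend("xccl", _create_xccl_backend, devices=["xpu"])
```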
Thanks, @kwen2501! You provided a great alternative with the auto-load capability.
Citing @malfet's [comment](#136343 (review)) in #136343:

> It would be great, if users do not have to modify their programs for every new backend, but rather use with torch.device('xpu'): and keep rest of the code unchanged.

This PR makes the backend specification ("nccl", "gloo") optional when the user provides a `device_id` to `init_process_group` (the acceptance of `device_id` was previously added for the purpose of eager init). New user experience:

```
device = torch.device(device_type, rank % device_count)
dist.init_process_group(device_id=device)
```

The line `device = torch.device(...)` is needed anyway, because the user would use it for tensor creation etc.

Pull Request resolved: #140963
Approved by: https://github.com/wconstab
Closing this since #141856 has been merged.
Motivation:
Corresponding to [RFC] Intel GPU Distributed Support in PyTorch, this is the first PR to enable distributed support on Intel GPUs (the device name for Intel GPUs in PyTorch is XPU). In this PR, we would like to add a new distributed backend `ProcessGroupXCCL` and implement `allreduce` as an entry point.

Solution:
- Add `ProcessGroupXCCL` with name `xccl`, which represents the XPU Collective Communications Library in this post. `ProcessGroupXCCL` inherits from the `c10::Backend` class, like `ProcessGroupNCCL`.
- Implement `allreduce` for `ProcessGroupXCCL` in this PR. More collectives for `ProcessGroupXCCL` will be added next.
- Add `ProcessGroupXCCL` to `libtorch_xpu.so` with the build flag `USE_XCCL`. This flag is ON only when the Intel SYCL runtime and the oneCCL runtime library are installed, and `USE_DISTRIBUTED` and `USE_XPU` are also ON.

Example:
Here is a simple example of using spawn to launch the `XCCL` backend and perform `allreduce` on XPU tensors.
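The original snippet was collapsed in this rendering; a minimal sketch of such a script might look like the following (assuming a machine with at least two XPUs and an XCCL-enabled build; the exact code in the PR may differ):

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("xccl", rank=rank, world_size=world_size)

    # Each rank contributes a tensor on its own XPU; allreduce sums them in place.
    t = torch.full((4,), rank + 1.0, device=f"xpu:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: {t.cpu()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)
```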
UT Plan:
Add collective unit test cases in test/distributed/test_c10d_xccl.py to verify the correct registration of `ProcessGroupXCCL` and the correctness of the `allreduce` operation.

Additional Context:
cc @jgong5 @gujinghui @EikanWang @fengyuan14 @guangyey
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o