🚀 The feature, motivation and pitch
Motivation
Cutlass is an efficient template library for compute-heavy GPU operations like GEMM, Convolution, and others.
We enabled CUTLASS to support Intel GPUs in sycl-tla, which adds SYCL support while maintaining interface consistency. It enables the generation of high-performance GEMM kernels for Intel GPUs.
Accordingly, we propose generalizing the cutlass integration in Inductor to enable high-performance GEMM support on Intel GPUs, consistent with the existing NVIDIA cutlass integration.
Proposal
Based on the principles of maximizing reuse of the existing infrastructure and minimizing code changes, we propose the following six-part design:
1. Code generation
The diagram below shows the workflow of how Inductor generates the cutlass kernel source code for a GEMM operation.
The classes marked in red in the diagram above are the ones we intend to generalize to add Intel GPU support.
1.1 CUDATemplate
We propose decoupling cutlass into an independent component by making it a subclass of KernelTemplate, similar to Triton.
We would then add a device_type attribute to the class for device-specific behavior, while reusing the shared cutlass kernel generation logic.
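As a rough sketch of this idea (the class and method names below are illustrative, not the actual Inductor API), the template would hang off a shared base and branch on `device_type` only where the backends genuinely differ:

```python
# Hypothetical sketch: decouple the Cutlass template from CUDA-specific code
# by subclassing a shared KernelTemplate base, mirroring how Triton templates
# are structured. Names are illustrative placeholders.

class KernelTemplate:
    """Shared base for all codegen templates (Triton, Cutlass, ...)."""
    def __init__(self, name: str):
        self.name = name

class CutlassTemplate(KernelTemplate):
    """Cutlass GEMM template with a device_type switch for CUDA vs. XPU."""
    def __init__(self, name: str, device_type: str = "cuda"):
        super().__init__(name)
        self.device_type = device_type

    def header(self) -> str:
        # Shared Cutlass includes, plus a device-specific runtime header.
        common = "#include <cutlass/gemm/device/gemm.h>"
        if self.device_type == "xpu":
            return common + "\n#include <sycl/sycl.hpp>"
        return common + "\n#include <cuda_runtime.h>"

tmpl = CutlassTemplate("gemm", device_type="xpu")
print(tmpl.header().splitlines()[1])  # prints: #include <sycl/sycl.hpp>
```

The point of the sketch is that only the header (and, downstream, the kernel launch) changes per device; the GEMM template body stays shared.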
1.2 CUDATemplateKernel, CUDATemplateCaller, CUDATemplateBuffer
Since most of the code here is reusable for Intel Cutlass, we are going to generalize it into CutlassTemplateKernel, CutlassTemplateCaller, and CutlassTemplateBuffer.
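A minimal sketch of the generalized trio, assuming hypothetical names and a pared-down interface (the real classes carry far more state):

```python
# Illustrative only: the CUDA-prefixed classes renamed to Cutlass-prefixed
# equivalents, with shared logic in the generic classes and the device read
# from the kernel they wrap.

class CutlassTemplateKernel:
    """Generalized from CUDATemplateKernel; device-agnostic codegen."""
    def __init__(self, kernel_name: str, device_type: str):
        self.kernel_name = kernel_name
        self.device_type = device_type

    def def_kernel(self) -> str:
        # Shared signature generation; only the launch mechanism would be
        # specialized per device downstream.
        return f"void {self.kernel_name}(const void* A, const void* B, void* C)"

class CutlassTemplateCaller:
    """Generalized from CUDATemplateCaller; wraps a kernel for autotuning."""
    def __init__(self, kernel: CutlassTemplateKernel):
        self.kernel = kernel

    def description(self) -> str:
        return f"cutlass[{self.kernel.device_type}] {self.kernel.kernel_name}"

caller = CutlassTemplateCaller(CutlassTemplateKernel("gemm_rrr", "xpu"))
print(caller.description())  # prints: cutlass[xpu] gemm_rrr
```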
2. Scheduling
Currently, CUDA uses CUDACombinedScheduling, which can handle both Triton and cutlass kernels, while Intel GPUs only support Triton.
We plan to rename CUDACombinedScheduling to CombinedScheduling and extend its support to Intel GPUs.
Both CUDA and Intel GPUs share substantial common logic, including Cutlass epilogue fusion decisions, kernel definitions, and kernel generation from the Cutlass template. Therefore, we propose generalizing CUDACPPScheduling into CUTLASSScheduling, so that it can be shared between CUDA and XPU.
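The intended dispatch can be sketched as follows; this is a simplified model with hypothetical method names, not the real scheduler interface:

```python
from types import SimpleNamespace

class TritonScheduling:
    """Handles Triton-generated kernels (already supported on CUDA and XPU)."""
    def codegen(self, node) -> str:
        return f"triton:{node.name}"

class CUTLASSScheduling:
    """Generalized from CUDACPPScheduling: epilogue-fusion decisions and
    kernel generation from the Cutlass template, shared by CUDA and XPU."""
    def codegen(self, node) -> str:
        return f"cutlass:{node.name}"

class CombinedScheduling:
    """Generalized from CUDACombinedScheduling: routes each node to the
    Triton or Cutlass scheduler, for any GPU backend rather than CUDA only."""
    def __init__(self):
        self._triton = TritonScheduling()
        self._cutlass = CUTLASSScheduling()

    def choose(self, node):
        return self._cutlass if node.is_cutlass else self._triton

sched = CombinedScheduling()
node = SimpleNamespace(name="gemm0", is_cutlass=True)
print(sched.choose(node).codegen(node))  # prints: cutlass:gemm0
```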
3. Compile and cache
In codecache.py, CUDACodeCache currently manages and compiles the generated cutlass kernels. For XPU, the only difference lies in the compilation step.
We propose abstracting CUDACodeCache into CUTLASSCodeCache, with CUDACodeCache and XPUCodeCache inheriting from it and implementing device-specific compilation logic.
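A sketch of the proposed inheritance, assuming a hash-keyed cache and placeholder compile steps (the real classes invoke nvcc and the SYCL compiler and manage shared-library artifacts):

```python
import hashlib

class CUTLASSCodeCache:
    """Shared cache keyed by source hash; subclasses supply compilation."""
    _cache: dict = {}

    @classmethod
    def load(cls, source: str) -> str:
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in cls._cache:
            cls._cache[key] = cls._compile(source)
        return cls._cache[key]

    @classmethod
    def _compile(cls, source: str) -> str:
        raise NotImplementedError  # device-specific

class CUDACodeCache(CUTLASSCodeCache):
    _cache = {}
    @classmethod
    def _compile(cls, source: str) -> str:
        # Placeholder for an nvcc invocation on the generated source.
        return f"nvcc: compiled {len(source)} bytes"

class XPUCodeCache(CUTLASSCodeCache):
    _cache = {}
    @classmethod
    def _compile(cls, source: str) -> str:
        # Placeholder for an icpx -fsycl invocation on the generated source.
        return f"icpx -fsycl: compiled {len(source)} bytes"

print(XPUCodeCache.load("// sycl kernel source").split(":")[0])  # prints: icpx -fsycl
```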

4. Run and benchmark
The Cutlass kernel benchmark functionality is currently implemented in CUDABenchmarkRequest, and we need a similar implementation for XPU. Therefore, we plan to generalize CUDABenchmarkRequest into CutlassBenchmarkRequest and add a device_type attribute to support device-specific logic.
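The shape of the generalization might look like the following sketch; the timing here is wall-clock as a stand-in, whereas the real implementation synchronizes the CUDA stream or SYCL queue depending on `device_type`:

```python
import time

class CutlassBenchmarkRequest:
    """Generalized from CUDABenchmarkRequest: the benchmarking loop is
    shared, while device_type selects stream/queue synchronization."""
    def __init__(self, kernel_fn, device_type: str = "cuda"):
        self.kernel_fn = kernel_fn
        self.device_type = device_type

    def benchmark(self, iters: int = 10) -> float:
        # Real code would synchronize the device here based on
        # self.device_type; perf_counter is a simplified stand-in.
        start = time.perf_counter()
        for _ in range(iters):
            self.kernel_fn()
        return (time.perf_counter() - start) / iters

req = CutlassBenchmarkRequest(lambda: sum(range(1000)), device_type="xpu")
print(req.benchmark() >= 0.0)  # prints: True
```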
5. Inductor Cutlass Configuration
Currently, all cutlass configuration options are placed under torch._inductor.config.cuda.
We plan to move these configurations to torch._inductor.config.cutlass for clearer separation and maintainability.
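As a toy illustration of the move (SimpleNamespace objects stand in for the real config module, and the option name is only an example of a cutlass knob currently under the cuda namespace):

```python
from types import SimpleNamespace

# Before: cutlass options mixed into the CUDA config namespace.
cuda = SimpleNamespace(cutlass_max_profiling_configs=4, arch="90")

# After: a dedicated cutlass namespace shared by CUDA and XPU, with the
# cuda namespace keeping only genuinely CUDA-specific knobs.
cutlass = SimpleNamespace(
    max_profiling_configs=cuda.cutlass_max_profiling_configs,
)

print(cutlass.max_profiling_configs)  # prints: 4
```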
6. Cutlass Path
Currently, the default cutlass path is third_party/cutlass.
We plan to add a new submodule, cutlass-sycl, and automatically set it as the default cutlass path when XPU is available.
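The selection logic could be as simple as the sketch below (function name and layout are hypothetical; only the two submodule names come from the proposal):

```python
import os

def default_cutlass_dir(device_type: str, repo_root: str = ".") -> str:
    # Prefer the SYCL-enabled fork when targeting XPU, otherwise the
    # upstream cutlass submodule, per the proposal above.
    sub = "cutlass-sycl" if device_type == "xpu" else "cutlass"
    return os.path.join(repo_root, "third_party", sub)

print(default_cutlass_dir("xpu"))   # ends with third_party/cutlass-sycl
print(default_cutlass_dir("cuda"))  # ends with third_party/cutlass
```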
Implementation Plan
- [Inductor XPU GEMM] Step 1/N: Refactor cutlass configuration. #160174
- [Inductor XPU GEMM] Step 2/N: Move out cutlass files from torch/_inductor/codegen/cuda #160685
- [Inductor XPU GEMM] Step 3/N: Refactor CUDATemplate to CUTLASSTemplate. #160686
- [Inductor XPU GEMM] Step 4/N: Refactor CUDAKernel to CUTLASSKernel. #160687
- [Inductor XPU GEMM] Step 5/N: Refactor CUDACombinedScheduling and CUDACppScheduling. #160688
- [Inductor XPU GEMM] Step 6/N: Refactor CUDACodeCache. #160706
- [Inductor XPU GEMM] Step 7/N: Refactor CUDABenchmarkRequest #160729
- [Inductor XPU GEMM] Step 8/N: Add XPU code compilation and codecache. #161938
- [Inductor XPU GEMM] Step 9/N: Support generating XPU cutlass gemm kernel #161939
- [xpu][feature][Inductor XPU GEMM] Step 10/N: Enable XPU sycl-tla(Intel cutlass) backend. #161940
- Generalize the test suite test/inductor/test_cutlass_backend.py and reuse it for Intel GPU.
Test Plan
Generalize test/inductor/test_cutlass_backend.py and reuse the test cases for Intel GPU.
Alternatives
No response
Additional context
No response