[RFC] Enable cutlass to support Intel GPU into PyTorch Inductor. #160175


🚀 The feature, motivation and pitch

Motivation

CUTLASS is an efficient template library for compute-intensive GPU operations such as GEMM and convolution.

We have enabled CUTLASS support for Intel GPUs in sycl-tla, which adds a SYCL backend while keeping the interface consistent with upstream CUTLASS. This enables the generation of high-performance GEMM kernels for Intel GPUs.

Accordingly, we propose generalizing the CUTLASS integration in Inductor to enable high-performance GEMM support on Intel GPUs, consistent with the existing NVIDIA CUTLASS integration.

Proposal

Based on the principles of maximizing reuse of the existing infrastructure and minimizing code changes, we propose the following six-part design:

1. Code generation

The diagram below shows how Inductor generates the CUTLASS kernel source code for a GEMM operation.

[Diagram: Inductor CUTLASS GEMM code-generation workflow]

The classes marked in red in the diagram above are the ones we intend to generalize to add Intel GPU support.

1.1 CUDATemplate

We propose decoupling CUTLASS into an independent component by making CUDATemplate a subclass of KernelTemplate, as is already done for Triton.
A device_type attribute is then added to the class to carry device-specific behavior while reusing the shared CUTLASS kernel-generation logic.

[Diagram: proposed CUTLASS template class hierarchy]
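To make the intent concrete, here is a minimal sketch of the proposed hierarchy. The class shapes, the device_type check, and the header strings are all illustrative assumptions, not the actual Inductor code:

```python
class KernelTemplate:
    """Shared base for Inductor kernel templates (Triton, CUTLASS, ...)."""

    def __init__(self, name: str):
        self.name = name

    def render(self) -> str:
        raise NotImplementedError


class CutlassTemplate(KernelTemplate):
    """CUTLASS GEMM template; device_type selects the device-specific pieces."""

    def __init__(self, name: str, device_type: str):
        super().__init__(name)
        assert device_type in ("cuda", "xpu")
        self.device_type = device_type

    def globals_header(self) -> str:
        # Device-specific preamble; the kernel-emission logic below it
        # would be shared between both devices.
        if self.device_type == "xpu":
            return "// SYCL-specific includes (sycl-tla CUTLASS)"
        return "// CUDA-specific includes (upstream CUTLASS)"

    def render(self) -> str:
        return self.globals_header() + "\n// shared CUTLASS GEMM emission"
```

The point of the split is that only globals_header-style hooks diverge per device, while render (the bulk of the generation logic) stays shared.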
1.2 CUDATemplateKernel, CUDATemplateCaller, CUDATemplateBuffer

Since most of the code here is reusable for Intel CUTLASS, we plan to generalize these classes into CutlassTemplateKernel, CutlassTemplateCaller, and CutlassTemplateBuffer.

2. Scheduling

Currently, CUDA uses CUDACombinedScheduling, which can handle both Triton and CUTLASS kernels, while Intel GPUs support only Triton.
We plan to rename CUDACombinedScheduling to CombinedScheduling and extend it to Intel GPUs.
CUDA and Intel GPUs share substantial common logic, including CUTLASS epilogue-fusion decisions, kernel definitions, and kernel generation from the CUTLASS template. We therefore propose generalizing CUDACPPScheduling into CUTLASSScheduling so that it can be shared between CUDA and XPU.

[Diagram: proposed scheduling class hierarchy]
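A minimal sketch of the proposed scheduling split follows. The class names mirror the proposal, but the dispatch predicate (is_cutlass_template) is a hypothetical stand-in for Inductor's real node classification:

```python
class TritonScheduling:
    """Stand-in for Inductor's Triton scheduling."""


class CUTLASSScheduling:
    """Shared CUTLASS scheduling (epilogue-fusion decisions, kernel
    definition and generation), parameterized by device."""

    def __init__(self, device_type: str):
        self.device_type = device_type


class CombinedScheduling:
    """Routes each node to Triton or CUTLASS scheduling for a given device."""

    def __init__(self, device_type: str):
        self._triton = TritonScheduling()
        self._cutlass = CUTLASSScheduling(device_type)

    def scheduling_for(self, node) -> object:
        # Hypothetical predicate: CUTLASS-template GEMM nodes go to CUTLASS,
        # everything else stays on the Triton path.
        if getattr(node, "is_cutlass_template", False):
            return self._cutlass
        return self._triton
```

Because CombinedScheduling only holds a device_type, the same class serves both CUDA and XPU; nothing in the routing logic is device-specific.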
3. Compile and cache

In codecache.py, CUDACodeCache currently manages and compiles the generated CUTLASS kernels. For XPU, the only difference lies in the compilation step.
We propose abstracting CUDACodeCache into a CUTLASSCodeCache base class, with CUDACodeCache and XPUCodeCache inheriting from it and implementing the device-specific compilation logic.

[Diagram: proposed CUTLASSCodeCache class hierarchy]
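The cache abstraction could be sketched roughly as below. The compiler tags (nvcc, icpx) are stand-ins for the real build invocations, and the in-memory dict stands in for Inductor's on-disk cache:

```python
class CUTLASSCodeCache:
    """Shared cache for compiled CUTLASS kernels; subclasses supply only
    the device-specific compile step."""

    cache: dict = {}

    @classmethod
    def load(cls, source: str):
        key = (cls.__name__, source)
        if key not in cls.cache:
            cls.cache[key] = cls._compile(source)
        return cls.cache[key]

    @classmethod
    def _compile(cls, source: str):
        raise NotImplementedError


class CUDACodeCache(CUTLASSCodeCache):
    @classmethod
    def _compile(cls, source: str):
        return f"nvcc:{len(source)}"  # stand-in for the real nvcc invocation


class XPUCodeCache(CUTLASSCodeCache):
    @classmethod
    def _compile(cls, source: str):
        return f"icpx:{len(source)}"  # stand-in for the SYCL compiler invocation
```

All cache management (keying, lookup, insertion) lives once in the base class; each device overrides only _compile.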

4. Run and benchmark

CUTLASS kernel benchmarking is currently implemented in CUDABenchmarkRequest, and XPU needs a similar implementation. We therefore plan to generalize CUDABenchmarkRequest into CutlassBenchmarkRequest and add a device_type attribute to support device-specific logic.
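A rough sketch of the generalized benchmark request, using wall-clock timing and a no-op _synchronize as stand-ins for the real device event timers and torch.cuda/torch.xpu synchronization:

```python
import time


class CutlassBenchmarkRequest:
    """Device-agnostic benchmark request; device_type would select the
    real synchronization and timing hooks for CUDA vs. XPU."""

    def __init__(self, kernel_fn, device_type: str):
        assert device_type in ("cuda", "xpu")
        self.kernel_fn = kernel_fn
        self.device_type = device_type

    def _synchronize(self):
        # Real code would call torch.cuda.synchronize() or
        # torch.xpu.synchronize() depending on device_type.
        pass

    def benchmark(self, warmup: int = 2, iters: int = 10) -> float:
        for _ in range(warmup):
            self.kernel_fn()
        self._synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            self.kernel_fn()
        self._synchronize()
        return (time.perf_counter() - start) / iters
```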

5. Inductor Cutlass Configuration

Currently, all CUTLASS configuration options live under torch._inductor.config.cuda.
We plan to move them to torch._inductor.config.cutlass for clearer separation and better maintainability.
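One possible migration sketch that keeps the old config.cuda names working during a deprecation window. The option names here are illustrative, not the real Inductor knobs:

```python
import types
import warnings

# New home for the options (names are placeholders for illustration).
cutlass = types.SimpleNamespace(
    cutlass_dir="third_party/cutlass",
    cutlass_max_profiling_configs=4,
)


class _DeprecatedCudaNamespace:
    """Forwards reads of the old config.cuda.<opt> names to config.cutlass,
    emitting a DeprecationWarning so existing user code keeps working."""

    def __getattr__(self, name):
        warnings.warn(
            f"torch._inductor.config.cuda.{name} has moved to "
            f"torch._inductor.config.cutlass.{name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(cutlass, name)


cuda = _DeprecatedCudaNamespace()
```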

6. CUTLASS Path

Currently, the default CUTLASS path is third_party/cutlass.
We plan to add a new submodule, cutlass-sycl, and automatically select it as the default CUTLASS path when XPU is available.
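The submodule selection could be as simple as the following sketch (the path layout is assumed from the proposal, not from the actual build scripts):

```python
import os


def default_cutlass_dir(repo_root: str, xpu_available: bool) -> str:
    """Pick the CUTLASS submodule: cutlass-sycl when XPU is available,
    the upstream third_party/cutlass otherwise."""
    name = "cutlass-sycl" if xpu_available else "cutlass"
    return os.path.join(repo_root, "third_party", name)
```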

Implementation Plan

Test Plan

Generalize test/inductor/test_cutlass_backend.py and reuse the test cases for Intel GPU.
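One way to reuse the existing cases is to generate one test per device rather than hard-coding CUDA, sketched here with unittest; the device list and test body are placeholders for the real GEMM compile-and-check cases:

```python
import unittest

# Hypothetical device list after generalization.
DEVICES = ["cuda", "xpu"]


class TestCutlassBackend(unittest.TestCase):
    pass


def _make_gemm_test(device: str):
    def test(self):
        # Stand-in for compiling and checking a CUTLASS GEMM on `device`.
        self.assertIn(device, DEVICES)
    return test


# Attach one generated test method per device.
for _dev in DEVICES:
    setattr(TestCutlassBackend, f"test_gemm_{_dev}", _make_gemm_test(_dev))
```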

Alternatives

No response

Additional context

No response

cc @gujinghui @EikanWang @fengyuan14 @guangyey
