[RFC] Enable cutlass to support Intel GPU into PyTorch Inductor. #160175


🚀 The feature, motivation and pitch

Motivation

CUTLASS is an efficient template library for compute-intensive GPU operations such as GEMM and convolution.

We have enabled CUTLASS support for Intel GPUs in sycl-tla, which adds a SYCL backend while keeping the interface consistent with upstream CUTLASS. This enables the generation of high-performance GEMM kernels for Intel GPUs.

Accordingly, we propose generalizing the CUTLASS integration in Inductor to enable high-performance GEMM support on Intel GPUs, consistent with the existing NVIDIA CUTLASS integration.

Proposal

Based on the principles of maximizing reuse of the existing infrastructure and minimizing code changes, we propose the following six-part design:

1. Code generation

The diagram below shows how Inductor generates the CUTLASS kernel source code for a GEMM operation.

[Diagram: Inductor CUTLASS GEMM code-generation workflow]

The classes marked in red in the diagram above are the ones we intend to generalize to add Intel GPU support.

1.1 CUDATemplate

We propose decoupling CUTLASS into an independent component by making CUDATemplate a subclass of KernelTemplate, as is already done for Triton.
A device_type attribute is then added to the class to carry device-specific behavior while reusing the shared CUTLASS kernel-generation logic.

[Diagram: proposed CUTLASS template class hierarchy]
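To make the intent concrete, here is a minimal sketch of the proposed hierarchy. The class shapes, the device_type check, and the header strings are all illustrative assumptions, not the actual Inductor code:

```python
class KernelTemplate:
    """Shared base for Inductor kernel templates (Triton, CUTLASS, ...)."""

    def __init__(self, name: str):
        self.name = name

    def render(self) -> str:
        raise NotImplementedError


class CutlassTemplate(KernelTemplate):
    """CUTLASS GEMM template; device_type selects the device-specific pieces."""

    def __init__(self, name: str, device_type: str):
        super().__init__(name)
        assert device_type in ("cuda", "xpu")
        self.device_type = device_type

    def globals_header(self) -> str:
        # Device-specific preamble; the kernel-emission logic below it
        # would be shared between both devices.
        if self.device_type == "xpu":
            return "// SYCL-specific includes (sycl-tla CUTLASS)"
        return "// CUDA-specific includes (upstream CUTLASS)"

    def render(self) -> str:
        return self.globals_header() + "\n// shared CUTLASS GEMM emission"
```

The point of the split is that only globals_header-style hooks diverge per device, while render (the bulk of the generation logic) stays shared.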
1.2 CUDATemplateKernel, CUDATemplateCaller, CUDATemplateBuffer

Since most of the code here is reusable for Intel CUTLASS, we plan to generalize these classes into CutlassTemplateKernel, CutlassTemplateCaller, and CutlassTemplateBuffer.

2. Scheduling

Currently, CUDA uses CUDACombinedScheduling, which can handle both Triton and CUTLASS kernels, while Intel GPUs support only Triton.
We plan to rename CUDACombinedScheduling to CombinedScheduling and extend it to Intel GPUs.
CUDA and Intel GPUs share substantial common logic, including CUTLASS epilogue-fusion decisions, kernel definitions, and kernel generation from the CUTLASS template. We therefore propose generalizing CUDACPPScheduling into CUTLASSScheduling so that it can be shared between CUDA and XPU.

[Diagram: proposed scheduling class hierarchy]
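A minimal sketch of the proposed scheduling split follows. The class names mirror the proposal, but the dispatch predicate (is_cutlass_template) is a hypothetical stand-in for Inductor's real node classification:

```python
class TritonScheduling:
    """Stand-in for Inductor's Triton scheduling."""


class CUTLASSScheduling:
    """Shared CUTLASS scheduling (epilogue-fusion decisions, kernel
    definition and generation), parameterized by device."""

    def __init__(self, device_type: str):
        self.device_type = device_type


class CombinedScheduling:
    """Routes each node to Triton or CUTLASS scheduling for a given device."""

    def __init__(self, device_type: str):
        self._triton = TritonScheduling()
        self._cutlass = CUTLASSScheduling(device_type)

    def scheduling_for(self, node) -> object:
        # Hypothetical predicate: CUTLASS-template GEMM nodes go to CUTLASS,
        # everything else stays on the Triton path.
        if getattr(node, "is_cutlass_template", False):
            return self._cutlass
        return self._triton
```

Because CombinedScheduling only holds a device_type, the same class serves both CUDA and XPU; nothing in the routing logic is device-specific.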
3. Compile and cache

In codecache.py, CUDACodeCache currently manages and compiles the generated CUTLASS kernels. For XPU, the only difference lies in the compilation step.
We propose abstracting CUDACodeCache into a CUTLASSCodeCache base class, with CUDACodeCache and XPUCodeCache inheriting from it and implementing the device-specific compilation logic.

[Diagram: proposed CUTLASSCodeCache class hierarchy]
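The cache abstraction could be sketched roughly as below. The compiler tags (nvcc, icpx) are stand-ins for the real build invocations, and the in-memory dict stands in for Inductor's on-disk cache:

```python
class CUTLASSCodeCache:
    """Shared cache for compiled CUTLASS kernels; subclasses supply only
    the device-specific compile step."""

    cache: dict = {}

    @classmethod
    def load(cls, source: str):
        key = (cls.__name__, source)
        if key not in cls.cache:
            cls.cache[key] = cls._compile(source)
        return cls.cache[key]

    @classmethod
    def _compile(cls, source: str):
        raise NotImplementedError


class CUDACodeCache(CUTLASSCodeCache):
    @classmethod
    def _compile(cls, source: str):
        return f"nvcc:{len(source)}"  # stand-in for the real nvcc invocation


class XPUCodeCache(CUTLASSCodeCache):
    @classmethod
    def _compile(cls, source: str):
        return f"icpx:{len(source)}"  # stand-in for the SYCL compiler invocation
```

All cache management (keying, lookup, insertion) lives once in the base class; each device overrides only _compile.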

4. Run and benchmark

CUTLASS kernel benchmarking is currently implemented in CUDABenchmarkRequest, and XPU needs a similar implementation. We therefore plan to generalize CUDABenchmarkRequest into CutlassBenchmarkRequest and add a device_type attribute to support device-specific logic.
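A rough sketch of the generalized benchmark request, using wall-clock timing and a no-op _synchronize as stand-ins for the real device event timers and torch.cuda/torch.xpu synchronization:

```python
import time


class CutlassBenchmarkRequest:
    """Device-agnostic benchmark request; device_type would select the
    real synchronization and timing hooks for CUDA vs. XPU."""

    def __init__(self, kernel_fn, device_type: str):
        assert device_type in ("cuda", "xpu")
        self.kernel_fn = kernel_fn
        self.device_type = device_type

    def _synchronize(self):
        # Real code would call torch.cuda.synchronize() or
        # torch.xpu.synchronize() depending on device_type.
        pass

    def benchmark(self, warmup: int = 2, iters: int = 10) -> float:
        for _ in range(warmup):
            self.kernel_fn()
        self._synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            self.kernel_fn()
        self._synchronize()
        return (time.perf_counter() - start) / iters
```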

5. Inductor Cutlass Configuration

Currently, all CUTLASS configuration options live under torch._inductor.config.cuda.
We plan to move them to torch._inductor.config.cutlass for clearer separation and better maintainability.
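One possible migration sketch that keeps the old config.cuda names working during a deprecation window. The option names here are illustrative, not the real Inductor knobs:

```python
import types
import warnings

# New home for the options (names are placeholders for illustration).
cutlass = types.SimpleNamespace(
    cutlass_dir="third_party/cutlass",
    cutlass_max_profiling_configs=4,
)


class _DeprecatedCudaNamespace:
    """Forwards reads of the old config.cuda.<opt> names to config.cutlass,
    emitting a DeprecationWarning so existing user code keeps working."""

    def __getattr__(self, name):
        warnings.warn(
            f"torch._inductor.config.cuda.{name} has moved to "
            f"torch._inductor.config.cutlass.{name}",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(cutlass, name)


cuda = _DeprecatedCudaNamespace()
```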

6. CUTLASS Path

Currently, the default CUTLASS path is third_party/cutlass.
We plan to add a new submodule, cutlass-sycl, and automatically select it as the default CUTLASS path when XPU is available.
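The submodule selection could be as simple as the following sketch (the path layout is assumed from the proposal, not from the actual build scripts):

```python
import os


def default_cutlass_dir(repo_root: str, xpu_available: bool) -> str:
    """Pick the CUTLASS submodule: cutlass-sycl when XPU is available,
    the upstream third_party/cutlass otherwise."""
    name = "cutlass-sycl" if xpu_available else "cutlass"
    return os.path.join(repo_root, "third_party", name)
```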

Implementation Plan

Test Plan

Generalize test/inductor/test_cutlass_backend.py and reuse the test cases for Intel GPU.
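One way to reuse the existing cases is to generate one test per device rather than hard-coding CUDA, sketched here with unittest; the device list and test body are placeholders for the real GEMM compile-and-check cases:

```python
import unittest

# Hypothetical device list after generalization.
DEVICES = ["cuda", "xpu"]


class TestCutlassBackend(unittest.TestCase):
    pass


def _make_gemm_test(device: str):
    def test(self):
        # Stand-in for compiling and checking a CUTLASS GEMM on `device`.
        self.assertIn(device, DEVICES)
    return test


# Attach one generated test method per device.
for _dev in DEVICES:
    setattr(TestCutlassBackend, f"test_gemm_{_dev}", _make_gemm_test(_dev))
```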

Alternatives

No response

Additional context

No response

cc @gujinghui @EikanWang @fengyuan14 @guangyey
