
[RFC] Add Cpp Template for GEMM related ops via max-autotune for Inductor CPU #125683

@jgong5

Description

🚀 The feature, motivation and pitch

Motivation

torch.compile provides the "max-autotune" mode. For CUDA, the Inductor backend leverages online benchmark results to select the best-performing kernels from various options, including ATen kernels and template-based kernels implemented with Triton and CUTLASS. These kernels primarily accelerate GEMM-related operations. For CPU, however, this "max-autotune" mechanism is not yet supported, and only ATen kernels are currently utilized.

This RFC proposes the introduction of similar template-based code generation support for GEMM-related operations on CPUs, implemented with C++ and activated through the "max-autotune" mode of torch.compile. By utilizing the autotuning mechanism of Inductor, users are expected to achieve enhanced performance for GEMM-related operations beyond the capabilities of ATen-based implementations.

Approaches

At a high level, the autotuning and template infrastructure from CUDA is mature enough to be adapted for CPU usage. We plan to extend the existing autotuning code to support CPU and develop the C++ template abstraction by referencing the CUTLASS template counterpart. Additionally, CPU-specific challenges need to be addressed, such as thread decomposition, data layout arrangement (e.g., weight prepacking), and data blocking at various memory hierarchy levels for optimal performance. Based on our previous experience, we plan to employ a two-level abstraction to implement GEMMs: an outer loop that manages thread decomposition and cache blocking, and an inner micro-kernel that handles register blocking and various CPU architecture-specific optimizations. This approach allows for flexible performance tuning at multiple levels and direct utilization of low-level CPU hardware acceleration.
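As a rough illustration of the two-level scheme, the sketch below (hypothetical names and block sizes, not actual Inductor-generated code) shows an outer loop nest performing cache blocking that calls into a micro-kernel. A real micro-kernel would do register blocking with architecture-specific intrinsics (AVX2, AVX-512, AMX, ...), and the outer `mc`/`nc` loops are where thread decomposition would apply; here everything is plain scalar code for clarity:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical cache-block sizes; in practice these would be autotuned per CPU.
constexpr std::size_t Mc = 64, Nc = 64, Kc = 64;

// Micro-kernel: accumulates C += A * B on one cache block with the given
// leading dimensions. A real MicroGemm would perform register blocking and
// use SIMD/AMX intrinsics; this scalar version only shows the contract.
void micro_gemm(const float* A, const float* B, float* C,
                std::size_t M, std::size_t N, std::size_t K,
                std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = C[i * ldc + j];
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = acc;
        }
}

// Outer loop: cache blocking over M/N/K. The mc (and/or nc) loop is where
// the work would be decomposed across threads. C must be zero-initialized.
void gemm(const float* A, const float* B, float* C,
          std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t mc = 0; mc < M; mc += Mc)
        for (std::size_t nc = 0; nc < N; nc += Nc)
            for (std::size_t kc = 0; kc < K; kc += Kc)
                micro_gemm(A + mc * K + kc, B + kc * N + nc,
                           C + mc * N + nc,
                           std::min(Mc, M - mc), std::min(Nc, N - nc),
                           std::min(Kc, K - kc), K, N, N);
}
```

Splitting the code this way lets the autotuner vary cache-block sizes and thread decomposition in the outer loop independently of the register-blocked micro-kernel.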

Key Components

  1. Autotune Infrastructure for CPU: Generalizing and extending BenchmarkRequest with CPU support and Cpp module loader.
  2. Cpp Template Infrastructure: Template abstractions similar to those of the CUTLASS template, such as CppTemplate, CppTemplateKernel, and CppTemplateBuffer. The MicroGemm micro-kernel abstraction can be used by Cpp GEMM templates.
  3. Micro Kernel Templates: Responsible for register blocking, instruction selection, and other CPU architecture-specific optimizations.
  4. Cpp Templates: Including various GEMM-related Cpp templates (single GEMM, weight-only quantized GEMM, attention, MLP, etc.) that are responsible for thread decomposition, cache blocking, and outer-loop scheduling that calls into the micro-kernels. Packed GEMM (pre-packed weights) support is included.
  5. Epilogue Fusion: This would involve support from Cpp templates, micro-kernel templates, and Cpp kernels.
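To make the epilogue-fusion idea concrete, here is a minimal sketch (illustrative names, not Inductor's actual interface): the GEMM template is parameterized by an epilogue functor applied to each output element right after accumulation, so pointwise post-ops such as bias-add or ReLU fuse into the GEMM loop nest rather than requiring a second pass over the output:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical fused-epilogue GEMM: C[i,j] = epilogue(sum_k A[i,k]*B[k,j], i, j).
// The Epilogue functor stands in for codegen'd pointwise post-ops; names are
// illustrative and do not reflect the real Cpp template API.
template <typename Epilogue>
void gemm_with_epilogue(const float* A, const float* B, float* C,
                        std::size_t M, std::size_t N, std::size_t K,
                        Epilogue epilogue) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = epilogue(acc, i, j);  // fused post-op, no extra pass
        }
}
```

For example, fusing a ReLU epilogue would look like `gemm_with_epilogue(A, B, C, M, N, K, [](float v, std::size_t, std::size_t) { return std::max(v, 0.0f); });`. In the real design this cooperates with the micro-kernel templates, which must expose the accumulator tile for the epilogue to be applied before the store.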

Task Breakdowns

  1. Micro Kernel Templates
  2. Cpp Template
  3. Performance Tuning (ongoing work)

Alternatives

No response

Additional context

No response

cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire

Metadata

Labels

module: inductor · oncall: cpu inductor (CPU Inductor issues for Intel team to triage) · oncall: pt2 · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
