🚀 The feature, motivation and pitch
Motivation
torch.compile provides the "max-autotune" mode. For CUDA, the inductor backend leverages online benchmark results to select the best-performing kernels from various options, including ATen kernels and template-based kernels implemented with Triton and CUTLASS. These kernels are primarily designed to accelerate GEMM-related operations. However, for CPU, this "max-autotune" mechanism is not yet supported, and only ATen kernels are currently utilized.
This RFC proposes introducing similar template-based code generation support for GEMM-related operations on CPU, implemented in C++ and activated through the "max-autotune" mode of torch.compile. By leveraging Inductor's autotuning mechanism, users can expect performance for GEMM-related operations beyond what the ATen-based implementations provide.
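As a usage sketch (the toy model and shapes here are placeholders), the mode would be enabled the same way it is on CUDA today:

```python
import torch

# Toy model whose Linear layers hit the GEMM paths discussed in this RFC.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval()

# "max-autotune" asks Inductor to benchmark candidate implementations
# (ATen kernels vs. generated template kernels) and keep the fastest.
compiled = torch.compile(model, mode="max-autotune")

with torch.no_grad():
    out = compiled(torch.randn(8, 1024))
```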
Approaches
At a high level, the autotuning and template infrastructure from CUDA is mature enough to be adapted for CPU usage. We plan to extend the existing autotuning code to support CPU and to develop the C++ template abstraction by referencing its CUTLASS counterpart. Additionally, CPU-specific challenges such as thread decomposition, data layout arrangement (e.g., weight prepacking), and data blocking at various levels of the memory hierarchy need to be addressed for optimal performance. Based on our previous experience, we employ a two-level abstraction to implement GEMMs: an outer loop that manages thread decomposition and cache blocking, and an inner micro-kernel that handles register blocking and other CPU-architecture-specific optimizations. This approach allows for flexible performance tuning at multiple levels and direct use of low-level CPU hardware acceleration.
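To make the two-level structure concrete, here is a schematic NumPy sketch rather than the actual generated C++; the block sizes and the micro_gemm helper are illustrative placeholders:

```python
import numpy as np

MR, NR = 4, 16            # register-block sizes owned by the micro-kernel
MC, NC, KC = 64, 64, 256  # cache-block sizes owned by the outer loops

def micro_gemm(a, b, c):
    # Inner level: in the generated C++ this is where register blocking
    # and ISA-specific code (e.g., AVX-512 or AMX) would live.
    c += a @ b

def gemm(A, B):
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    # Outer level: cache blocking; in the real template the (mc, nc)
    # iterations would also be decomposed across threads.
    for mc in range(0, M, MC):
        for nc in range(0, N, NC):
            for kc in range(0, K, KC):
                for mr in range(mc, min(mc + MC, M), MR):
                    for nr in range(nc, min(nc + NC, N), NR):
                        micro_gemm(
                            A[mr:mr + MR, kc:kc + KC],
                            B[kc:kc + KC, nr:nr + NR],
                            C[mr:mr + MR, nr:nr + NR],
                        )
    return C
```

The weight prepacking mentioned above fits this picture: B would be re-laid-out ahead of time into the blocked order the micro-kernel consumes, so the innermost loads are contiguous.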
Key Components
- Autotune Infrastructure for CPU: Generalizing and extending BenchmarkRequest with CPU support and adding a Cpp module loader.
- Cpp Template Infrastructure: Introducing template abstractions similar to their CUTLASS counterparts, such as CppTemplate, CppTemplateKernel, and CppTemplateBuffer. The MicroGemm micro-kernel abstraction can be used by the Cpp GEMM templates.
- Micro Kernel Templates: Responsible for register blocking, instruction selection, and other CPU architecture-specific optimizations.
- Cpp Templates: Various GEMM-related Cpp templates (single GEMM, weight-only quantized GEMM, attention, MLP, etc.) responsible for thread decomposition, cache blocking, and the outer-loop scheduling that calls into the micro-kernels; packed GEMM is also supported.
- Epilogue Fusion: Requires support from the Cpp templates, micro-kernel templates, and Cpp kernels; a sketch of the idea follows this list.
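The idea behind epilogue fusion is to apply the pointwise epilogue to each output tile while it is still in cache, instead of in a separate full pass over C. A minimal NumPy sketch (the tile size, bias, and ReLU epilogue are illustrative choices):

```python
import numpy as np

def gemm_with_fused_epilogue(A, B, bias, epilogue=lambda x: np.maximum(x, 0.0)):
    M, N = A.shape[0], B.shape[1]
    C = np.empty((M, N), dtype=A.dtype)
    T = 64  # illustrative tile size
    for m in range(0, M, T):
        for n in range(0, N, T):
            acc = A[m:m + T] @ B[:, n:n + T]  # tile GEMM
            # Fused epilogue: bias add + activation on the hot tile,
            # rather than a second pass over all of C afterwards.
            C[m:m + T, n:n + T] = epilogue(acc + bias[n:n + T])
    return C
```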
Task Breakdowns
- 1. Autotune Infrastructure for CPU ([inductor] autotune benchmark support for cpu #125159)
- 2. Cpp Template Infrastructure ([inductor][cpp] GEMM template (infra and fp32) #124021)
- 3. Micro Kernel Templates
- 3.1 General FP32/BF16/FP16 MicroGemm based on ATen VEC ([inductor][cpp] GEMM template (infra and fp32) #124021 etc.)
- 3.2 BF16 AMX MicroGemm for x86 ([inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion #126068 etc.)
- 3.3 FP16 AMX MicroGemm for x86
- 3.4 INT8 AMX MicroGemm for x86 ([Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM #129220)
- 3.5 INT8 Weight-quantized MicroGemm for x86 (Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue #131887)
- 3.6 INT4 Weight-quantized MicroGemm for x86
- 3.7 MicroGemms for ARM
- 4. Cpp Templates
- 4.1 Single GEMM, packed ([inductor][cpp] GEMM template (infra and fp32) #124021, [RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 #128472, [inductor][cpp] support bf16/fp16 gemm template epilogue fusion #126545, [inductor][cpp] epilogue support for gemm template #126019, [Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output #128825, [Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op #129048, [Inductor][CPP] Enable Quantized Linear GEMM Template with Binary Fusion #129103, [Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM #129220, [inductor][cpp][gemm] optimize arbitrary N in packed gemm template #130690)
- 4.2 Single GEMM, unpacked
- 4.3 BMM ([inductor][cpp] Add BMM kernel template for autotuning #129772)
- 4.4 WOQ GEMM (Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue #131887 etc.)
- 4.5 SDPA
- 4.6 MLP
- 5. Epilogue Fusion ([inductor][cpp] epilogue support for gemm template #126019, [inductor][cpp] support bf16/fp16 gemm template epilogue fusion #126545 etc.)
- 6. Performance Tuning (ongoing work)
- 6.1 Thread blocking optimization ([inductor][cpp][gemm] improve thread blocking heuristics #131024, [inductor][cpp][gemm] support k slicing for static shapes #130821 etc.; see the sketch after this list)
- 6.2 Cache blocking optimization ([inductor] [cpp] improve cache blocking with CPU info #129348, [inductor][cpp][gemm] improve large bs perf with better cache blocking #132729, [inductor] [cpp] use non-temporal tile load for A #129455 etc.)
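For items 6.1 and 6.2, the following is a hypothetical illustration of the kind of heuristic involved, not the actual implementation: factor the thread count into an (M, N) grid, and when there is not enough M/N parallelism, spend the spare factor on K slicing (which then requires a reduction across the K slices):

```python
def decompose_threads(num_threads, m_blocks, n_blocks):
    """Pick a (t_m, t_n, t_k) thread grid. Illustrative heuristic only."""
    best, best_util = (num_threads, 1, 1), -1.0
    for t_m in range(1, num_threads + 1):
        if num_threads % t_m:
            continue
        for t_n in range(1, num_threads // t_m + 1):
            if (num_threads // t_m) % t_n:
                continue
            t_k = num_threads // (t_m * t_n)
            # Fraction of the grid that gets at least one block, with a
            # mild penalty for k-slicing since it needs a reduction step.
            util = (min(m_blocks / t_m, 1.0) * min(n_blocks / t_n, 1.0)
                    / (1.0 + 0.1 * (t_k - 1)))
            if util > best_util:
                best_util, best = util, (t_m, t_n, t_k)
    return best

# E.g., with 56 threads but only 4x2 cache blocks, the heuristic keeps a
# 4x2 grid over M/N and slices K seven ways: prints (4, 2, 7).
print(decompose_threads(56, 4, 2))
```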
Alternatives
No response
Additional context
No response
cc @ezyang @msaroufim @bdhirsh @anijain2305 @chauhang @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire