Background
We are upstreaming support for Intel GPU ([RFC] Intel GPU Upstreaming · Issue #114723 · pytorch/pytorch (github.com)). As a first step, targeting recent popular and typical DL workloads, we plan to enable the Intel GPU backend via `torch.compile(backend="inductor")` on HuggingFace, TIMM, and TorchBench. The main PRs will therefore cover the Intel GPU runtime, the Intel GPU backend for TorchInductor (Triton), and the ATen fallbacks of TorchInductor (oneDNN integration and a limited number of SYCL kernels).
Motivation
Where the RFC fits:

The RFC focuses on introducing SYCL kernels into PyTorch. In addition to a limited number of SYCL kernels, we will also introduce the build system for SYCL. Summary of the RFC:
- Build system for SYCL. The design of the setup is aligned with the existing build system for CUDA.
- A limited number of SYCL kernels will be added as TorchInductor fallbacks. We collected the TorchInductor fallback ATen operators and their corresponding kernels by running the HuggingFace, TIMM, and TorchBench models. A minimal set of ATen operators (about 8 SYCL kernel templates) will be added, covering elementwise, reduce, random, concat, scan, indexing, sort, and arange.
Feature Details
Build system for SYCL
Building and running SYCL kernels depends on:
- Intel Graphics driver installation.
- Intel DPCPP compiler/compiler-rt toolkit installation (Intel's SYCL compiler and runtime).
The design of the SYCL build system is aligned with the existing CUDA build system in PyTorch. However, since this RFC targets SYCL kernel compilation only, we will initially include only a subset of the components present in the CUDA build system; other components will be added gradually as the Intel GPU backend needs them. The SYCL build system in PyTorch will include:
- Compiler toolchain version detection.
  - Verify the SYCL compiler version and the SYCL headers version; the two versions should match.
  - Define macros for source code to isolate incompatible SYCL host/device APIs or intrinsics across different compiler versions (see the sketch after this list).
- Host compiler setup.
  - ABI compatibility. At the first stage, PyTorch built with the Intel GPU backend (which will use `USE_XPU` as the build flag) has to be built with ABI=1, because ABI=0 is not yet supported by the DPCPP runtime library. We will uplift the DPCPP runtime library in time and drop this compilation option to align with other PyTorch build targets once the DPCPP runtime library supports ABI=0 (planned timeline: before the PyTorch 2.5 branch cut).
  - Host compiler requirement. We will target GCC as the only third-party host compiler before PyTorch 2.5, aligned with (and not adding restrictions beyond) the existing PyTorch-supported compiler versions.
- Architecture detection for AOT compilation.
  - Source builds: enable AOT build for Intel Data Center GPU Max 1550/1450/1100 by default.
  - Release binaries: enable AOT build for Intel Data Center GPU Max 1550/1450/1100 only.
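To make the toolchain-version isolation above concrete, here is a minimal C++ sketch of a compiler-version guard. It assumes the DPCPP toolchain exposes a `__SYCL_COMPILER_VERSION` macro (a date-style integer); the `TORCH_SYCL_COMPILER_AT_LEAST` helper and the version threshold are purely illustrative, not the actual PyTorch build-system macros.

```cpp
// Hypothetical version-guard helper; names and threshold are illustrative only.
#include <sycl/sycl.hpp>
#include <string>

#ifndef __SYCL_COMPILER_VERSION
#define __SYCL_COMPILER_VERSION 0
#endif

#define TORCH_SYCL_COMPILER_AT_LEAST(version) \
  (__SYCL_COMPILER_VERSION >= (version))

std::string device_name(sycl::queue& q) {
#if TORCH_SYCL_COMPILER_AT_LEAST(20230100)
  // Path for newer DPCPP releases; newer host/device APIs could be used here.
  return q.get_device().get_info<sycl::info::device::name>();
#else
  // Fallback path for older toolchains.
  return "unknown";
#endif
}
```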
SYCL kernels for TorchInductor ATen fallback
The additions come in three parts:
- Add ATen operator registrations for the PyTorch XPU dispatch key (see the registration sketch after this list).
  - The list of ATen operators was collected by running the HuggingFace, TIMM, and TorchBench models with `torch.compile(backend="inductor")`.
  - Registration strategy. There are two common ways to register a backend for ATen operators:
    - Mark the backend for the operators in `native_functions.yaml`; the registration source code is then generated automatically.
    - Do not use `native_functions.yaml`; instead, write the registration code with the ATen registration API (e.g. `TORCH_LIBRARY_IMPL(aten, XPU, m)`) in the operators' implementation source files.
  - To minimize changes and impact at the current stage, we propose the second approach.
- Add the corresponding SYCL kernel implementations. Multiple ATen operators can share the same SYCL kernel, and kernel generalization aligns with the CUDA kernels (see the elementwise sketch after this list). Generalization covers:
  - Operator semantics (e.g. broadcast and transposed inputs in elementwise).
  - Data types (the PyTorch-defined data types are required).
  - Function generalization (e.g. sum/mean/var share the same reduce backbone).
- Add unit test cases accordingly, reusing existing test cases as-is.
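As a rough illustration of the second registration strategy, the sketch below registers a hypothetical XPU implementation of `aten::relu` through `TORCH_LIBRARY_IMPL`. The kernel body is a placeholder; the real implementations would launch the SYCL kernels listed below.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical XPU implementation of aten::relu. The body is a placeholder;
// a real kernel would launch a SYCL elementwise kernel on the XPU device.
at::Tensor relu_xpu(const at::Tensor& self) {
  return at::clamp_min(self, 0);
}

// Register the implementation for the XPU dispatch key without touching
// native_functions.yaml (the second strategy above).
TORCH_LIBRARY_IMPL(aten, XPU, m) {
  m.impl("relu", relu_xpu);
}
```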
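And here is a minimal sketch of the kind of elementwise backbone a shared SYCL kernel template could use, generalized over the scalar type and the operator functor. The names and structure are illustrative only and do not reflect the actual kernels to be upstreamed.

```cpp
#include <sycl/sycl.hpp>

// Shared "backbone" template: generalized over the scalar type (scalar_t) and
// the operator semantics (the functor op), mirroring how the CUDA elementwise
// loops are organized. Names are illustrative only.
template <typename scalar_t, typename func_t>
void launch_unary_elementwise(
    sycl::queue& q, const scalar_t* in, scalar_t* out, size_t n, func_t op) {
  // in/out are assumed to be USM pointers accessible on the device.
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    out[i] = op(in[i]);
  }).wait();
}

// Example instantiation: an abs-like unary op on float data.
void run_abs_example(sycl::queue& q, const float* in, float* out, size_t n) {
  launch_unary_elementwise(q, in, out, n,
                           [](float x) { return x < 0.0f ? -x : x; });
}
```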
Each subsequent PR targets one SYCL kernel and accordingly includes the ATen operator registration and unit test cases. Here is the 'SYCL kernels - ATen operators' correspondence table:
TODO PR list [PR num]
- Build system for SYCL: [ ]
- Elementwise (Loops): [ ]
- Reduce: [ ]
- Random: [ ]
- Indexing: [ ]
- Concat: [ ]
- Scan: [ ]
- Sort: [ ]
- Arange: [ ]
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler