Background
We are upstreaming support for Intel GPU ([RFC] Intel GPU Upstreaming · Issue #114723 · pytorch/pytorch (github.com)). As a first step, targeting recent popular and typical DL workloads, we plan to enable the Intel GPU backend via `torch.compile(backend="inductor")` on HuggingFace, TIMM, and TorchBench. The main PRs will therefore cover the Intel GPU runtime, the Intel GPU backend for TorchInductor (Triton), and the ATen fallbacks of TorchInductor (oneDNN integration and a limited number of SYCL kernels).
Motivation
Where the RFC fits:

The RFC focuses on introducing SYCL kernels into PyTorch. In addition to a limited number of SYCL kernels, we will also introduce the build system for SYCL. Summary of the RFC:
- Build system for SYCL. The design of the setup is aligned with the existing build system for CUDA.
- A limited number of SYCL kernels will be added as TorchInductor fallbacks. We collected the TorchInductor fallback ATen operators and their corresponding kernels by running the HuggingFace, TIMM, and TorchBench models. A minimal set of ATen operators (about 8 SYCL kernel templates) will be added, covering elementwise, reduce, random, concat, scan, indexing, sort, and arange.
Feature Details
Build system for SYCL
Building and running SYCL kernels depends on:
- Intel Graphics driver installation.
- Intel DPCPP compiler/compiler-rt toolkit installation (Intel's SYCL compiler and runtime).
The design of the SYCL build system is aligned with the existing CUDA build system in PyTorch. However, since this RFC targets SYCL kernel compilation only, we will initially include only a subset of the components present in the CUDA build system; other components will be added gradually as the Intel GPU backend needs them. The SYCL build system in PyTorch will include:
- Compiler toolchain version detection.
  - Verify the SYCL compiler version and the SYCL headers version; the two versions should match.
  - Define macros for source code to isolate incompatible SYCL host/device APIs or intrinsics across different compiler versions (see the sketch after this list).
- Host compiler setup.
  - ABI compatibility. At the first stage, PyTorch built with the Intel GPU backend (which will use `USE_XPU` as the build flag) has to be built with ABI=1, because ABI=0 is not yet supported by the DPCPP runtime library. We will uplift the DPCPP runtime library in time and drop this compilation option to align with other PyTorch build targets once the DPCPP runtime library supports ABI=0 (planned timeline: before the PyTorch 2.5 branch cut).
  - Host compiler requirement. We will target GCC as the only third-party host compiler before PyTorch 2.5, aligned with (and not adding restrictions beyond) the existing PyTorch-supported compiler versions.
- Architecture detection for AOT compilation.
  - Source builds: enable AOT build for Intel Data Center GPU Max 1550/1450/1100 by default.
  - Release binaries: enable AOT build for Intel Data Center GPU Max 1550/1450/1100 only.
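To make the toolchain-version isolation above concrete, here is a minimal C++ sketch of a compiler-version guard. It assumes the DPCPP toolchain exposes a `__SYCL_COMPILER_VERSION` macro (a date-style integer); the `TORCH_SYCL_COMPILER_AT_LEAST` helper and the version threshold are purely illustrative, not the actual PyTorch build-system macros.

```cpp
// Hypothetical version-guard helper; names and threshold are illustrative only.
#include <sycl/sycl.hpp>
#include <string>

#ifndef __SYCL_COMPILER_VERSION
#define __SYCL_COMPILER_VERSION 0
#endif

#define TORCH_SYCL_COMPILER_AT_LEAST(version) \
  (__SYCL_COMPILER_VERSION >= (version))

std::string device_name(sycl::queue& q) {
#if TORCH_SYCL_COMPILER_AT_LEAST(20230100)
  // Path for newer DPCPP releases; newer host/device APIs could be used here.
  return q.get_device().get_info<sycl::info::device::name>();
#else
  // Fallback path for older toolchains.
  return "unknown";
#endif
}
```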
SYCL kernels for TorchInductor ATen fallback
The additions come in three parts:
- Add ATen operator registrations for the PyTorch XPU dispatch key (see the registration sketch after this list).
  - The list of ATen operators was collected by running the HuggingFace, TIMM, and TorchBench models with `torch.compile(backend="inductor")`.
  - Registration strategy. There are two common ways to register a backend for ATen operators:
    - Mark the backend for the operators in `native_functions.yaml`; the registration source code is then generated automatically.
    - Do not use `native_functions.yaml`; instead, write the registration code with the ATen registration API (e.g. `TORCH_LIBRARY_IMPL(aten, XPU, m)`) in the operators' implementation source files.
  - To minimize changes and impact at the current stage, we propose the second approach.
- Add the corresponding SYCL kernel implementations. Multiple ATen operators can share the same SYCL kernel, and kernel generalization aligns with the CUDA kernels (see the elementwise sketch after this list). Generalization covers:
  - Operator semantics (e.g. broadcast and transposed inputs in elementwise).
  - Data types (the PyTorch-defined data types are required).
  - Function generalization (e.g. sum/mean/var share the same reduce backbone).
- Add unit test cases accordingly, reusing existing test cases as-is.
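As a rough illustration of the second registration strategy, the sketch below registers a hypothetical XPU implementation of `aten::relu` through `TORCH_LIBRARY_IMPL`. The kernel body is a placeholder; the real implementations would launch the SYCL kernels listed below.

```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical XPU implementation of aten::relu. The body is a placeholder;
// a real kernel would launch a SYCL elementwise kernel on the XPU device.
at::Tensor relu_xpu(const at::Tensor& self) {
  return at::clamp_min(self, 0);
}

// Register the implementation for the XPU dispatch key without touching
// native_functions.yaml (the second strategy above).
TORCH_LIBRARY_IMPL(aten, XPU, m) {
  m.impl("relu", relu_xpu);
}
```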
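And here is a minimal sketch of the kind of elementwise backbone a shared SYCL kernel template could use, generalized over the scalar type and the operator functor. The names and structure are illustrative only and do not reflect the actual kernels to be upstreamed.

```cpp
#include <sycl/sycl.hpp>

// Shared "backbone" template: generalized over the scalar type (scalar_t) and
// the operator semantics (the functor op), mirroring how the CUDA elementwise
// loops are organized. Names are illustrative only.
template <typename scalar_t, typename func_t>
void launch_unary_elementwise(
    sycl::queue& q, const scalar_t* in, scalar_t* out, size_t n, func_t op) {
  // in/out are assumed to be USM pointers accessible on the device.
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    out[i] = op(in[i]);
  }).wait();
}

// Example instantiation: an abs-like unary op on float data.
void run_abs_example(sycl::queue& q, const float* in, float* out, size_t n) {
  launch_unary_elementwise(q, in, out, n,
                           [](float x) { return x < 0.0f ? -x : x; });
}
```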
Each subsequent PR targets one SYCL kernel and accordingly includes the ATen operator registration and unit test cases. Here is the 'SYCL kernels - ATen operators' correspondence table:
TODO PR list [PR num]
- Build system for SYCL: [ ]
- Elementwise (Loops): [ ]
- Reduce: [ ]
- Random: [ ]
- Indexing: [ ]
- Concat: [ ]
- Scan: [ ]
- Sort: [ ]
- Arange: [ ]
cc @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler