[inductor][cpp] BF16 AMX micro-gemm support #127195
Conversation
@jansel Thanks for the comment. The existing UTs already cover this well when they run on CPUs that support AMX. I have now added a new one to make sure the AMX micro-gemm is selected when the CPU supports it. Please review.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Thanks @xwang233 for catching this; it looks like this is causing the nightly build for ARM to fail since 6/22: https://github.com/pytorch/pytorch/actions/runs/9623875645/job/26546874031.
This PR fixes the build error on s390x introduced by #127195. The build fails because `SYS_arch_prctl` is not defined on s390x. The following is the log of the failing build:

```
...
[792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o
[793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB ... -c /pytorch/aten/src/ATen/cpu/Utils.cpp
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()':
/pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'?
   60 |   long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      |                     ^~~~~~~~~~~~~~
      |                     SYS_prctl
[794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o
...
[802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-2.5.0a0+git94dc325
-- Building version 2.5.0a0+git94dc325
cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch
cmake --build . --target install --config Release
Build step 'Execute shell' marked build as failure
...
```

Pull Request resolved: #129326
Approved by: https://github.com/Skylion007, https://github.com/eqy
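For context, the failing line is the x86_64-only `arch_prctl` syscall that asks the Linux kernel for permission to use the AMX tile-data register state. A minimal sketch of such an initialization on x86_64 Linux (not the exact PyTorch code; the constants are the Linux UAPI values from `asm/prctl.h`, duplicated here for illustration):

```cpp
#include <sys/syscall.h>
#include <unistd.h>

// Linux UAPI constants (see asm/prctl.h and the kernel's x86 XSTATE docs).
#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
#define XFEATURE_XTILEDATA 18

// Ask the kernel to enable the AMX tile-data state for this process;
// returns true if permission was granted.
bool init_amx() {
  // Request the dynamically enabled XTILEDATA feature.
  if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA) != 0) {
    return false;
  }
  // Confirm XTILEDATA is now in the permitted feature bitmask.
  unsigned long bitmask = 0;
  if (syscall(SYS_arch_prctl, ARCH_GET_XCOMP_PERM, &bitmask) != 0) {
    return false;
  }
  return (bitmask & (1UL << XFEATURE_XTILEDATA)) != 0;
}
```

`SYS_arch_prctl` only exists on x86_64, which is exactly why this translation unit stops compiling on s390x and aarch64.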
aarch64 CI builds are also failing. Instead of adding `!defined(...)` for every variant of ARM, could someone please guard the x86_64 AMX-specific changes with an appropriate macro?
Trying to mitigate the aarch64 and s390 nightly failures as per this comment: #127195 (comment)
Fixes #129443
Pull Request resolved: #129479
Approved by: https://github.com/nWEIdia, https://github.com/malfet
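A hedged sketch of that kind of guard (illustrative, not necessarily the exact change that landed): compile the syscall-based path only on x86_64 Linux and provide a stub elsewhere.

```cpp
// Guard the AMX initialization so that non-x86_64 builds (aarch64, s390x)
// never reference the x86_64-only SYS_arch_prctl.
#if defined(__linux__) && defined(__x86_64__)

#include <sys/syscall.h>
#include <unistd.h>

bool init_amx() {
  // arch_prctl-based permission request, as sketched above.
  return syscall(SYS_arch_prctl, 0x1023 /* ARCH_REQ_XCOMP_PERM */,
                 18 /* XFEATURE_XTILEDATA */) == 0;
}

#else

// AMX does not exist on this platform; report it as unavailable.
bool init_amx() {
  return false;
}

#endif
```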
Stack from ghstack (oldest at bottom):
This PR adds an intrinsics-based micro-gemm for BF16 using the Advanced Matrix Extensions (AMX) instructions available on 4th and 5th generation Intel Xeon processors. A compilation check is added to `codecache.py` to verify that the compiler supports these instructions. Also, since AMX requires requesting permission from the Linux kernel to use its extra (tile) register state, an initialization function is added to do that, triggered via `codecache.py`.

Performance speedups of >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:
[Speedup tables: static shapes (single-threaded and multi-threaded) and dynamic shapes (single-threaded and multi-threaded)]
No perf regressions.
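To make the description above concrete, here is a minimal, hypothetical C++ sketch of a single AMX BF16 tile multiply (illustrative only, not the code inductor generates): one `_tile_dpbf16ps` accumulates a 16x32 BF16 A-tile against a VNNI-packed B-tile into a 16x16 FP32 C-tile. It assumes a compiler supporting `-mamx-tile -mamx-bf16` and a successful kernel permission request as sketched earlier.

```cpp
#include <immintrin.h>
#include <cstdint>

// Tile configuration block for LDTILECFG; 64 bytes, 64-byte aligned.
struct alignas(64) TileConfig {
  uint8_t palette_id;    // 1 selects the standard tile palette
  uint8_t start_row;     // restart row after an interrupt; 0 to start fresh
  uint8_t reserved[14];
  uint16_t colsb[16];    // bytes per row for each tile register
  uint8_t rows[16];      // number of rows for each tile register
};

// C(16x16, fp32) += A(16x32, bf16) * B(32x16, bf16, VNNI-packed).
// A and B rows are 64 bytes each; B must already be packed in the
// pairs-of-rows (VNNI) layout that tdpbf16ps consumes.
void amx_bf16_tile_gemm(const uint16_t* A, const uint16_t* B, float* C) {
  TileConfig cfg{};
  cfg.palette_id = 1;
  cfg.rows[0] = 16; cfg.colsb[0] = 64;  // tmm0: C, 16 rows x 16 fp32
  cfg.rows[1] = 16; cfg.colsb[1] = 64;  // tmm1: A, 16 rows x 32 bf16
  cfg.rows[2] = 16; cfg.colsb[2] = 64;  // tmm2: B, VNNI-packed bf16
  _tile_loadconfig(&cfg);

  _tile_loadd(1, A, 64);    // load A with a 64-byte row stride
  _tile_loadd(2, B, 64);    // load pre-packed B
  _tile_zero(0);            // zero the fp32 accumulator tile
  _tile_dpbf16ps(0, 1, 2);  // C += A * B (bf16 pairs dotted into fp32)
  _tile_stored(0, C, 64);   // store the 16x16 fp32 result

  _tile_release();          // return tile state to the init state
}
```

A production micro-kernel would keep the accumulator tiles resident across the K loop and stream A/B tiles through the remaining tile registers, rather than reconfiguring per call as this sketch does.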
cc @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang