@jgong5 jgong5 commented May 26, 2024

Stack from ghstack (oldest at bottom):

This PR adds an intrinsics-based micro-gemm for BF16 using the Advanced Matrix Extensions (AMX) instructions available on 4th and 5th generation Intel Xeon processors. A compilation check is added to codecache.py to verify that the compiler supports these instructions. Also, since AMX requires asking the Linux kernel to enable its extra register state, an initialization function is added for that purpose and is triggered via codecache.py.

Models with a performance speedup of >=10% under BF16 AMP, max_autotune vs. no autotune, measured on an Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes
Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions.

cc @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented May 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127195

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit d678025 with merge base 9dd8f8c:

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 27, 2024
jgong5 pushed a commit that referenced this pull request May 27, 2024
ghstack-source-id: a6ce91f
Pull Request resolved: #127195
jgong5 pushed a commit that referenced this pull request May 28, 2024
ghstack-source-id: 5591275
Pull Request resolved: #127195
@jgong5 jgong5 changed the title [inductor][cpp] AMX micro-gemm support [inductor][cpp] BF16 AMX micro-gemm support May 28, 2024
jgong5 pushed a commit that referenced this pull request May 28, 2024
ghstack-source-id: 5c0a0ba
Pull Request resolved: #127195
@jgong5 jgong5 marked this pull request as ready for review May 28, 2024 15:40
@jgong5 jgong5 added the topic: not user facing topic category label May 28, 2024
@jgong5 jgong5 requested a review from jansel May 29, 2024 01:30
jgong5 commented Jun 20, 2024

> Add tests?

@jansel Thanks for the comment. The existing UTs already cover this well when they run on CPUs that support AMX. I have now added a new test to make sure the AMX micro-gemm is selected when the CPU supports it. Please review.


jgong5 commented Jun 21, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2024
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 commented Jun 21, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).



xwang233 commented Jun 24, 2024

This PR somehow caused our ARM build to fail, with error messages similar to those in #129326. Can we add an `#ifdef` guard to fix this?

cc @eqy @nWEIdia @tinglvv @atalman @malfet


nWEIdia commented Jun 24, 2024

> This PR somehow caused our ARM build to fail with similar error messages in #129326. Can we add some #ifdef guard to fix this?
>
> cc @eqy @nWEIdia @tinglvv @atalman @malfet

Thanks @xwang233. In hindsight, ciflow/binaries could have been used to catch this.


tinglvv commented Jun 24, 2024

Thanks @xwang233 for catching this. It looks like this has been causing the nightly ARM build to fail since 6/22 - https://github.com/pytorch/pytorch/actions/runs/9623875645/job/26546874031.

pytorchmergebot pushed a commit that referenced this pull request Jun 25, 2024
This PR fixes the build error on s390x after #127195.

The following is the build log on s390x. The failure occurs because `SYS_arch_prctl` is not defined on s390x.
```
...
[792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o
[793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. 
-I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()':
/pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'?
   60 |   long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      |                     ^~~~~~~~~~~~~~
      |                     SYS_prctl
[794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o
[795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o
[796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o
[797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o
[798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o
[799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o
[800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o
[801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o
[802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-2.5.0a0+git94dc325
-- Building version 2.5.0a0+git94dc325
cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch
cmake --build . --target install --config Release
Build step 'Execute shell' marked build as failure
...
```

Pull Request resolved: #129326
Approved by: https://github.com/Skylion007, https://github.com/eqy
@snadampal

aarch64 CI builds are also failing. Instead of adding `!defined(...)` checks for every variant of Arm, could someone please guard the x86_64 AMX-specific changes with an appropriate macro?
