@jgong5 jgong5 commented May 26, 2024

Stack from ghstack (oldest at bottom):

This PR adds an intrinsics-based micro-gemm for BF16 using the Advanced Matrix Extensions (AMX) instructions available on 4th and 5th generation Intel Xeon processors. A compilation check is added to codecache.py to verify that the compiler supports these instructions. Also, since AMX requires asking the Linux kernel to enable its extra register state, an initialization function is added for that purpose and is triggered via codecache.py.

Models with a performance speedup of >=10% under BF16 AMP, max_autotune vs. no autotune, measured on an Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes
Single-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions.

cc @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented May 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127195

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit d678025 with merge base 9dd8f8c:

UNSTABLE - The following jobs failed, but the failures were likely due to flakiness present on trunk and have been marked as unstable.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label May 27, 2024
jgong5 pushed a commit that referenced this pull request May 27, 2024
ghstack-source-id: a6ce91f
Pull Request resolved: #127195
jgong5 pushed a commit that referenced this pull request May 28, 2024
ghstack-source-id: 5591275
Pull Request resolved: #127195
@jgong5 jgong5 changed the title [inductor][cpp] AMX micro-gemm support [inductor][cpp] BF16 AMX micro-gemm support May 28, 2024
jgong5 pushed a commit that referenced this pull request May 28, 2024
ghstack-source-id: 5c0a0ba
Pull Request resolved: #127195
@jgong5 jgong5 marked this pull request as ready for review May 28, 2024 15:40
@jgong5 jgong5 added the topic: not user facing topic category label May 28, 2024
@jgong5 jgong5 requested a review from jansel May 29, 2024 01:30
jgong5 commented Jun 20, 2024

> Add tests?

@jansel Thanks for the comment. The existing UTs already cover this well when they run on CPUs that support AMX. I have now added a new test to make sure the AMX micro-gemm is selected when the CPU supports it. Please review.


jgong5 commented Jun 21, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 21, 2024
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

jgong5 commented Jun 21, 2024

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).



xwang233 commented Jun 24, 2024

This PR somehow caused our ARM build to fail, with error messages similar to those in #129326. Can we add an `#ifdef` guard to fix this?

cc @eqy @nWEIdia @tinglvv @atalman @malfet


nWEIdia commented Jun 24, 2024

> This PR somehow caused our ARM build to fail with similar error messages in #129326. Can we add some #ifdef guard to fix this?
>
> cc @eqy @nWEIdia @tinglvv @atalman @malfet

Thanks @xwang233. In hindsight, ciflow/binaries could have been used to catch this.


tinglvv commented Jun 24, 2024

Thanks @xwang233 for catching this. It looks like this has been causing the nightly ARM build to fail since 6/22 - https://github.com/pytorch/pytorch/actions/runs/9623875645/job/26546874031.

pytorchmergebot pushed a commit that referenced this pull request Jun 25, 2024
This PR fixes the build error on s390x after #127195.

The following is the build log on s390x. The failure occurs because `SYS_arch_prctl` is not defined on s390x.
```
...
[792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o
[793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. 
-I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()':
/pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'?
   60 |   long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      |                     ^~~~~~~~~~~~~~
      |                     SYS_prctl
[794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o
[795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o
[796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o
[797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o
[798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o
[799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o
[800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o
[801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o
[802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-2.5.0a0+git94dc325
-- Building version 2.5.0a0+git94dc325
cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch
cmake --build . --target install --config Release
Build step 'Execute shell' marked build as failure
...
```

Pull Request resolved: #129326
Approved by: https://github.com/Skylion007, https://github.com/eqy
@snadampal

aarch64 CI builds are also failing. Instead of adding `!defined(...)` checks for every variant of Arm, could someone please guard the x86_64 AMX-specific changes with an appropriate macro?
