[PyTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized #139159

swolchok · 2024-10-29T05:41:06Z

Stack from ghstack (oldest at bottom):

Very similar to #137912, but for bf16. (This builds toward enabling the fast path on non-ARM architectures, in particular on x86 machines without AVX512BF16.)

Testing: checked for regressions with llm_experiments' `benchmarks/benchmark_torch_mm.py llm` on an M1 Mac; performance appeared neutral. To support this assessment, I inspected the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt`, run from the pytorch root directory after `python setup.py develop`) and observed minor instruction scheduling changes but nothing more.
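For readers unfamiliar with what a bf16 dot kernel computes, here is a minimal portable sketch. It is not the code in this PR: it uses plain arrays in place of `vec::Vectorized`, and `kLanes`, `bf16_to_float`, `float_to_bf16_truncate`, and `bf16_dot` are hypothetical names chosen for illustration. It relies only on the fact that a bf16 value is the upper 16 bits of an IEEE-754 binary32 value, and it accumulates in fp32 across a fixed number of lanes before reducing.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// A bf16 value is the upper 16 bits of an IEEE-754 binary32 value,
// so widening is a 16-bit left shift followed by a bit copy.
static float bf16_to_float(std::uint16_t bits) {
  std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// Truncating narrowing (no rounding). Only meant for building test inputs.
static std::uint16_t float_to_bf16_truncate(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical lane count standing in for one vector register of floats.
constexpr std::size_t kLanes = 4;

// Dot product of two bf16 arrays with fp32 accumulation.
static float bf16_dot(const std::uint16_t* a, const std::uint16_t* b,
                      std::size_t n) {
  float acc[kLanes] = {0.0f, 0.0f, 0.0f, 0.0f};
  std::size_t i = 0;
  // Main loop: each lane keeps its own partial sum, as a SIMD kernel would.
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      acc[lane] += bf16_to_float(a[i + lane]) * bf16_to_float(b[i + lane]);
    }
  }
  // Horizontal reduction of the per-lane partial sums.
  float sum = 0.0f;
  for (std::size_t lane = 0; lane < kLanes; ++lane) {
    sum += acc[lane];
  }
  // Scalar tail for the elements that do not fill a whole vector.
  for (; i < n; ++i) {
    sum += bf16_to_float(a[i]) * bf16_to_float(b[i]);
  }
  return sum;
}
```

Because the lanes are summed separately and reduced at the end, the result can differ in the last bits from a strictly sequential sum. That is also why a migration like this one is checked by benchmark and assembly inspection rather than by bit-exact comparison alone.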

Differential Revision: D65120325

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


pytorch-bot · 2024-10-29T05:41:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139159

❌ 1 New Failure, 40 Cancelled Jobs

As of commit a78691a with merge base 3e0f4d1:

NEW FAILURE - The following job has failed:

linux-aarch64 / linux-jammy-aarch64-py3.10 / build (gh)
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp:215:27: error: cannot convert ‘at::vec::SVE256::Vectorized<float>’ to ‘float32x4_t’
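One reading of this error is that the kernel passes a `Vectorized<float>` to code that expects the raw NEON register type, which only compiles when `Vectorized<float>` is the NEON-backed wrapper. The sketch below reproduces that class of mismatch with mock types; `Float32x4`, `NeonVecFloat`, `Sve256VecFloat`, and `first_lane` are hypothetical stand-ins, not the real ATen or NEON definitions, and the presence of an implicit conversion on the NEON wrapper is an assumption.

```cpp
#include <type_traits>

// Hypothetical stand-in for the raw 128-bit register type (float32x4_t).
struct Float32x4 {
  float lanes[4];
};

// Hypothetical 128-bit wrapper that converts implicitly to the raw type.
struct NeonVecFloat {
  Float32x4 values;
  operator Float32x4() const { return values; }
};

// Hypothetical 256-bit wrapper with no such conversion.
struct Sve256VecFloat {
  float lanes[8];
};

// Code written directly against the raw register type.
inline float first_lane(Float32x4 v) {
  return v.lanes[0];
}

// Compiles for the 128-bit wrapper, but not for the 256-bit one:
// first_lane(Sve256VecFloat{}) would fail with a "cannot convert" error.
static_assert(std::is_convertible<NeonVecFloat, Float32x4>::value,
              "128-bit wrapper converts to the raw register type");
static_assert(!std::is_convertible<Sve256VecFloat, Float32x4>::value,
              "256-bit wrapper does not convert to the raw register type");
```

Under this reading, the fix would be either to keep the raw-register code path behind a guard that excludes SVE builds or to express it through the wrapper's own operations.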

CANCELLED JOBS - The following jobs were cancelled. Please retry:

Lint / lintrunner-clang / linux-job (gh)
##[error]The operation was canceled.
Lint / lintrunner-noclang / linux-job (gh)
##[error]The operation was canceled.
Mac MPS / macos-py3-arm64 / build (gh)
##[error]The operation was canceled.
Mac MPS / macos-py3-arm64-mps (gh)
pull / cuda12.1-py3.10-gcc9-sm75 (gh)
pull / cuda12.1-py3.10-gcc9-sm75 / build (gh)
##[error]The operation was canceled.
pull / linux-docs (gh)
pull / linux-focal-cpu-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge) (gh)
##[error]The operation was canceled.
pull / linux-focal-cuda11.8-py3.10-gcc9 (gh)
pull / linux-focal-cuda11.8-py3.10-gcc9 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-cuda12.1-py3.10-gcc9 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 (gh)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh)
##[error]The operation was canceled.
pull / linux-focal-py3_9-clang9-xla (gh)
pull / linux-focal-py3_9-clang9-xla / build (gh)
##[error]The operation was canceled.
pull / linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single / build-and-test (default, 1, 1, linux.2xlarge) (gh)
##[error]The operation was canceled.
pull / linux-focal-py3-clang9-android-ndk-r21e-gradle-custom-build-single-full-jit / build-and-test (default, 1, 1, linux.2xlarge) (gh)
##[error]The operation was canceled.
pull / linux-focal-py3-clang9-mobile-custom-build-static / build (gh)
##[error]The operation was canceled.
pull / linux-focal-py3.11-clang10 (gh)
pull / linux-focal-py3.11-clang10 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-py3.12-clang10 (gh)
pull / linux-focal-py3.12-clang10 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-py3.9-clang10 / build (gh)
##[error]The operation was canceled.
pull / linux-focal-py3.9-clang10-onnx (gh)
pull / linux-focal-py3.9-clang10-onnx / build (gh)
##[error]The operation was canceled.
pull / linux-focal-rocm6.2-py3.10 / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-cuda11.8-cudnn9-py3.9-clang12 / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3-clang12-executorch (gh)
pull / linux-jammy-py3-clang12-executorch / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3-clang12-mobile-build / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.10-clang15-asan (gh)
pull / linux-jammy-py3.10-clang15-asan / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.9-gcc11 (gh)
pull / linux-jammy-py3.9-gcc11 / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.9-gcc11-mobile-lightweight-dispatch-build / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.9-gcc11-no-ops / build (gh)
##[error]The operation was canceled.
pull / linux-jammy-py3.9-gcc11-pch / build (gh)
##[error]The operation was canceled.
pull / win-vs2019-cpu-py3 / build (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-10-29T05:41:29Z