@swolchok swolchok commented Oct 14, 2024

Stack from ghstack (oldest at bottom):

Migrated as much as was possible and convenient, focusing on fp16
for now. (This is building toward enabling these fast paths on x86
machines without AVX-512 fp16/bf16 support, to fix
pytorch/torchchat#1253.)

Differential Revision: D64218206

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot bot commented Oct 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137912

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c8f47af with merge base b9618c9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D64218206

swolchok added a commit that referenced this pull request Oct 29, 2024
…el from intrinsics to vec::Vectorized"

Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, in particular on x86 machines without AVX512BF16.)

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…pu/ (#137914)

This is in preparation for supporting x86 as well; this code needs to
live in this directory so that it gets rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). It also incidentally starts
fulfilling the request from @malfet to split the ARM64 fast path code
into its own file. BFloat16 will follow in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: #137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913
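
The "rebuilt with different CPU_CAPABILITY settings" idea in the commit message above can be illustrated with a minimal sketch. It assumes the build compiles the same source file once per target and passes a different `CPU_CAPABILITY` macro each time. The `sketch` namespace, the `SKETCH_STR` macros, and `kernel_variant` are hypothetical names for illustration only and are not taken from this PR.

```cpp
#include <string>

// Assumption: the build system would pass -DCPU_CAPABILITY=<name> once per
// compilation of this file (for example AVX2 or AVX512). Fall back to
// DEFAULT when the macro is not defined.
#ifndef CPU_CAPABILITY
#define CPU_CAPABILITY DEFAULT
#endif

// Two-step stringification so the macro is expanded before it is quoted.
#define SKETCH_STR_IMPL(x) #x
#define SKETCH_STR(x) SKETCH_STR_IMPL(x)

namespace sketch {
// Each compilation places its copy of the kernel in a namespace named after
// the capability, so the copies do not collide at link time.
inline namespace CPU_CAPABILITY {

// Hypothetical kernel entry point; it only reports which variant this copy is.
inline std::string kernel_variant() {
  return SKETCH_STR(CPU_CAPABILITY);
}

}  // inline namespace CPU_CAPABILITY
}  // namespace sketch
```

Compiled without any extra flags, `sketch::kernel_variant()` reports the `DEFAULT` variant; compiling the same file with `-DCPU_CAPABILITY=AVX2` would produce a separate copy under `sketch::AVX2`.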
pytorchmergebot pushed a commit that referenced this pull request Oct 29, 2024
…whole vector register instead of half (#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: #137916
Approved by: https://github.com/malfet
ghstack dependencies: #137661, #137911, #137912, #137913, #137914, #137915
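
The shape of the change described above (whole-register main loop, scalar fixup for the leftovers) can be sketched roughly as follows. This is not the kernel from the PR: it assumes 8 lanes per register (which would match 8 fp16 values in a 128-bit register and the "last 7 elements" remark), uses plain `float` instead of fp16, and models the vector register with a small array rather than intrinsics or `vec::Vectorized`. The names `kLanes` and `dot_with_scalar_tail` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: one full vector register holds 8 elements.
constexpr std::size_t kLanes = 8;

// Dot product that only "vectorizes" whole registers and finishes the
// remaining (at most kLanes - 1) elements with a plain scalar loop.
float dot_with_scalar_tail(const std::vector<float>& a,
                           const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  const std::size_t vec_end = n - n % kLanes;  // largest multiple of kLanes <= n

  // Main loop: one accumulator per lane, standing in for a vector register.
  float acc[kLanes] = {};
  for (std::size_t i = 0; i < vec_end; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      acc[lane] += a[i + lane] * b[i + lane];
    }
  }

  // Horizontal reduction of the per-lane accumulators.
  float sum = 0.0f;
  for (float v : acc) {
    sum += v;
  }

  // Scalar fixup loop for the leftover elements; no partial-register path.
  for (std::size_t i = vec_end; i < n; ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}
```

Keeping the tail scalar means the only vector width the code depends on is the full register, which is one plausible reading of why this makes an x86 port simpler.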
@malfet malfet added ciflow/linux-aarch64 linux aarch64 CI workflow and removed Merged labels Oct 29, 2024
swolchok added a commit that referenced this pull request Oct 31, 2024
…el from intrinsics to vec::Vectorized"


Very similar to #137912, but for bf16. (This is building toward enabling this fast path on non-ARM architectures, in particular on x86 machines without AVX512BF16.)

Testing: checked for regressions with llm_experiments' benchmarks/benchmark_torch_mm.py llm on an M1 Mac; performance appeared to be neutral. To support this assessment, I inspected the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt`, run from the pytorch root directory after `python setup.py develop`) and observed minor instruction scheduling changes but nothing more.

Differential Revision: [D65120325](https://our.internmc.facebook.com/intern/diff/D65120325/)

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

swolchok added a commit that referenced this pull request Nov 1, 2024
…yTorch] Migrate bf16 gemv fast path kernel from intrinsics to vec::Vectorized"
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…Vectorized (pytorch#137912)

Migrated as much as was possible and convenient, focusing on fp16
for now. (This is building toward enabling these fast paths on x86
machines without AVX-512 fp16/bf16 support, to fix
pytorch/torchchat#1253.)

Differential Revision: [D64218206](https://our.internmc.facebook.com/intern/diff/D64218206/)

Pull Request resolved: pytorch#137912
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…pu/ (pytorch#137914)

This is in preparation for supporting x86 as well; this code needs to
live in this directory so that it gets rebuilt with different
CPU_CAPABILITY settings (AVX2/AVX-512). It also incidentally starts
fulfilling the request from @malfet to split the ARM64 fast path code
into its own file. BFloat16 will follow in a later diff.

Differential Revision: [D64265755](https://our.internmc.facebook.com/intern/diff/D64265755/)

Pull Request resolved: pytorch#137914
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
…whole vector register instead of half (pytorch#137916)

The fixup loop doesn't really need to vectorize the last 7 elements, and not doing so will make migrating to x86 simpler.

Differential Revision: [D64280689](https://our.internmc.facebook.com/intern/diff/D64280689/)

Pull Request resolved: pytorch#137916
Approved by: https://github.com/malfet
ghstack dependencies: pytorch#137661, pytorch#137911, pytorch#137912, pytorch#137913, pytorch#137914, pytorch#137915
@github-actions github-actions bot deleted the gh/swolchok/660/head branch November 29, 2024 02:13

Labels

ciflow/linux-aarch64, ciflow/trunk, fb-exported, module: cpu, topic: not user facing
