@swolchok swolchok commented Oct 29, 2024

Stack from ghstack (oldest at bottom):

Very similar to #137912, but for bf16. (This builds toward enabling this fast path on non-ARM architectures, in particular on x86 machines without AVX512BF16.)

Testing: checked for regressions with llm_experiments' benchmarks/benchmark_torch_mm.py llm on an M1 Mac; performance appeared to be neutral. Supported this assessment by inspecting the assembly for the bf16 dot kernel (`objdump -d --no-leading-addr --no-show-raw-insn build/caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/ReducedPrecisionFloatGemvFastPathKernel.cpp.DEFAULT.cpp.o | c++filt`, run from the pytorch root directory after `python setup.py develop`); observed minor instruction scheduling changes but nothing more.
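For readers unfamiliar with what a bf16 dot kernel computes, here is a minimal scalar reference sketch. It is not the code in ReducedPrecisionFloatGemvFastPathKernel.cpp and does not use vec::Vectorized. It assumes bf16 values are stored as raw `uint16_t` bit patterns and accumulates in fp32. All three function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not a PyTorch API): widen one bf16 value, stored as raw
// bits, to fp32. bf16 is the upper 16 bits of an IEEE-754 binary32, so
// widening is a 16-bit left shift of the bit pattern.
inline float bf16_bits_to_float(std::uint16_t bits) {
  const std::uint32_t widened = static_cast<std::uint32_t>(bits) << 16;
  float out;
  std::memcpy(&out, &widened, sizeof(out));
  return out;
}

// Hypothetical helper: narrow fp32 to bf16 bits by truncation (drops the low
// 16 mantissa bits, no rounding). Used here only to build test inputs.
inline std::uint16_t float_to_bf16_bits_truncate(float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical scalar reference: dot product of two bf16 vectors with an fp32
// accumulator. A vectorized kernel would process several lanes per iteration.
inline float bf16_dot_reference(const std::uint16_t* a,
                                const std::uint16_t* b,
                                std::size_t n) {
  assert((a != nullptr && b != nullptr) || n == 0);
  float acc = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    acc += bf16_bits_to_float(a[i]) * bf16_bits_to_float(b[i]);
  }
  return acc;
}
```

A scalar reference like this can serve as a correctness baseline when comparing a vectorized implementation's output, keeping in mind that a different summation order may change the low bits of the result.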

Differential Revision: D65120325

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Oct 29, 2024
pytorch-bot bot commented Oct 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139159

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 40 Cancelled Jobs

As of commit a78691a with merge base 3e0f4d1 (image):

NEW FAILURE - The following job has failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D65120325

swolchok added a commit that referenced this pull request Oct 29, 2024
ghstack-source-id: 250611665
Pull Request resolved: #139159
@swolchok swolchok added ciflow/mps Run MPS tests (subset of trunk) ciflow/linux-aarch64 linux aarch64 CI workflow labels Oct 31, 2024

…bf16 gemv fast path kernel from intrinsics to vec::Vectorized"



swolchok commented Nov 1, 2024

Folding this one into #139081 because I need increasingly large parts of it there anyway.

@swolchok swolchok closed this Nov 1, 2024
@github-actions github-actions bot deleted the gh/swolchok/682/head branch December 2, 2024 02:14
Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/mps (Run MPS tests, subset of trunk)
fb-exported
module: cpu (CPU specific problem, e.g., perf, algorithm)
topic: not user facing (topic category)
