Don't split 256-bit AVX2 load/store intrinsics #20609

colesbury · 2019-05-16T21:16:25Z

Recent versions of GCC split unaligned 256-bit load and store intrinsics
into two 128-bit instructions. On old processors (Sandy Bridge) this was
a bit faster for unaligned data, but a bit slower for aligned data. On
new processors (Intel Haswell+, recent AMD) splitting loads is slower on
both aligned and unaligned data.

Clang, MSVC, and ICC do not split unaligned load and store intrinsics.

There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top

Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs where splitting is disadvantageous for every data alignment.
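
To make the effect concrete, here is a minimal sketch of the kind of loop that is affected. The function names `add_f32` and `add_f32_avx` are hypothetical and are not taken from the PyTorch sources. GCC controls the splitting through its x86 options `-mavx256-split-unaligned-load` and `-mavx256-split-unaligned-store`, and the `-mno-` forms turn it off. The description above does not say which flags the change actually passes, so treat the compile commands in the comments as an assumption.

```cpp
#include <cstddef>
#include <immintrin.h>

// Hypothetical helper, not from the PR: out[i] = a[i] + b[i].
// target("avx") lets this one function use 256-bit intrinsics without
// passing -mavx on the command line (supported by GCC and Clang).
//
// Assumed way to compare the generated assembly with GCC:
//   g++ -O2 -S add.cpp
//   g++ -O2 -S -mno-avx256-split-unaligned-load \
//       -mno-avx256-split-unaligned-store add.cpp
// With the splitting enabled, each unaligned 256-bit access is emitted
// as two 128-bit instructions; with it disabled, as a single 256-bit move.
__attribute__((target("avx")))
void add_f32_avx(const float* a, const float* b, float* out, std::size_t n) {
  std::size_t i = 0;
  // Main loop: 8 floats (256 bits) per iteration, no alignment assumed.
  for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
  }
  // Scalar tail for the remaining n % 8 elements.
  for (; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}

// Dispatches to the AVX version when the CPU supports it and falls back
// to a plain scalar loop otherwise.
void add_f32(const float* a, const float* b, float* out, std::size_t n) {
  if (__builtin_cpu_supports("avx")) {
    add_f32_avx(a, b, out, n);
    return;
  }
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] + b[i];
  }
}
```

The result is the same in either case; only the instruction selection for the unaligned loads and stores differs.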

Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this would be beneficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.

torch.add generated assembly (hot loop) (GCC 7.3.0)
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295

after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91

Timing of torch.add(x, y, out=z) for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us after: 6.39 us (about 13% less time)

(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)


facebook-github-bot

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch/pytorch#20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f

facebook-github-bot · 2019-05-17T23:09:24Z

@colesbury merged this pull request in b90790a.

colesbury requested a review from cpuhrsch May 16, 2019 21:16

pytorchbot added the module: build Build system issues label May 16, 2019

soumith approved these changes May 16, 2019


facebook-github-bot reviewed May 16, 2019


facebook-github-bot closed this in b90790a May 17, 2019

facebook-github-bot added the merged label May 17, 2019

mruberry added the Merged label Oct 28, 2020
