-
Notifications
You must be signed in to change notification settings - Fork 38.7k
ARMv8 SHA2 Intrinsics #24115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARMv8 SHA2 Intrinsics #24115
Conversation
|
Concept ACK!
On Linux (the only system we care about for ARM, i guess), the following would be the way to do detection: #include <sys/auxv.h>
#include <asm/hwcap.h>
…
#ifdef __arm__
/* ARM 32 bit */
if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
have_arm_shani = true;
}
#endif
#ifdef __aarch64__
/* ARM 64 bit */
if (getauxval(AT_HWCAP) & HWCAP_SHA2) {
have_arm_shani = true;
}
#endifNote that the capability bit is on a different HWCAP word on 32 and 64 bit (dunno if you even want to support 32 bit here). |
Added in f7dd1ef |
3d77517 to
f7dd1ef
Compare
|
On commit f7dd1efae715593f5c9ff8186d518d25d1c9023c On a Linux aarch64 Cortex-A53 system with: which I presume means it has the necessary SHA2 extensions. The GCC 9.3.0 compiler used supports the extensions (crypto/libbitcoin_crypto_arm_shani.a is being built): Still, the extension doesn't seem to be detected. debug.log says: |
|
On c0849fc: |
|
This PR (c0849fc):
On master (e3ce019):
|
M1 macs would like a word with you. |
|
Speaking of m1, I was able to compile this locally on my m1 pro 10 core, ./configure realized that SHA2 intrinsics could be used. See benchmarks below. on c0849fc
on master
|
Yes, support for Apple Silicon is included in this PR. |
|
Tested c0849fc on Mac mini (M1, 2020): UPDATE. The same for the master branch (e3ce019): 51 min or 6% faster IBD. |
|
See https://github.com/sipa/bitcoin/commits/pr24115, which adds a 2-way 64-byte optimized variant. On my Cortex-A53 It's roughly a 2x speedup for the SHA256D64_1024 benchmark (relevant for Merkle root computation) compared to this PR. For more modern architectures I could imagine it's more:
For reference, master again:
|
|
@sipa's branch on m1:
previous results on c0849
|
|
I confirm the numbers on M1: before @sipa's improvements (f06f46c):
after @sipa's improvements (0e72995):
|
|
I'm not able to build this branch on m1 at the moment I just checked out sipa's branch here: sipa@0e72995 and compilation worked trivially |
|
@prusnak @PastaPastaPasta Perhaps you want to also benchmark with the two last commits removed (so at "Optimization: precompute a few 3rd transform intermediaries"). Whether the last two help may be very architecture-dependent. For me they contribute a ~30% speedup, but maytbe on M1 that is not the case. |
|
@sipa benchmark of 38ed75f
The improvement of using 0e72995 is there also for M1. |
|
Looks like the 2-way version is a clear win on M1 as well, thanks! |
|
@fanquake rebased on top of current master |
Guix builds:UPDATE: build artifacts are available in https://github.com/hebasto/artefacts/tree/master/pr24115/guix-build-aaa1d03d3ace/output |
Systems used:
I performed the following tests:
I think this proves the build mechanism and the runtime detection works as intended. |
|
Concept ACK. I find it near-impossible to follow what |
|
@Sjors That's quite possibly worth documenting in general (for all D64 code). What these functions do:
A bit about SHA256's structure.
In case of SHA256(SHA256(64 bytes)), there are 3 Transforms being invoked:
There are 3 types of optimizations we can do in this case:
The individual commits in https://github.com/sipa/bitcoin/commits/pr24115 show the process. Note that I don't think it's really required for verifying correctness to see these steps (otherwise I'd have argued for including them in this PR), but it may help understand how it came to be. |
|
I think this is the step that confuses me:
If the first transform is the equivalent of a single sha256(64 bytes) and the third is the equivalent of a second sha256() on the 32 byte result of the first, what is the second transform doing?
This is definitely worth documenting (can be another PR). Even nicer if we can generate the values in a Python script (for manual comparison, not code generation). |
|
There are two SHA256 invocations:
Input is 64 bytes, which means it gets 64 bytes of padding (because the padding is always between 9 and 72 bytes long, and the result is always a multiple of 64). For H2, SHA256(H1) just gets a 32-byte input, so it also only gets a 32-byte padding, and the result just needs one transform. So we can write it this way:
The first transform is the inner one for H1, the second the outer one for H1. The third transform is the H2 one. |
|
Ah that makes sense.
I naively assumed a 64 byte message wasn't padded, but it is: https://datatracker.ietf.org/doc/html/rfc6234#section-4.1 |
|
IBD up to block 700000 on a Rock Pi 4a w/ NVMe SSD, assumevalid=0, dbcache=2000: master (bd482b3): 68H52M Improvement ~5% |
Yes, it has to be. Otherwise you'd have a trivial 2nd preimage attack between hash(X) and hash(X || padding(len(X))), for non-multiple-of-64-bytes X. |
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Code review and lightly tested ACK aaa1d03
|
| MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48))); | ||
| chunk += 64; | ||
|
|
||
| // Original implemenation preloaded message and constant addition which was 1-3% slower. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: implemenation ==> implementation
aaa1d03 Add optimized sha256d64_arm_shani::Transform_2way (Pieter Wuille) fe06298 Implement sha256_arm_shani::Transform (Pavol Rusnak) 48a72fa Add sha256_arm_shani to build system (Pavol Rusnak) c2b7934 Rename SHANI to X86_SHANI to allow future implementation of ARM_SHANI (Pavol Rusnak) Pull request description: This PR adds support for ARMv8 SHA2 Intrinsics. Fixes bitcoin#13401 and bitcoin#17414 * Integration part was done by me. * The original SHA2 NI code comes from https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-arm.c * Minor optimizations from https://github.com/rollmeister/bitcoin-armv8/blob/master/src/crypto/sha256.cpp are applied too. * The 2-way transform added by @sipa ACKs for top commit: laanwj: Code review and lightly tested ACK aaa1d03 Tree-SHA512: 9689d6390c004269cb1ee79ed05430d7d35a6efef2554a2b6732f7258a11e7e959b3306c04b4e8637a9623fb4c12d1c1b3592da0ff0dc6d737932db302509669 # Conflicts: # configure.ac # src/Makefile.am # src/crypto/sha256.cpp
aaa1d03 Add optimized sha256d64_arm_shani::Transform_2way (Pieter Wuille) fe06298 Implement sha256_arm_shani::Transform (Pavol Rusnak) 48a72fa Add sha256_arm_shani to build system (Pavol Rusnak) c2b7934 Rename SHANI to X86_SHANI to allow future implementation of ARM_SHANI (Pavol Rusnak) Pull request description: This PR adds support for ARMv8 SHA2 Intrinsics. Fixes bitcoin#13401 and bitcoin#17414 * Integration part was done by me. * The original SHA2 NI code comes from https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-arm.c * Minor optimizations from https://github.com/rollmeister/bitcoin-armv8/blob/master/src/crypto/sha256.cpp are applied too. * The 2-way transform added by @sipa ACKs for top commit: laanwj: Code review and lightly tested ACK aaa1d03 Tree-SHA512: 9689d6390c004269cb1ee79ed05430d7d35a6efef2554a2b6732f7258a11e7e959b3306c04b4e8637a9623fb4c12d1c1b3592da0ff0dc6d737932db302509669 # Conflicts: # configure.ac # src/Makefile.am # src/crypto/sha256.cpp
…shani} 7fd0860 Bugfix: configure: Define defaults for enable_arm_{crc,shani} (Luke Dashjr) Pull request description: Fix for #17398 and #24115 Trivial, mostly for consistency (you'd have to *try* to break this) ACKs for top commit: pk-b2: ACK 7fd0860 seejee: ACK 7fd0860 vincenzopalazzo: ACK 7fd0860 Tree-SHA512: 51c389787c369f431ca57071f03392438bff9fd41f128c63ce74ca30d2257213f8be225efcb5c1329ad80b714f44427d721215d4f848cc8e63060fa5bc8f1f2e
…m_{crc,shani}
7fd0860 Bugfix: configure: Define defaults for enable_arm_{crc,shani} (Luke Dashjr)
Pull request description:
Fix for bitcoin#17398 and bitcoin#24115
Trivial, mostly for consistency (you'd have to *try* to break this)
ACKs for top commit:
pk-b2:
ACK bitcoin@7fd0860
seejee:
ACK bitcoin@7fd0860
vincenzopalazzo:
ACK bitcoin@7fd0860
Tree-SHA512: 51c389787c369f431ca57071f03392438bff9fd41f128c63ce74ca30d2257213f8be225efcb5c1329ad80b714f44427d721215d4f848cc8e63060fa5bc8f1f2e
This PR adds support for ARMv8 SHA2 Intrinsics.
Fixes #13401 and #17414