-
Notifications
You must be signed in to change notification settings - Fork 38.8k
SHA256 implementations based on Intel SHA Extensions #13386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/crypto/sha256.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about
ret = "sse4";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
TransformD64_4way = sha256d64_sse41::Transform_4way;
ret += ",sse41";
#endif
}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
Great works as usual! I just came cross this thread I hope you will have time to look at what they did! |
|
Needs rebase |
|
Concept ACK, nice numbers. |
|
Rebased. @Kick1986 Nice, I'll have a look. |
Note to reviewers: This pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
For clang version, looks like they were added in 3.4, but never noted in the release notes. |
src/crypto/sha256_shani.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these could be _mm_load_si128 if s[] was alignas(16).
src/Makefile.am
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are starting to get out of hand. I think we should take @TheBlueMatt's suggestion and treat LIBBITCOIN_CRYPTO as a collection of these helpers. That way we can just add $(LIBBITCOIN_CRYPTO) everywhere, and that will pull in the cpu-specific libs as well. Something like:
...
LIBBITCOIN_CRYPTO=crypto/libbitcoin_crypto.a
LIBBITCOIN_CRYPTO_AVX2=crypto/libbitcoin_crypto_avx2.a
LIBBITCOIN_CRYPTO_SHANI=crypto/libbitcoin_crypto_shani.a
LIBBITCOIN_CRYPTO+=$(LIBBITCOIN_CRYPTO_AVX2)
LIBBITCOIN_CRYPTO+=$(LIBBITCOIN_CRYPTO_SHANI)
...
Then the cpu-specific ones can be dropped from the LDADDs all over the place. I'm happy to do up a patch on top of this if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@theuni Actually, feel like PRing that as a separate PR, before this one goes in? Then @TheBlueMatt and I can both rebase ours on top of yours and not conflict with each other.
|
concept ACK. I noticed while testing #13400 that I added a bug, but bitcoind started up fine anyway, due to missing sanity checks for the double/4way/8way hashes. Mind adding those? |
src/crypto/sha256_shani.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe including immintrin.h is enough for both platforms. x86intrin includes immintrin.h, and immintrin.h includes everything that is needed: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/immintrin.h
EDIT: Tested on Linux GCC and Clang, including immintrin.h is enough for all platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed.
f68049d crypto: cleanup sha256 build (Cory Fields) Pull request description: Requested by @sipa in #13386. Rather than appending all possible cpu variants to all targets, create a convenience variable that encompasses all. Tree-SHA512: 8e9ab2185515672b79bb7925afa4f3fbfe921bfcbe61456833d15457de4feba95290de17514344ce42ee81cc38b252476cd0c29432ac48c737c2225ed515a4bd
|
Rebased. |
| Needs rebase |
|
Rebased. |
src/crypto/sha256_shani.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linter is yelling about include style here.
| Needs rebase |
|
Rebased after #13471 merge. Also split up the CPU feature detection logic change into its own commit.
That was addressed in #13438. |
|
|
Thanks! utACK 66b2cf1. |
|
Not sure what compelled me to do this, and it's probably overkill... but... Tested ACK 66b2cf1 with 100k rounds of quickcheck at various optimization levels, but only with the non-two way transform for now. |
|
Just a nit from older pull requests: Now that it has a custom CPUID function Lines 534 to 538 in d96bdd7
including <cpuid.h> is not necessary now:
Lines 12 to 16 in d96bdd7
Thank you for your awesome contributions! |
|
ACK |
|
utACK 66b2cf1 |
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille) 4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille) 268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille) Pull request description: Based on #13191. This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4. In addition to #13191, two extra implementations are provided: * (a) A variable-length SHA256 implementation using SHA extensions. * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions. Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system: * Using generic C++ code (pre-#10821): 6.1ms * Using SSE4 (master, #10821): 4.6ms * Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms * Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms Benchmarks for 32-byte SHA256 on the same system: * Using SSE4 (master, #10821): 190ns * Using SHA-NI (this PR): 53ns Benchmarks for 1000000-byte SHA256 on the same system: * Using SSE4 (master, #10821): 2.5ms * Using SHA-NI (this PR): 0.51ms Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
f68049d crypto: cleanup sha256 build (Cory Fields) Pull request description: Requested by @sipa in bitcoin#13386. Rather than appending all possible cpu variants to all targets, create a convenience variable that encompasses all. Tree-SHA512: 8e9ab2185515672b79bb7925afa4f3fbfe921bfcbe61456833d15457de4feba95290de17514344ce42ee81cc38b252476cd0c29432ac48c737c2225ed515a4bd
…ions 66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille) 4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille) 268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille) Pull request description: Based on bitcoin#13191. This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4. In addition to bitcoin#13191, two extra implementations are provided: * (a) A variable-length SHA256 implementation using SHA extensions. * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions. Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system: * Using generic C++ code (pre-bitcoin#10821): 6.1ms * Using SSE4 (master, bitcoin#10821): 4.6ms * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms Benchmarks for 32-byte SHA256 on the same system: * Using SSE4 (master, bitcoin#10821): 190ns * Using SHA-NI (this PR): 53ns Benchmarks for 1000000-byte SHA256 on the same system: * Using SSE4 (master, bitcoin#10821): 2.5ms * Using SHA-NI (this PR): 0.51ms Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
Based on #13191.
This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.
In addition to #13191, two extra implementations are provided:
Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
Benchmarks for 32-byte SHA256 on the same system:
Benchmarks for 1000000-byte SHA256 on the same system: