Skip to content

Conversation

@sipa
Copy link
Member

@sipa sipa commented Jun 3, 2018

Based on #13191.

This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

In addition to #13191, two extra implementations are provided:

  • (a) A variable-length SHA256 implementation using SHA extensions.
  • (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:

Benchmarks for 32-byte SHA256 on the same system:

Benchmarks for 1000000-byte SHA256 on the same system:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about

        ret = "sse4";
#if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL)
        TransformD64_4way = sha256d64_sse41::Transform_4way;
        ret += ",sse41";
#endif
    }

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ghost
Copy link

ghost commented Jun 4, 2018

@sipa

Great works as usual!

I just came cross this thread

https://github.com/armfazh/flo-shani-aesni/blob/master/README.md

I hope you will have time to look at what they did!

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 4, 2018

Needs rebase

@promag
Copy link
Contributor

promag commented Jun 4, 2018

Concept ACK, nice numbers.

@sipa
Copy link
Member Author

sipa commented Jun 4, 2018

Rebased.

@Kick1986 Nice, I'll have a look.

@DrahtBot
Copy link
Contributor

DrahtBot commented Jun 5, 2018

Note to reviewers: This pull request conflicts with the following ones:

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@Empact
Copy link
Contributor

Empact commented Jun 5, 2018

For clang version, looks like they were added in 3.4, but never noted in the release notes.
Source: went from commit date[1] to release date[2] to file in release version[3].
Did not check for every intrinsic used.
[1] llvm-mirror/clang@b83f5a7#diff-c4f203a0f202bf56c364a657b0aecae9
[2] http://releases.llvm.org/
[3] https://github.com/llvm-mirror/clang/blob/release_34/lib/Headers/shaintrin.h

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these could be _mm_load_si128 if s[] was alignas(16).

src/Makefile.am Outdated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are starting to get out of hand. I think we should take @TheBlueMatt's suggestion and treat LIBBITCOIN_CRYPTO as a collection of these helpers. That way we can just add $(LIBBITCOIN_CRYPTO) everywhere, and that will pull in the cpu-specific libs as well. Something like:

...
LIBBITCOIN_CRYPTO=crypto/libbitcoin_crypto.a
LIBBITCOIN_CRYPTO_AVX2=crypto/libbitcoin_crypto_avx2.a
LIBBITCOIN_CRYPTO_SHANI=crypto/libbitcoin_crypto_shani.a
LIBBITCOIN_CRYPTO+=$(LIBBITCOIN_CRYPTO_AVX2)
LIBBITCOIN_CRYPTO+=$(LIBBITCOIN_CRYPTO_SHANI)
...

Then the cpu-specific ones can be dropped from the LDADDs all over the place. I'm happy to do up a patch on top of this if you'd like.

Copy link
Member Author

@sipa sipa Jun 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@theuni Actually, feel like PRing that as a separate PR, before this one goes in? Then @TheBlueMatt and I can both rebase ours on top of yours and not conflict with each other.

@theuni
Copy link
Member

theuni commented Jun 5, 2018

concept ACK.

I noticed while testing #13400 that I added a bug, but bitcoind started up fine anyway, due to missing sanity checks for the double/4way/8way hashes. Mind adding those?

Copy link
Contributor

@DesWurstes DesWurstes Jun 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe including immintrin.h is enough for both platforms. x86intrin includes immintrin.h, and immintrin.h includes everything that is needed: https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/immintrin.h

EDIT: Tested on Linux GCC and Clang, including immintrin.h is enough for all platforms.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

laanwj added a commit that referenced this pull request Jun 11, 2018
f68049d crypto: cleanup sha256 build (Cory Fields)

Pull request description:

  Requested by @sipa in #13386.

  Rather than appending all possible cpu variants to all targets, create a convenience variable that encompasses all.

Tree-SHA512: 8e9ab2185515672b79bb7925afa4f3fbfe921bfcbe61456833d15457de4feba95290de17514344ce42ee81cc38b252476cd0c29432ac48c737c2225ed515a4bd
@sipa
Copy link
Member Author

sipa commented Jun 11, 2018

Rebased.

@DrahtBot
Copy link
Contributor

Needs rebase

@sipa
Copy link
Member Author

sipa commented Jun 12, 2018

Rebased.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linter is yelling about include style here.

@DrahtBot
Copy link
Contributor

Needs rebase

@sipa
Copy link
Member Author

sipa commented Jun 24, 2018

Rebased after #13471 merge.

Also split up the CPU feature detection logic change into its own commit.

I noticed while testing #13400 that I added a bug, but bitcoind started up fine anyway, due to missing sanity checks for the double/4way/8way hashes. Mind adding those?

That was addressed in #13438.

@theuni
Copy link
Member

theuni commented Jun 25, 2018


utACK otherwise.

@theuni
Copy link
Member

theuni commented Jun 26, 2018

Thanks! utACK 66b2cf1.

@jb55
Copy link
Contributor

jb55 commented Jun 27, 2018

Not sure what compelled me to do this, and it's probably overkill... but...

Tested ACK 66b2cf1 with 100k rounds of quickcheck at various optimization levels, but only with the non-two way transform for now.

@DesWurstes
Copy link
Contributor

Just a nit from older pull requests: Now that it has a custom CPUID function

// We can't use cpuid.h's __get_cpuid as it does not support subleafs.
void inline cpuid(uint32_t leaf, uint32_t subleaf, uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& d)
{
__asm__ ("cpuid" : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "0"(leaf), "2"(subleaf));
}

including <cpuid.h> is not necessary now:

#if defined(__x86_64__) || defined(__amd64__)
#if defined(USE_ASM)
#include <cpuid.h>
namespace sha256_sse4
{

Thank you for your awesome contributions!

@gmaxwell
Copy link
Contributor

gmaxwell commented Jul 7, 2018

ACK

@laanwj
Copy link
Member

laanwj commented Jul 9, 2018

utACK 66b2cf1
tested that build passes on FreeBSD+OpenBSD

@laanwj laanwj merged commit 66b2cf1 into bitcoin:master Jul 9, 2018
laanwj added a commit that referenced this pull request Jul 9, 2018
66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on #13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to #13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-#10821): 6.1ms
  * Using SSE4 (master, #10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, #10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
f68049d crypto: cleanup sha256 build (Cory Fields)

Pull request description:

  Requested by @sipa in bitcoin#13386.

  Rather than appending all possible cpu variants to all targets, create a convenience variable that encompasses all.

Tree-SHA512: 8e9ab2185515672b79bb7925afa4f3fbfe921bfcbe61456833d15457de4feba95290de17514344ce42ee81cc38b252476cd0c29432ac48c737c2225ed515a4bd
codablock pushed a commit to codablock/dash that referenced this pull request Oct 1, 2019
…ions

66b2cf1 Use immintrin.h everywhere for intrinsics (Pieter Wuille)
4c935e2 Add SHA256 implementation using using Intel SHA intrinsics (Pieter Wuille)
268400d [Refactor] CPU feature detection logic for SHA256 (Pieter Wuille)

Pull request description:

  Based on bitcoin#13191.

  This adds SHA256 implementations that use Intel's SHA Extension instructions (using intrinsics). This needs GCC 4.9 or Clang 3.4.

  In addition to bitcoin#13191, two extra implementations are provided:
  * (a) A variable-length SHA256 implementation using SHA extensions.
  * (b) A 2-way 64-byte input double-SHA256 implementation using SHA extensions.

  Benchmarks for 9001-element Merkle tree root computation on an AMD Ryzen 1800X system:
  * Using generic C++ code (pre-bitcoin#10821): 6.1ms
  * Using SSE4 (master, bitcoin#10821): 4.6ms
  * Using 4-way SSE4 specialized for 64-byte inputs (bitcoin#13191): 2.8ms
  * Using 8-way AVX2 specialized for 64-byte inputs (bitcoin#13191): 2.1ms
  * Using 2-way SHA-NI specialized for 64-byte inputs (this PR): 0.56ms

  Benchmarks for 32-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 190ns
  * Using SHA-NI (this PR): 53ns

  Benchmarks for 1000000-byte SHA256 on the same system:
  * Using SSE4 (master, bitcoin#10821): 2.5ms
  * Using SHA-NI (this PR): 0.51ms

Tree-SHA512: 2b319e33b22579f815d91f9daf7994a5e1e799c4f73c13e15070dd54ba71f3f6438ccf77ae9cbd1ce76f972d9cbeb5f0edfea3d86f101bbc1055db70e42743b7
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.