-
Notifications
You must be signed in to change notification settings - Fork 38.7k
feefrac: 128-bit multiply support in MSVC #29758
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code CoverageFor detailed information about the code coverage, see the test coverage report. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. |
hebasto
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
|
Hmm, does our CI not run the fuzz corpus when building with MSVC? |
Afaik it does not but it would probably make sense. We also do it in the MacOS job. |
No, it does not. Even a fuzz binary is not built. |
|
I think it would be very useful if we did. Fuzzing itself won't work in MSVC, but I don't think there is a technical reason why we can't run the existing fuzz corpus. |
6988dd6 to
acf59f5
Compare
acf59f5 to
d2b94d3
Compare
d2b94d3 to
5fb70b5
Compare
18fd522 ci, msvc: Add "Run fuzz binaries" step (Hennadii Stepanov) 52933d7 fuzz: Pass `SystemRoot` environment variable to subprocess (Hennadii Stepanov) 23cb820 ci, msvc: Add "Clone fuzz corpus" step (Hennadii Stepanov) 19dcedd build, msvc: Build `fuzz.exe` binary (Hennadii Stepanov) 4c078d7 build, msvc: Enable preprocessor conformance mode (Hennadii Stepanov) 09f5a74 fuzz: Re-implement `read_stdin` in portable way (Hennadii Stepanov) Pull request description: Closes #29760. Suggested in #29758 (comment). ACKs for top commit: maflcko: lgtm ACK 18fd522 🔍 sipsorcery: tACK 18fd522 sipa: utACK 18fd522 Tree-SHA512: 672ed6926ee9091f68f13780e77b60fc1d48731f16e847d849374f8426ffe1dafd9bcab06a27af62e8052ba345bb57f20f40579d6be8540c12ef85c23a6eec8b
theuni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 5fb70b5
(I didn't look into the additional unit tests)
Godbolt output here:
https://godbolt.org/z/8Msz984bv
Compiles down to roughly the same.
| { | ||
| // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication. | ||
| std::pair<int64_t, uint64_t> ret; | ||
| ret.second = _mul128(a, b, &ret.first); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To help other reviewers:
__int64 _mul128(
__int64 Multiplier,
__int64 Multiplicand,
__int64 *HighProduct
);
Return value
The low 64 bits of the product.So this should result in:
ret = <high, low>
Which matches the other impls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note the weirdly signed return value __int64 here, which IMHO makes no sense whatsoever. Semantically, the return value is 264HighProduct + (uint64_t)ret.
| return __int128{a} * b; | ||
| } | ||
| #elif defined(_MSC_VER) && defined(_M_X64) | ||
| static inline std::pair<int64_t, uint64_t> Mul(int64_t a, int32_t b) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this Mul's return type and MulFallback's one differ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the point; they represent the 96-bit integer product in a different way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be possible to use std::pair<int32_t, uint64_t> here instead (as in, that type would be big enough to store the result), but _mul128 returns 64-bit results, which match the register size of x86_64, so no conversion is needed.
Also note that std::pair<int64_t, uint64_t> probably roughly corresponds to how __int128 is represented internally (the CPU has no 128-bit general-purpose registers, and the mul instruction returns two 64-bit registers).
|
I've add a benchmark for the Here are results on Windows 11,
The performance worsened by ~4%. cc @sipsorcery FWIW, this benchmark shows <10% of performance improvement for |
| // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication. | ||
| std::pair<int64_t, uint64_t> ret; | ||
| ret.second = _mul128(a, b, &ret.first); | ||
| return ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on MSVC one needs to use the
_mul128or_mulhintrinsics instead
From my research, it follows that __mulh seems more performant.
This code:
| // On 64-bit MSVC, use _mul128 intrinsic for wide multiplication. | |
| std::pair<int64_t, uint64_t> ret; | |
| ret.second = _mul128(a, b, &ret.first); | |
| return ret; | |
| // On 64-bit MSVC, use __mulh intrinsic for wide multiplication. | |
| return {__mulh(a, b), std::bit_cast<uint64_t>(a) * b}; |
gives the following numbers for the benchmark:
> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.41 | 707,119,862.93 | 2.6% | 0.17 | `FeefracMultipication`
> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.36 | 734,209,621.15 | 0.9% | 0.16 | `FeefracMultipication`
> build_msvc\x64\Release\bench_bitcoin.exe -filter=FeefracMultipication
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.39 | 720,775,228.66 | 3.1% | 0.17 | `FeefracMultipication`
|
@hebasto Would you mind benchmarking with smaller realistic numbers (say fee and size both between 0 and 2^30)? If the top limb is equal, it's possible the naive code does worse. |
I switched to
Numbers have not changed:
|
|
@hebasto Thanks. I was assuming this would be an obvious improvement, but if it isn't, it'll need some more investigation into what this is all compiled to. That's not something I'm interested in doing for a half-supported architecture. |
Feerate comparisons in the recently (#29242) introduced
FeeFractype rely on multiplications between 32-bit and 64-bit integers. On 64-bit systems, hardware can do this natively. On GCC and Clang we can use the__int128type for this, but on MSVC one needs to use the_mul128or_mulhintrinsics instead. This PR adds the use of_mul128which is available on x86_64 systems.Performance of these operations isn't currently very important, but they will become crucial with cluster mempool.
I have not tested this code myself, though it's based on similar code in libsecp256k1 (see https://github.com/bitcoin-core/secp256k1/blob/v0.4.1/src/int128_struct_impl.h#L7L30).