-
Notifications
You must be signed in to change notification settings - Fork 38.7k
[IBD] specialize block serialization #31868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[IBD] specialize block serialization #31868
Conversation
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Code Coverage & BenchmarksFor details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868. ReviewsSee the guideline for information on the review process. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. LLM Linter (✨ experimental)Possible places where named args for integral literals may be used (e.g.
2026-01-15 |
a7db42f to
c5bbf57
Compare
|
Drafting until #31519 is merged, as recommended in #31868 (comment) |
c5bbf57 to
19d20ce
Compare
|
🚧 At least one of the CI tasks failed. HintsTry to run the tests locally, according to the documentation. However, a CI failure may still
Leave a comment here, if you need help tracking down a confusing failure. |
f98fb61 to
a9d311c
Compare
|
I will work on this more in the upcoming weeks, drafted it for now since I don't like the current single-byte specializations, have to find a better way. Suggestions are welcome. |
|
My guess is that the concept specialization here at most could have an effect on xor'd IO? This is because anything not xor'd won't need a buffer to (un)do the xor before reading or writing. Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it? However, for block serialization they are already used. (Edit: I misunderstood that this has benefits outside of BufferedReader. Please ignore this comment) Again, no objection here, but it would be good to explain:
Right now:
|
073e28b to
a84f12e
Compare
|
Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly... Suggestions are welcome. |
…addata16be` 0431a69 cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` (Lőrinc) Pull request description: Remove dead code from serialization logic - extracted from #31868 ACKs for top commit: maflcko: lgtm ACK 0431a69 achow101: ACK 0431a69 jlest01: ACK 0431a69 Tree-SHA512: 1881a164b2a91bb6033770db625f10b845bfb17a9898efb7612ca29ba113175b29e345beb84488f388b4639ade98df98e3411e663149bcbaec6e3abeffe1cbef
a84f12e to
f4cb559
Compare
|
Are you still working on this? |
|
Of course, paused it since I'm not exactly sure how do handle single-byte entries: they measurably speed up the scenarios, but the resulting code is quite ugly. Do you have a suggestion? Should I attempt that in a separate PR and ignore it here? Edit: draft because of the requests in #31868 (comment) which I will need to remeasure |
|
What was the speedup? I can only see your last comment, which says:
|
Measure both full block serialization and size computation via `SizeComputer`. `SizeComputer` returns the exact final size of the serialized content without writing any bytes. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 195,610.62 | 5,112.20 | 0.3% | 11.00 | `SerializeBlock` | 12,061.83 | 82,906.19 | 0.1% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 867,857.55 | 1,152.26 | 0.0% | 8,015,883.90 | 3,116,099.08 | 2.572 | 1,517,035.87 | 0.5% | 10.81 | `SerializeBlock` | 30,928.27 | 32,332.88 | 0.0% | 221,683.03 | 111,055.84 | 1.996 | 53,037.03 | 0.8% | 11.03 | `SizeComputerBlock`
Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions). This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later
Endianness doesn’t affect final size, so skip it in `SizeComputer`. Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 191,652.29 | 5,217.78 | 0.4% | 10.96 | `SerializeBlock` | 10,323.55 | 96,865.92 | 0.2% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 614,847.32 | 1,626.42 | 0.0% | 8,015,883.64 | 2,207,628.07 | 3.631 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 26,020.31 | 38,431.52 | 0.0% | 159,390.03 | 93,438.33 | 1.706 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`
f4cb559 to
0068641
Compare
Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector). It makes sense to avoid the generalized serialization infrastructure that isn't needed: * AutoFile write doesn't need to allocate 4k buffer for a single byte now; * `VectorWriter` and `DataStream` avoids memcpy/insert calls. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 174,569.19 | 5,728.39 | 0.6% | 10.89 | `SerializeBlock` | 10,241.16 | 97,645.21 | 0.0% | 11.00 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 615,000.56 | 1,626.01 | 0.0% | 8,015,883.64 | 2,208,340.88 | 3.630 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 25,676.76 | 38,945.72 | 0.0% | 159,390.03 | 92,202.10 | 1.729 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`
0068641 to
41ef25f
Compare
This change is part of [IBD] - Tracking PR for speeding up Initial Block Download
This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don't like all the specializations.
Summary
This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged
std::spanchanges enabling propagating static extents.The commits merge similar (de)serialization methods, and separates them internally with
if constexpr- similarly to how it has been done here before. This enabled furtherSizeComputeroptimizations as well.Context
Other than these, since single byte writes are used very often (used for every
(u)int8_torstd::byteorbooland for everyVarInt's first byte which is also needed for every(pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn't needed:AutoFilewrite doesn't need to allocate 4k buffer for a single byte now;VectorWriterandDataStreamavoids memcpy/insert calls;CSHA256::Writecan avoidmemcpy.DeserializeBlockis dominated by the hash calculations so the optimizations barely affect it.Measurements
C compiler ............................ AppleClang 16.0.0.16000026
SerializeBlockSizeComputerBlockSerializeBlockSizeComputerBlockC++ compiler .......................... GNU 13.3.0
SerializeBlockSizeComputerBlockSerializeBlockSizeComputerBlockWhile this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:
Details