Skip to content

Conversation

@l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Feb 14, 2025

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download


This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don't like all the specializations.


Summary

This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

Context

Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn't needed:

  • AutoFile write doesn't need to allocate 4k buffer for a single byte now;
  • VectorWriter and DataStream avoids memcpy/insert calls;
  • CSHA256::Write can avoid memcpy.

DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

Measurements

C compiler ............................ AppleClang 16.0.0.16000026

Before:

ns/block block/s err% total benchmark
195,610.62 5,112.20 0.3% 11.00 SerializeBlock
12,061.83 82,906.19 0.1% 11.01 SizeComputerBlock

After:

ns/block block/s err% total benchmark
174,569.19 5,728.39 0.6% 10.89 SerializeBlock
10,241.16 97,645.21 0.0% 11.00 SizeComputerBlock

SerializeBlock - ~12.% faster
SizeComputerBlock - ~17.7% faster


C++ compiler .......................... GNU 13.3.0

Before:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
867,857.55 1,152.26 0.0% 8,015,883.90 3,116,099.08 2.572 1,517,035.87 0.5% 10.81 SerializeBlock
30,928.27 32,332.88 0.0% 221,683.03 111,055.84 1.996 53,037.03 0.8% 11.03 SizeComputerBlock

After:

ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
615,000.56 1,626.01 0.0% 8,015,883.64 2,208,340.88 3.630 1,517,035.62 0.5% 10.56 SerializeBlock
25,676.76 38,945.72 0.0% 159,390.03 92,202.10 1.729 42,131.03 0.9% 11.00 SizeComputerBlock

SerializeBlock - ~41.1% faster
SizeComputerBlock - ~20.4% faster


While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

Details
COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
STOP_HEIGHT=880000; DBCACHE=10000; \
C_COMPILER=gcc; CXX_COMPILER=g++; \
hyperfine \
--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
--runs 3 \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
  Range (min … max):   33283.439 s … 34229.026 s    3 runs
 
Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
  Range (min … max):   32888.211 s … 33253.706 s    3 runs
 
Summary
  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0

@DrahtBot
Copy link
Contributor

DrahtBot commented Feb 14, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #33878 (refactor, docs: Embedded ASMap [2/3]: Refactor asmap internals and add documentation by fjahr)
  • #33026 (test, refactor: Embedded ASMap [1/3]: Selected minor preparatory work by fjahr)
  • #32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
  • #31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
  • #28792 (build: Embedded ASMap [3/3]: Build binary dump header file by fjahr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

  • Transform(s, buf, 1) in src/crypto/sha256.cpp

2026-01-15

@l0rinc
Copy link
Contributor Author

l0rinc commented Mar 9, 2025

Drafting until #31519 is merged, as recommended in #31868 (comment)

@l0rinc l0rinc marked this pull request as draft March 9, 2025 23:59
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from c5bbf57 to 19d20ce Compare March 10, 2025 00:09
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/38458137649

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@l0rinc
Copy link
Contributor Author

l0rinc commented May 4, 2025

I will work on this more in the upcoming weeks, drafted it for now since I don't like the current single-byte specializations, have to find a better way. Suggestions are welcome.

@maflcko
Copy link
Member

maflcko commented May 4, 2025

My guess is that the concept specialization here at most could have an effect on xor'd IO? This is because anything not xor'd won't need a buffer to (un)do the xor before reading or writing.

Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it?

However, for block serialization they are already used.

(Edit: I misunderstood that this has benefits outside of BufferedReader. Please ignore this comment)

Again, no objection here, but it would be good to explain:

  • What is the goal of the changes here, and what real-world usage do they affect
  • How is it achieved and how do the benchmarks represent the real-world usage

Right now:

  • It only states IBD as motivation and also that IBD isn't the motivation. Also the IBD speedup measurement seems outdated.
  • The added benchmark is about test-only code paths (DataStream) or p2p code paths (VectorWriter). It is unclear what the AutoFile changes are about. Also, it is unclear how this could possibly affect the kernel, given that the kernel doesn't call into test-only or p2p-only code paths and instead uses BufferedReader/Writer.

@l0rinc
Copy link
Contributor Author

l0rinc commented Jul 29, 2025

Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly... Suggestions are welcome.
In the meantime I have split out the first refactor into #33093.

achow101 added a commit that referenced this pull request Jul 31, 2025
…addata16be`

0431a69 cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` (Lőrinc)

Pull request description:

  Remove dead code from serialization logic - extracted from #31868

ACKs for top commit:
  maflcko:
    lgtm ACK 0431a69
  achow101:
    ACK 0431a69
  jlest01:
    ACK 0431a69

Tree-SHA512: 1881a164b2a91bb6033770db625f10b845bfb17a9898efb7612ca29ba113175b29e345beb84488f388b4639ade98df98e3411e663149bcbaec6e3abeffe1cbef
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a84f12e to f4cb559 Compare August 1, 2025 01:33
@maflcko
Copy link
Member

maflcko commented Sep 1, 2025

Are you still working on this?

@l0rinc
Copy link
Contributor Author

l0rinc commented Sep 1, 2025

Of course, paused it since I'm not exactly sure how do handle single-byte entries: they measurably speed up the scenarios, but the resulting code is quite ugly. Do you have a suggestion? Should I attempt that in a separate PR and ignore it here?

Edit: draft because of the requests in #31868 (comment) which I will need to remeasure

@maflcko
Copy link
Member

maflcko commented Sep 2, 2025

What was the speedup? I can only see your last comment, which says:

the PR is still in draft mode until I remeasure its effect on IBD

Measure both full block serialization and size computation via `SizeComputer`.
`SizeComputer` returns the exact final size of the serialized content without writing any bytes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
|           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
|           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`
Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions).
This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later
Endianness doesn’t affect final size, so skip it in `SizeComputer`.
Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
|           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from f4cb559 to 0068641 Compare January 15, 2026 12:52
Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure that isn't needed:
* AutoFile write doesn't need to allocate 4k buffer for a single byte now;
* `VectorWriter` and `DataStream` avoids memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
|           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
@l0rinc l0rinc force-pushed the lorinc/block-serialization-optimizations branch from 0068641 to 41ef25f Compare January 15, 2026 12:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants