[IBD] specialize block serialization #31868

l0rinc · 2025-02-14T16:48:23Z

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don't like all the specializations.

Summary

This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

Context

Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn't needed:

AutoFile write doesn't need to allocate 4k buffer for a single byte now;
VectorWriter and DataStream avoids memcpy/insert calls;
CSHA256::Write can avoid memcpy.

DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

Measurements

C compiler ............................ AppleClang 16.0.0.16000026

Before:

ns/block	block/s	err%	total	benchmark
195,610.62	5,112.20	0.3%	11.00	`SerializeBlock`
12,061.83	82,906.19	0.1%	11.01	`SizeComputerBlock`

After:

ns/block	block/s	err%	total	benchmark
174,569.19	5,728.39	0.6%	10.89	`SerializeBlock`
10,241.16	97,645.21	0.0%	11.00	`SizeComputerBlock`

SerializeBlock - ~12.% faster
SizeComputerBlock - ~17.7% faster

C++ compiler .......................... GNU 13.3.0

Before:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
867,857.55	1,152.26	0.0%	8,015,883.90	3,116,099.08	2.572	1,517,035.87	0.5%	10.81	`SerializeBlock`
30,928.27	32,332.88	0.0%	221,683.03	111,055.84	1.996	53,037.03	0.8%	11.03	`SizeComputerBlock`

After:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
615,000.56	1,626.01	0.0%	8,015,883.64	2,208,340.88	3.630	1,517,035.62	0.5%	10.56	`SerializeBlock`
25,676.76	38,945.72	0.0%	159,390.03	92,202.10	1.729	42,131.03	0.9%	11.00	`SizeComputerBlock`

SerializeBlock - ~41.1% faster
SizeComputerBlock - ~20.4% faster

While this wasn't the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

Details

COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
STOP_HEIGHT=880000; DBCACHE=10000; \
C_COMPILER=gcc; CXX_COMPILER=g++; \
hyperfine \
--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
--runs 3 \
--parameter-list COMMIT ${COMMITS// /,} \
--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
  Range (min … max):   33283.439 s … 34229.026 s    3 runs
 
Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
  Range (min … max):   32888.211 s … 33253.706 s    3 runs
 
Summary
  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0

DrahtBot · 2025-02-14T16:48:28Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#33878 (refactor, docs: Embedded ASMap [2/3]: Refactor asmap internals and add documentation by fjahr)
#33026 (test, refactor: Embedded ASMap [1/3]: Selected minor preparatory work by fjahr)
#32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
#31682 ([IBD] specialize CheckBlock's input & coinbase checks by l0rinc)
#28792 (build: Embedded ASMap [3/3]: Build binary dump header file by fjahr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

Transform(s, buf, 1) in src/crypto/sha256.cpp

^2026-01-15

src/streams.h

l0rinc · 2025-03-09T23:59:46Z

Drafting until #31519 is merged, as recommended in #31868 (comment)

DrahtBot · 2025-03-10T00:09:11Z

🚧 At least one of the CI tasks failed.
_{Debug: https://github.com/bitcoin/bitcoin/runs/38458137649}

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc · 2025-05-04T13:49:40Z

I will work on this more in the upcoming weeks, drafted it for now since I don't like the current single-byte specializations, have to find a better way. Suggestions are welcome.

maflcko · 2025-05-04T14:46:48Z

My guess is that the concept specialization here at most could have an effect on xor'd IO? This is because anything not xor'd won't need a buffer to (un)do the xor before reading or writing.

Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it?

However, for block serialization they are already used.

(Edit: I misunderstood that this has benefits outside of BufferedReader. Please ignore this comment)

Again, no objection here, but it would be good to explain:

What is the goal of the changes here, and what real-world usage do they affect
How is it achieved and how do the benchmarks represent the real-world usage

Right now:

It only states IBD as motivation and also that IBD isn't the motivation. Also the IBD speedup measurement seems outdated.
The added benchmark is about test-only code paths (DataStream) or p2p code paths (VectorWriter). It is unclear what the AutoFile changes are about. Also, it is unclear how this could possibly affect the kernel, given that the kernel doesn't call into test-only or p2p-only code paths and instead uses BufferedReader/Writer.

l0rinc · 2025-07-29T23:34:29Z

Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly... Suggestions are welcome.
In the meantime I have split out the first refactor into #33093.

…addata16be` 0431a69 cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` (Lőrinc) Pull request description: Remove dead code from serialization logic - extracted from #31868 ACKs for top commit: maflcko: lgtm ACK 0431a69 achow101: ACK 0431a69 jlest01: ACK 0431a69 Tree-SHA512: 1881a164b2a91bb6033770db625f10b845bfb17a9898efb7612ca29ba113175b29e345beb84488f388b4639ade98df98e3411e663149bcbaec6e3abeffe1cbef

maflcko · 2025-09-01T10:35:57Z

Are you still working on this?

l0rinc · 2025-09-01T18:12:11Z

Of course, paused it since I'm not exactly sure how do handle single-byte entries: they measurably speed up the scenarios, but the resulting code is quite ugly. Do you have a suggestion? Should I attempt that in a separate PR and ignore it here?

Edit: draft because of the requests in #31868 (comment) which I will need to remeasure

maflcko · 2025-09-02T07:26:25Z

What was the speedup? I can only see your last comment, which says:

the PR is still in draft mode until I remeasure its effect on IBD

Measure both full block serialization and size computation via `SizeComputer`. `SizeComputer` returns the exact final size of the serialized content without writing any bytes. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 195,610.62 | 5,112.20 | 0.3% | 11.00 | `SerializeBlock` | 12,061.83 | 82,906.19 | 0.1% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 867,857.55 | 1,152.26 | 0.0% | 8,015,883.90 | 3,116,099.08 | 2.572 | 1,517,035.87 | 0.5% | 10.81 | `SerializeBlock` | 30,928.27 | 32,332.88 | 0.0% | 221,683.03 | 111,055.84 | 1.996 | 53,037.03 | 0.8% | 11.03 | `SizeComputerBlock`

Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions). This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later

Endianness doesn’t affect final size, so skip it in `SizeComputer`. Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 191,652.29 | 5,217.78 | 0.4% | 10.96 | `SerializeBlock` | 10,323.55 | 96,865.92 | 0.2% | 11.01 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 614,847.32 | 1,626.42 | 0.0% | 8,015,883.64 | 2,207,628.07 | 3.631 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 26,020.31 | 38,431.52 | 0.0% | 159,390.03 | 93,438.33 | 1.706 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`

Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector). It makes sense to avoid the generalized serialization infrastructure that isn't needed: * AutoFile write doesn't need to allocate 4k buffer for a single byte now; * `VectorWriter` and `DataStream` avoids memcpy/insert calls. > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000 > C compiler ............................ AppleClang 16.0.0.16000026 | ns/block | block/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 174,569.19 | 5,728.39 | 0.6% | 10.89 | `SerializeBlock` | 10,241.16 | 97,645.21 | 0.0% | 11.00 | `SizeComputerBlock` > C++ compiler .......................... GNU 13.3.0 | ns/block | block/s | err% | ins/block | cyc/block | IPC | bra/block | miss% | total | benchmark |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:---------- | 615,000.56 | 1,626.01 | 0.0% | 8,015,883.64 | 2,208,340.88 | 3.630 | 1,517,035.62 | 0.5% | 10.56 | `SerializeBlock` | 25,676.76 | 38,945.72 | 0.0% | 159,390.03 | 92,202.10 | 1.729 | 42,131.03 | 0.9% | 11.00 | `SizeComputerBlock`

l0rinc mentioned this pull request Feb 14, 2025

reduce template bloat in primitive serialization bitcoin-dev-tools/benchcoin#116

Closed

theuni reviewed Feb 14, 2025

View reviewed changes

src/streams.h Outdated Show resolved Hide resolved

This was referenced Feb 15, 2025

[IBD] specialize CheckBlock's input & coinbase checks #31682

Open

[IBD] batch block reads/writes during AutoFile serialization #31551

Merged

refactor: Use std::span over Span #31519

Merged

[IBD] multi-byte block obfuscation #31144

Merged

DrahtBot added CI failed and removed CI failed labels Feb 16, 2025

l0rinc mentioned this pull request Feb 19, 2025

refactor: modernize outdated trait patterns using helper aliases (C++14/C++17) #31904

Merged

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a7db42f to c5bbf57 Compare March 9, 2025 23:58

l0rinc marked this pull request as draft March 9, 2025 23:59

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from c5bbf57 to 19d20ce Compare March 10, 2025 00:09

DrahtBot added the CI failed label Mar 10, 2025

l0rinc force-pushed the lorinc/block-serialization-optimizations branch 2 times, most recently from f98fb61 to a9d311c Compare March 10, 2025 14:54

DrahtBot mentioned this pull request May 9, 2025

bench: replace benchmark block with more representative one (413567 → 784588) #32457

Closed

DrahtBot added CI failed Needs rebase labels May 16, 2025

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from 073e28b to a84f12e Compare July 29, 2025 23:26

l0rinc mentioned this pull request Jul 29, 2025

refactor: remove unused ser_writedata16be and ser_readdata16be #33093

Merged

DrahtBot removed the Needs rebase label Jul 30, 2025

DrahtBot mentioned this pull request Jul 30, 2025

test, refactor: Embedded ASMap [1/3]: Selected minor preparatory work #33026

Merged

DrahtBot removed the CI failed label Jul 30, 2025

DrahtBot mentioned this pull request Jul 30, 2025

test,refactor: extract script template helpers & widen sigop count coverage #32729

Open

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from a84f12e to f4cb559 Compare August 1, 2025 01:33

DrahtBot mentioned this pull request Nov 18, 2025

refactor, docs: Embedded ASMap [2/3]: Refactor asmap internals and add documentation #33878

Open

DrahtBot added the Needs rebase label Nov 19, 2025

l0rinc mentioned this pull request Jan 14, 2026

Coins Cache Cleanup checklist #34280

Open

23 tasks

l0rinc added 4 commits January 15, 2026 13:52

refactor: add explicit static extent to spans

e2be418

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from f4cb559 to 0068641 Compare January 15, 2026 12:52

l0rinc force-pushed the lorinc/block-serialization-optimizations branch from 0068641 to 41ef25f Compare January 15, 2026 12:54

DrahtBot added CI failed and removed Needs rebase CI failed labels Jan 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[IBD] specialize block serialization #31868

[IBD] specialize block serialization #31868

Uh oh!

l0rinc commented Feb 14, 2025 •

edited

Loading

Uh oh!

DrahtBot commented Feb 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

l0rinc commented Mar 9, 2025

Uh oh!

DrahtBot commented Mar 10, 2025

Uh oh!

l0rinc commented May 4, 2025

Uh oh!

maflcko commented May 4, 2025 •

edited

Loading

Uh oh!

l0rinc commented Jul 29, 2025

Uh oh!

maflcko commented Sep 1, 2025

Uh oh!

l0rinc commented Sep 1, 2025 •

edited

Loading

Uh oh!

maflcko commented Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

[IBD] specialize block serialization #31868

Are you sure you want to change the base?

[IBD] specialize block serialization #31868

Uh oh!

Conversation

l0rinc commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Measurements

Uh oh!

DrahtBot commented Feb 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage & Benchmarks

Reviews

Conflicts

LLM Linter (✨ experimental)

Uh oh!

Uh oh!

l0rinc commented Mar 9, 2025

Uh oh!

DrahtBot commented Mar 10, 2025

Uh oh!

l0rinc commented May 4, 2025

Uh oh!

maflcko commented May 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

l0rinc commented Jul 29, 2025

Uh oh!

maflcko commented Sep 1, 2025

Uh oh!

l0rinc commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

maflcko commented Sep 2, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

l0rinc commented Feb 14, 2025 •

edited

Loading

DrahtBot commented Feb 14, 2025 •

edited

Loading

maflcko commented May 4, 2025 •

edited

Loading

l0rinc commented Sep 1, 2025 •

edited

Loading