Skip to content

Conversation

@andrewtoth
Copy link
Contributor

@andrewtoth andrewtoth commented Oct 22, 2024

Parts of this PR are isolated in independent smaller PRs to ease review:


This PR parallelizes fetching all input prevouts of a block during block connection, achieving over 3x faster IBD performance in some scenarios12345.

Problem

Currently, when fetching inputs in ConnectBlock, each input is fetched from the cache sequentially. A cache miss requires a round trip to the disk database to fetch the outpoint and insert it into the cache. Since the database is read-only during ConnectBlock, we can fetch all inputs of a block in parallel on multiple threads while connecting.

Solution

We introduce a new CoinsViewCacheAsync CCoinsViewCache subclass that manages worker threads to fetch block inputs in parallel. The block is passed to the CoinsViewCacheAsync view before entering ConnectBlock, which kicks off the worker threads to begin fetching the inputs. The view is then passed to ConnectBlock since it provides the same API as CCoinsViewCache. The cache returns fetched coins as they become available via the overridden FetchCoin method. If not available yet, the main thread also fetches coins as it waits.

Implementation Details

The CoinsViewCacheAsync implements a lock-free MPSC (Multiple Producer, Single Consumer) queue design:

  • Work Distribution: Collects all input prevouts into a queue and uses a barrier to start all worker threads simultaneously
  • Synchronization: Worker threads use an atomic counter to claim which inputs to fetch, and each input has an atomic flag to signal completion to the main thread
  • Main Thread Processing: The main thread waits for inputs in order and moves their Coin into the cacheCoins map as they become available.
  • Work Stealing: If the main thread catches up to the workers, it assists with fetching to maximize parallelism
  • Completion: All threads synchronize on a barrier via a Reset method, which ensures all threads are parked before proceeding

Safety and Correctness

  • The CoinsViewCacheAsync works on a block that has not been fully validated, but it does not interfere or modify any of the validation during ConnectBlock
  • It simply fetches inputs in parallel, which must be fetched before a transaction is validated anyways
  • Invalid blocks: If an invalid block is mined, the temporary cache is reset without being flushed. This is an improvement over the current behavior, where existing inputs are inserted into the main CoinsTip() cache when pulled through for an invalid block

Performance

Benchmarks show over 3x faster IBD performance in a cloud environment with network connected storage1, 3x faster IBD for an M4 Mac and up to 46% faster with directly connected storage2345. The parallelization of expensive disk lookups provides significant speedup.

Flamegraphs show how the execution is changed.

Credits

Inspired by this comment.

Resolves #34121.

Footnotes

  1. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3678847806 2

  2. https://github.com/bitcoin/bitcoin/pull/31132#pullrequestreview-3515011880 2

  3. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3463894000 2

  4. https://github.com/bitcoin/bitcoin/pull/31132#pullrequestreview-3347436866 2

  5. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3488760623 2

@DrahtBot
Copy link
Contributor

DrahtBot commented Oct 22, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.

Reviews

See the guideline for information on the review process.

Type Reviewers
Concept ACK ismaelsadeeq, Raimo33, l0rinc

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #34165 (coins: don't mutate main cache when connecting block by andrewtoth)
  • #34164 (validation: add reusable coins view for ConnectBlock by andrewtoth)
  • #34132 (refactor: inline CCoinsViewErrorCatcher into CCoinsViewDB by l0rinc)
  • #34124 (refactor: make CCoinsView a purely virtual abstract base class by l0rinc)
  • #32317 (kernel: Separate UTXO set access from validation functions by sedited)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/31894441286

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@andrewtoth andrewtoth changed the title validation: fetch block inputs parallel threads ~17% faster IBD validation: fetch block inputs on parallel threads ~17% faster IBD Oct 22, 2024
@andrewtoth andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from e2feb0c to e9e23b5 Compare October 22, 2024 18:46
Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK

I'm still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I'd be interested in how coroutines would perform here instead of trying to find the best batching size manually.

Copy link
Member

@furszy furszy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool idea.

Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks?
The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

@andrewtoth
Copy link
Contributor Author

implementing a general-purpose thread pool shared among the validation checks?

Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

#26966 introduces such structure inside 401f21b, which we could pull off and use for validation.

Concept ACK!

@l0rinc
Copy link
Contributor

l0rinc commented Oct 24, 2024

Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

Summary
'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
 1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=
0'

f278ca4ec3 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours)
e9e23b59f8 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)


So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.
Maybe we could add a new config option (iothreads or iothreadmultiplier or something).
The defaults should likely depend on whether it's an SSD or HDD.


Edit:

Previous results
"command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [39993.343777768874],

"command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [40929.84310861388],

I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

"command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
"times": [40579.00445769842],

So it's a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it's not faster.


Edit:

Running it on a HDD with a low dbcache value reproduces the original result:

benchmark
hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
8207d372b2 validation: fetch block inputs in parallel
92fc718592 coins: allow emplacing non-dirty coins internally
Summary
  'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
    1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'

@andrewtoth
Copy link
Contributor Author

So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.

I'm not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It also is syncing from untrusted network peers, so there is some variance which could also account for the 2% difference.

@andrewtoth andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from 942f300 to 8207d37 Compare October 24, 2024 19:25
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/32027275494

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@andrewtoth andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from a0f6902 to a00dbc5 Compare October 26, 2024 18:51
@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/32107893176

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@andrewtoth andrewtoth force-pushed the threaded-inputs branch 4 times, most recently from 7115abe to f7ab210 Compare October 26, 2024 23:07
@maflcko
Copy link
Member

maflcko commented Jan 2, 2026

Now that I have access to a Windows benchmarking server, managed to run a few rounds

It looks like this is run inside WSL (Linux) and compiled for Linux. I wonder if this is meaningful of real end-user performance, which normally run native Windows (.exe) binaries?

@andrewtoth
Copy link
Contributor Author

andrewtoth commented Jan 3, 2026

Measured the performance at tip. This branch is ~64% faster connecting newly seen blocks than master.

Node 1 Node 2 Average
master 273.80ms/blk 319.19ms/blk 296.50ms/blk
branch 179.89ms/blk 181.45ms/blk 180.67ms/blk

I ran 5 t3.small AWS instances with 20 GB gp2 EBS volumes attached. I ran all nodes pruned to 550, with the exact same blocks, chainstate, and mempool.dat uploaded to them. I ran 2 nodes at master, and 2 nodes at this branch. These 4 nodes only connected to the 5th node, which itself connected to a trusted node outside the VPC. Using debug=bench, we can see the cumulative block connection speed in the debug logs. These are linked in the table above.

Edit: There was a network outage with the gateway node for 12 hours, and on connection all nodes caught up. This skews results. Restarted the nodes and will get more data.

@andrewtoth andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from dc06842 to b3cb5bb Compare January 5, 2026 01:36
@andrewtoth andrewtoth force-pushed the threaded-inputs branch 2 times, most recently from 35e8b6f to 396f784 Compare January 11, 2026 18:13
Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new cleanup changes and the ASCII art, I only had time and patience to quickly go over it, hope the comments are useful.

@l0rinc l0rinc mentioned this pull request Jan 14, 2026
23 tasks
andrewtoth and others added 11 commits January 14, 2026 13:54
Add a Reset() method to CCoinsViewCache that clears cacheCoins,
cachedCoinsUsage, and hashBlock without flushing to the base view.
This allows efficiently reusing a cache instance across multiple blocks.

Introduce CoinsViewCacheController to wrap a CCoinsViewCache and provide
scoped access through a Handle. The Handle dereferences to the cache and
automatically calls Reset() on destruction. This RAII pattern ensures the
cache is always properly reset between blocks.

Add m_connect_block_controller as a persistent controller for ConnectBlock,
avoiding repeated memory allocations.

Co-authored-by: l0rinc <[email protected]>
Introduce a helper to look up a Coin through a stack of CCoinsViewCache layers without populating parent caches.

This is useful for ephemeral views (e.g. during ConnectBlock) that want to avoid polluting CoinsTip() when validating invalid blocks.

Co-authored-by: l0rinc <[email protected]>
Introduce CoinsViewCacheAsync, a CCoinsViewCache subclass that reads coins
without mutating the underlying cache.

Override GetCoinFromBase() to call PeekCoin() instead of GetCoin(). This
prevents the main cache from caching inputs pulled from disk for a block that
has not yet been fully validated. Once Flush() is called on the async cache,
these inputs will be added as spent to cacheCoins in the main cache via
BatchWrite().

Rename CoinsViewCacheController to CoinsViewCacheAsyncController and change
it to wrap CoinsViewCacheAsync internally. Update m_connect_block_controller
to use the new async controller.

This is the foundation for async input fetching, where worker threads must not
mutate shared state.

Co-authored-by: l0rinc <[email protected]>
Refactor TestCoinsView() to accept the cache as a parameter instead of
creating it internally. This prepares for adding CoinsViewCacheAsync
fuzz targets that need to pass in a different cache type.

This is a non-functional change.

Co-authored-by: l0rinc <[email protected]>
Add StartFetching() to CoinsViewCacheAsync that populates a queue of all
transaction inputs in a block, then fetches them all via ProcessInput()
before entering ConnectBlock. GetCoinFromBase() now checks this queue first.

Update the controller's Handle to call StartFetching(block) in its constructor,
binding the fetched inputs to the lifetime of the Handle. Rename Start() to
StartFetching(block) in the controller. When the Handle goes out of scope,
Reset() is called which clears the fetched inputs.

Introduce InputToFetch struct to track each input's outpoint and fetched coin.
GetCoinFromBase() scans the queue sequentially, matching ConnectBlock's access
pattern where inputs are processed in block order.

ProcessInput() fetches coins one at a time using PeekCoin(),
preparing for parallel execution in later commits.

Also add fuzz targets for CoinsViewCacheAsync and update unit tests to use
StartFetching(block).

Co-authored-by: sedited <[email protected]>
Co-authored-by: l0rinc <[email protected]>
Add a benchmark measuring CoinsViewCacheAsync performance when fetching
inputs for a block. Creates a realistic scenario by adding all inputs from
block 413567 to a chainstate with an in memory leveldb backend.

Measures the time to access all inputs through the cache.

Co-authored-by: l0rinc <[email protected]>
Skip fetching inputs that spend outputs created earlier in the same block,
since these coins won't exist in the cache or database yet.

Store the first 8 bytes of each transaction's txid in a sorted vector for
O(log n) binary search lookups. Using truncated txids is a performance
optimization; in the rare case of a collision, the input simply won't be
prefetched and will fall back to normal fetching on the main thread.

This adds a performance regression due to the extra sorting and filtering.
Since the benchmark uses an in-memory leveldb, there is no real disk I/O that is avoided.

> bench: add CoinsViewCacheAsync benchmark
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,664,383.00 |              600.82 |    2.4% |   31,957,257.00 |    4,017,069.00 |  7.955 |   5,318,396.00 |    0.6% |      0.02 | CoinsViewCacheAsyncBenchmark

> coins: filter inputs created in same block before fetching
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,970,543.00 |              507.47 |    4.0% |   32,640,039.00 |    4,760,784.00 |  6.856 |   5,506,291.00 |    1.2% |      0.02 | CoinsViewCacheAsyncBenchmark

Co-authored-by: l0rinc <[email protected]>
Allow the main thread to process unfetched inputs while waiting for a
specific coin. Instead of blocking, GetCoinFromBase() calls ProcessInput() to
make forward progress on the queue.

This prepares for parallel fetching where the main thread can help workers
complete the queue rather than idling while waiting.

Co-authored-by: l0rinc <[email protected]>
Restructure TestCoinsView() to perform all checks that don't mutate the
backend before accessing backend_coins_view with HaveCoin()/GetCoin().

This prepares for CoinsViewCacheAsync testing, where we want to run as many
checks as possible while async fetching is still active. Only at the very
end do we call StopFetching() and perform the backend consistency checks
that require mutating calls (HaveCoin/GetCoin call FetchCoin which writes
to cacheCoins).

Non-mutating operations like GetBestBlock(), EstimateSize(), and Cursor()
can safely run on the backend while workers are still fetching.

Co-authored-by: l0rinc <[email protected]>
Rename ProcessInput to ProcessInputInBackground.

Add thread-safe synchronization primitives to allow any thread to safely call
ProcessInputInBackground once all threads arrive_and_wait() a
std::barrier.

Make m_input_head a std::atomic_unit32_t, so workers can claim inputs
atomically in ProcessInputInBackground.

Make ready flag a std::atomic_flag per InputToFetch to act as an atomic
memory fence. Workers release and the main thread acquires the flag to
ensure the coin is seen correctly no matter which thread has written it.

Add StopFetching() private method that skips all remaining inputs, waits for
all threads to arrive at the std::barrier, and resets all state in
CoinsViewCacheAsync.

Override Flush(), Sync(), and SetBackend() to call StopFetching() before
calling CCoinsViewCache base class methods. This ensures no worker threads
can access base while it is being mutated.

Co-authored-by: l0rinc <[email protected]>
Spawn a fixed pool of worker threads (default 4) that fetch coins in parallel.
Workers wait at the barrier until StartFetching() signals work is available,
then race to claim and fetch inputs from the queue.

Once all inputs have been fetched, the workers wait at the barrier until the
main thread arrives via StopFetching().

The destructor arrives at the barrier a final time with an empty m_inputs,
which signals to the threads to exit their loop.

> coins: filter inputs created in same block before fetching
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,970,543.00 |              507.47 |    4.0% |   32,640,039.00 |    4,760,784.00 |  6.856 |   5,506,291.00 |    1.2% |      0.02 | CoinsViewCacheAsyncBenchmark
> validation: fetch inputs on parallel threads
|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|        1,601,969.00 |              624.23 |    2.9% |    8,345,989.00 |    2,232,468.00 |  3.738 |   1,089,340.00 |    1.8% |      0.03 | CoinsViewCacheAsyncBenchmark

Co-authored-by: l0rinc <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Syncing takes many days with state of the art hardware now - Reviving bootstrapping as a solution?