validation: fetch block inputs on parallel threads 3x faster IBD #31132
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.

Reviews: See the guideline for information on the review process. If your review is incorrectly listed, please copy-paste […]

Conflicts: Reviewers, this pull request conflicts with the following ones: […] If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Force-pushed from 6d96d87 to 9fcd08e (compare)
🚧 At least one of the CI tasks failed.

Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still happen for reasons unrelated to your change.

Leave a comment here, if you need help tracking down a confusing failure.
Force-pushed from 9fcd08e to fe0fe59 (compare)
Force-pushed from e2feb0c to e9e23b5 (compare)
Concept ACK

I'm still missing tests and benchmarks here, and I think we need to find better default values for SSD and HDD parallelism. I'd also be interested in how coroutines would perform here, instead of trying to find the best batching size manually.
Cool idea.
Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about reusing the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks?
The script validation checks and the inputs fetching mechanism are never done concurrently, because you need the inputs in order to verify the scripts. So, they could share workers.
This should be benchmarked because it might add some overhead, but #26966 introduces such a structure in 401f21bfd72f32a28147677af542887518a4dbff, which we could pull out and use for validation.
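A minimal, self-contained sketch of the kind of general-purpose pool being suggested here (illustrative only, not the structure from #26966; the class and member names are hypothetical):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical shared pool: workers pull arbitrary validation tasks off one queue.
class ThreadPool {
    std::queue<std::function<void()>> m_tasks;
    std::mutex m_mutex;
    std::condition_variable m_cv;
    bool m_stop{false};
    std::vector<std::jthread> m_workers; // declared last so the threads join before the mutex/cv die

public:
    explicit ThreadPool(size_t num_workers)
    {
        for (size_t i = 0; i < num_workers; ++i) {
            m_workers.emplace_back([this] {
                for (;;) {
                    std::function<void()> task;
                    {
                        std::unique_lock lock{m_mutex};
                        m_cv.wait(lock, [this] { return m_stop || !m_tasks.empty(); });
                        if (m_stop && m_tasks.empty()) return;
                        task = std::move(m_tasks.front());
                        m_tasks.pop();
                    }
                    task(); // e.g. verify a batch of scripts, or fetch a batch of inputs
                }
            });
        }
    }

    void Submit(std::function<void()> task)
    {
        {
            std::lock_guard lock{m_mutex};
            m_tasks.push(std::move(task));
        }
        m_cv.notify_one();
    }

    ~ThreadPool()
    {
        {
            std::lock_guard lock{m_mutex};
            m_stop = true;
        }
        m_cv.notify_all(); // workers drain remaining tasks, then the jthreads join themselves
    }
};
```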
Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.
Concept ACK!
Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

    Summary
      'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
        1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0'

Edit: Previous results:

    "command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [39993.343777768874],
    "command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40929.84310861388],

I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

    "command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40579.00445769842],

So it's a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it's not faster.

Edit: Running it on a HDD with a low dbcache value reproduces the original result:

benchmark:

    hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'

    8207d372b2 validation: fetch block inputs in parallel
    92fc718592 coins: allow emplacing non-dirty coins internally

    Summary
      'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
        1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
I'm not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It is also syncing from untrusted network peers, so there is some variance which could also account for the 2% difference.
Force-pushed from 942f300 to 8207d37 (compare)
🚧 At least one of the CI tasks failed.

Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still happen for reasons unrelated to your change.

Leave a comment here, if you need help tracking down a confusing failure.
Force-pushed from a0f6902 to a00dbc5 (compare)
🚧 At least one of the CI tasks failed.

Hints: Try to run the tests locally, according to the documentation. However, a CI failure may still happen for reasons unrelated to your change.

Leave a comment here, if you need help tracking down a confusing failure.
Force-pushed from 7115abe to f7ab210 (compare)
It looks like this was run inside WSL (Linux) and compiled for Linux. I wonder whether this is representative of real end-user performance, since end users normally run native Windows (.exe) binaries?
Force-pushed from 0827d5d to dbeb0b4 (compare)
Force-pushed from ea74623 to 60cfa5d (compare)
Force-pushed from 60cfa5d to b5aa017 (compare)
Measured the performance at tip. This branch is ~64% faster at connecting newly seen blocks than master.

I ran 5 […]

Edit: There was a network outage with the gateway node for 12 hours, and on reconnection all nodes caught up. This skews the results. I restarted the nodes and will get more data.
Force-pushed from dc06842 to b3cb5bb (compare)
Force-pushed from 35e8b6f to 396f784 (compare)
l0rinc left a comment
I like the new cleanup changes and the ASCII art. I only had time and patience to go over it quickly; I hope the comments are useful.
Add a Reset() method to CCoinsViewCache that clears cacheCoins, cachedCoinsUsage, and hashBlock without flushing to the base view. This allows efficiently reusing a cache instance across multiple blocks.

Introduce CoinsViewCacheController to wrap a CCoinsViewCache and provide scoped access through a Handle. The Handle dereferences to the cache and automatically calls Reset() on destruction. This RAII pattern ensures the cache is always properly reset between blocks.

Add m_connect_block_controller as a persistent controller for ConnectBlock, avoiding repeated memory allocations.

Co-authored-by: l0rinc <[email protected]>
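A minimal, self-contained sketch of the RAII pattern described in this commit message (simplified stand-in types; the real CCoinsViewCache and controller live in Bitcoin Core's coins code, and the names here are illustrative):

```cpp
#include <cstddef>
#include <unordered_map>

// Simplified stand-in for CCoinsViewCache: Reset() clears state without flushing.
struct SimpleCoinsCache {
    std::unordered_map<int, int> cacheCoins; // stand-in for the real coin map
    size_t cachedCoinsUsage{0};

    void Reset()
    {
        cacheCoins.clear();
        cachedCoinsUsage = 0;
        // hashBlock would also be cleared in the real cache
    }
};

// Controller owning a persistent cache; Handle gives scoped access and
// guarantees Reset() runs when the handle goes out of scope.
class CacheController {
    SimpleCoinsCache m_cache;

public:
    class Handle {
        SimpleCoinsCache& m_cache;

    public:
        explicit Handle(SimpleCoinsCache& cache) : m_cache{cache} {}
        ~Handle() { m_cache.Reset(); } // always reset between blocks
        SimpleCoinsCache& operator*() { return m_cache; }
        SimpleCoinsCache* operator->() { return &m_cache; }
    };

    Handle GetHandle() { return Handle{m_cache}; }
};

void ConnectOneBlock(CacheController& controller)
{
    auto view = controller.GetHandle();
    view->cacheCoins.emplace(1, 42); // use the cache while connecting a block
} // Handle destructor resets the cache here, so the same instance can be reused
```

Keeping one persistent cache instance and resetting it per block avoids re-allocating the map for every connected block.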
Introduce a helper to look up a Coin through a stack of CCoinsViewCache layers without populating parent caches. This is useful for ephemeral views (e.g. during ConnectBlock) that want to avoid polluting CoinsTip() when validating invalid blocks. Co-authored-by: l0rinc <[email protected]>
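A sketch of the layered-lookup idea (stand-in types; the real helper walks CCoinsViewCache layers, and PeekCoin here only illustrates the described behaviour):

```cpp
#include <map>
#include <optional>

struct OutPoint {
    int txid;
    int n;
    bool operator<(const OutPoint& o) const { return txid != o.txid ? txid < o.txid : n < o.n; }
};
struct Coin { int value; };

// One cache layer over an optional base layer (stand-in for a CCoinsViewCache stack).
struct CacheLayer {
    std::map<OutPoint, Coin> cacheCoins;
    const CacheLayer* base{nullptr};
};

// Walk the stack of layers without inserting into any intermediate cacheCoins map,
// unlike a normal fetch, which would populate every parent cache it passes through.
std::optional<Coin> PeekCoin(const CacheLayer& layer, const OutPoint& out)
{
    for (const CacheLayer* l = &layer; l != nullptr; l = l->base) {
        if (auto it = l->cacheCoins.find(out); it != l->cacheCoins.end()) {
            return it->second; // found here; parent layers stay untouched
        }
    }
    return std::nullopt; // absent from every layer (a real stack would hit disk at the bottom)
}
```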
Introduce CoinsViewCacheAsync, a CCoinsViewCache subclass that reads coins without mutating the underlying cache. Override GetCoinFromBase() to call PeekCoin() instead of GetCoin(). This prevents the main cache from caching inputs pulled from disk for a block that has not yet been fully validated. Once Flush() is called on the async cache, these inputs will be added as spent to cacheCoins in the main cache via BatchWrite().

Rename CoinsViewCacheController to CoinsViewCacheAsyncController and change it to wrap CoinsViewCacheAsync internally. Update m_connect_block_controller to use the new async controller.

This is the foundation for async input fetching, where worker threads must not mutate shared state.

Co-authored-by: l0rinc <[email protected]>
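A sketch of the "non-mutating read" override described above (simplified stand-in classes; the real GetCoinFromBase()/PeekCoin() signatures differ):

```cpp
#include <map>
#include <optional>

struct Coin { int value; };

// Stand-in for CCoinsViewCache: the normal read caches what it fetched.
class CacheBase {
protected:
    std::map<int, Coin> m_base;       // pretend this is the backing view
    std::map<int, Coin> m_cacheCoins; // this layer's cache

public:
    virtual ~CacheBase() = default;

    virtual std::optional<Coin> GetCoinFromBase(int outpoint)
    {
        auto it = m_base.find(outpoint);
        if (it == m_base.end()) return std::nullopt;
        m_cacheCoins.emplace(outpoint, it->second); // mutating read: caches the result
        return it->second;
    }

    // Non-mutating lookup, analogous to the PeekCoin helper from the previous commit.
    std::optional<Coin> PeekCoin(int outpoint) const
    {
        auto it = m_base.find(outpoint);
        if (it == m_base.end()) return std::nullopt;
        return it->second;
    }
};

// Stand-in for CoinsViewCacheAsync: reads never touch the shared cache, so an
// unvalidated (possibly invalid) block cannot pollute it.
class CacheAsync : public CacheBase {
public:
    std::optional<Coin> GetCoinFromBase(int outpoint) override
    {
        return PeekCoin(outpoint); // no insertion into m_cacheCoins
    }
};
```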
Refactor TestCoinsView() to accept the cache as a parameter instead of creating it internally. This prepares for adding CoinsViewCacheAsync fuzz targets that need to pass in a different cache type. This is a non-functional change. Co-authored-by: l0rinc <[email protected]>
Add StartFetching() to CoinsViewCacheAsync that populates a queue of all transaction inputs in a block, then fetches them all via ProcessInput() before entering ConnectBlock. GetCoinFromBase() now checks this queue first.

Update the controller's Handle to call StartFetching(block) in its constructor, binding the fetched inputs to the lifetime of the Handle. Rename Start() to StartFetching(block) in the controller. When the Handle goes out of scope, Reset() is called, which clears the fetched inputs.

Introduce InputToFetch struct to track each input's outpoint and fetched coin. GetCoinFromBase() scans the queue sequentially, matching ConnectBlock's access pattern where inputs are processed in block order. ProcessInput() fetches coins one at a time using PeekCoin(), preparing for parallel execution in later commits.

Also add fuzz targets for CoinsViewCacheAsync and update unit tests to use StartFetching(block).

Co-authored-by: sedited <[email protected]> Co-authored-by: l0rinc <[email protected]>

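A single-threaded sketch of the queue described here, under simplified assumptions (AsyncFetchQueue, FetchFromDisk, and GetCoinFromQueue are illustrative names; only InputToFetch, StartFetching, and ProcessInput come from the commit):

```cpp
#include <optional>
#include <vector>

struct OutPoint { int txid; int n; };
struct Coin { int value; };

// One entry per block input: the outpoint to fetch plus the fetched coin (if any).
struct InputToFetch {
    OutPoint outpoint;
    std::optional<Coin> coin;
    bool fetched{false};
};

class AsyncFetchQueue {
    std::vector<InputToFetch> m_inputs;
    size_t m_scan_pos{0}; // ConnectBlock consumes inputs roughly in block order

    std::optional<Coin> FetchFromDisk(const OutPoint&) { return Coin{0}; } // placeholder

public:
    // Collect every prevout of the block into the queue before ConnectBlock runs.
    void StartFetching(const std::vector<OutPoint>& block_prevouts)
    {
        m_inputs.clear();
        m_scan_pos = 0;
        for (const auto& out : block_prevouts) m_inputs.push_back({out, std::nullopt, false});
    }

    // Fetch one unfetched input per call; repeated calls drain the queue.
    void ProcessInput()
    {
        for (auto& in : m_inputs) {
            if (!in.fetched) {
                in.coin = FetchFromDisk(in.outpoint);
                in.fetched = true;
                return;
            }
        }
    }

    // Sequential scan that matches ConnectBlock's in-order access pattern.
    std::optional<Coin> GetCoinFromQueue(const OutPoint& out)
    {
        for (size_t i = m_scan_pos; i < m_inputs.size(); ++i) {
            auto& in = m_inputs[i];
            if (in.outpoint.txid == out.txid && in.outpoint.n == out.n) {
                if (!in.fetched) { in.coin = FetchFromDisk(in.outpoint); in.fetched = true; }
                m_scan_pos = i + 1;
                return in.coin;
            }
        }
        return std::nullopt; // not queued; caller falls back to a normal fetch
    }
};
```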
Add a benchmark measuring CoinsViewCacheAsync performance when fetching inputs for a block. Creates a realistic scenario by adding all inputs from block 413567 to a chainstate with an in-memory leveldb backend. Measures the time to access all inputs through the cache. Co-authored-by: l0rinc <[email protected]>
Skip fetching inputs that spend outputs created earlier in the same block, since these coins won't exist in the cache or database yet. Store the first 8 bytes of each transaction's txid in a sorted vector for O(log n) binary search lookups. Using truncated txids is a performance optimization; in the rare case of a collision, the input simply won't be prefetched and will fall back to normal fetching on the main thread.

This adds a performance regression due to the extra sorting and filtering. Since the benchmark uses an in-memory leveldb, there is no real disk I/O that is avoided.

> bench: add CoinsViewCacheAsync benchmark

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 1,664,383.00 | 600.82 | 2.4% | 31,957,257.00 | 4,017,069.00 | 7.955 | 5,318,396.00 | 0.6% | 0.02 | CoinsViewCacheAsyncBenchmark |

> coins: filter inputs created in same block before fetching

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 1,970,543.00 | 507.47 | 4.0% | 32,640,039.00 | 4,760,784.00 | 6.856 | 5,506,291.00 | 1.2% | 0.02 | CoinsViewCacheAsyncBenchmark |

Co-authored-by: l0rinc <[email protected]>
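A sketch of the truncated-txid filter described above (self-contained; Txid, TruncatedKey, BuildSameBlockFilter, and CreatedInSameBlock are illustrative names, not the PR's actual identifiers):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

using Txid = std::array<uint8_t, 32>;

// Take the first 8 bytes of a txid as a cheap, sortable key.
static uint64_t TruncatedKey(const Txid& txid)
{
    uint64_t key;
    std::memcpy(&key, txid.data(), sizeof(key));
    return key;
}

// Build a sorted vector of truncated txids for every transaction in the block.
std::vector<uint64_t> BuildSameBlockFilter(const std::vector<Txid>& block_txids)
{
    std::vector<uint64_t> keys;
    keys.reserve(block_txids.size());
    for (const auto& txid : block_txids) keys.push_back(TruncatedKey(txid));
    std::sort(keys.begin(), keys.end());
    return keys;
}

// O(log n) check: if the prevout's txid was (apparently) created in this block,
// skip prefetching it. A rare 8-byte collision only skips a prefetch; correctness
// is unaffected because the main thread falls back to a normal fetch.
bool CreatedInSameBlock(const std::vector<uint64_t>& filter, const Txid& prevout_txid)
{
    return std::binary_search(filter.begin(), filter.end(), TruncatedKey(prevout_txid));
}
```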
Allow the main thread to process unfetched inputs while waiting for a specific coin. Instead of blocking, GetCoinFromBase() calls ProcessInput() to make forward progress on the queue. This prepares for parallel fetching where the main thread can help workers complete the queue rather than idling while waiting. Co-authored-by: l0rinc <[email protected]>
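The "help instead of block" idea in a few lines (standalone sketch; PendingCoin, WaitForCoin, and process_one are hypothetical names for the mechanism the commit describes):

```cpp
#include <atomic>
#include <functional>

struct PendingCoin {
    std::atomic<bool> ready{false};
    int value{0};
};

// Instead of blocking on a not-yet-fetched coin, the waiting thread keeps making
// forward progress on the shared queue (process_one fetches one more queued input).
int WaitForCoin(PendingCoin& pending, const std::function<void()>& process_one)
{
    while (!pending.ready.load(std::memory_order_acquire)) {
        process_one(); // help drain the queue instead of idling
    }
    return pending.value;
}
```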
Restructure TestCoinsView() to perform all checks that don't mutate the backend before accessing backend_coins_view with HaveCoin()/GetCoin(). This prepares for CoinsViewCacheAsync testing, where we want to run as many checks as possible while async fetching is still active. Only at the very end do we call StopFetching() and perform the backend consistency checks that require mutating calls (HaveCoin/GetCoin call FetchCoin which writes to cacheCoins). Non-mutating operations like GetBestBlock(), EstimateSize(), and Cursor() can safely run on the backend while workers are still fetching. Co-authored-by: l0rinc <[email protected]>
Rename ProcessInput to ProcessInputInBackground. Add thread-safe synchronization primitives so that any thread can safely call ProcessInputInBackground once all threads arrive_and_wait() at a std::barrier.

Make m_input_head a std::atomic_uint32_t, so workers can claim inputs atomically in ProcessInputInBackground. Make the ready flag a std::atomic_flag per InputToFetch to act as an atomic memory fence: workers release and the main thread acquires the flag to ensure the coin is seen correctly no matter which thread has written it.

Add a StopFetching() private method that skips all remaining inputs, waits for all threads to arrive at the std::barrier, and resets all state in CoinsViewCacheAsync. Override Flush(), Sync(), and SetBackend() to call StopFetching() before calling the CCoinsViewCache base class methods. This ensures no worker threads can access base while it is being mutated.

Co-authored-by: l0rinc <[email protected]>
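A C++20 sketch of the atomic claim/publish scheme described above (SharedQueue, FetchFromDisk, and WaitForInput are illustrative names; only ProcessInputInBackground and InputToFetch come from the commit):

```cpp
#include <atomic>
#include <cstdint>
#include <optional>
#include <vector>

struct Coin { int value; };

struct InputToFetch {
    int outpoint{0};
    std::optional<Coin> coin;
    std::atomic_flag ready; // release/acquire fence around 'coin' (clear by default in C++20)
};

struct SharedQueue {
    std::vector<InputToFetch> inputs;
    std::atomic<uint32_t> head{0}; // next unclaimed index
};

std::optional<Coin> FetchFromDisk(int /*outpoint*/) { return Coin{0}; } // placeholder

// Any thread may call this: it atomically claims the next input, fetches it,
// then publishes the result by setting the flag with release semantics.
bool ProcessInputInBackground(SharedQueue& queue)
{
    const uint32_t idx = queue.head.fetch_add(1, std::memory_order_relaxed);
    if (idx >= queue.inputs.size()) return false; // nothing left to claim

    InputToFetch& in = queue.inputs[idx];
    in.coin = FetchFromDisk(in.outpoint);
    in.ready.test_and_set(std::memory_order_release); // publish the fetched coin
    return true;
}

// The consumer pairs the release above with an acquire before reading 'coin'.
const std::optional<Coin>& WaitForInput(InputToFetch& in)
{
    while (!in.ready.test(std::memory_order_acquire)) { /* spin, or help fetch */ }
    return in.coin;
}
```

The release/acquire pair on the per-input flag is what lets the coin itself be a plain (non-atomic) field: whoever sets the flag has already finished writing the coin.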
Spawn a fixed pool of worker threads (default 4) that fetch coins in parallel. Workers wait at the barrier until StartFetching() signals work is available, then race to claim and fetch inputs from the queue. Once all inputs have been fetched, the workers wait at the barrier until the main thread arrives via StopFetching(). The destructor arrives at the barrier a final time with an empty m_inputs, which signals to the threads to exit their loop.

> coins: filter inputs created in same block before fetching

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 1,970,543.00 | 507.47 | 4.0% | 32,640,039.00 | 4,760,784.00 | 6.856 | 5,506,291.00 | 1.2% | 0.02 | CoinsViewCacheAsyncBenchmark |

> validation: fetch inputs on parallel threads

| ns/op | op/s | err% | ins/op | cyc/op | IPC | bra/op | miss% | total | benchmark |
|------:|-----:|-----:|-------:|-------:|----:|-------:|------:|------:|:----------|
| 1,601,969.00 | 624.23 | 2.9% | 8,345,989.00 | 2,232,468.00 | 3.738 | 1,089,340.00 | 1.8% | 0.03 | CoinsViewCacheAsyncBenchmark |

Co-authored-by: l0rinc <[email protected]>
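A toy C++20 version of that worker lifecycle (WorkerPool and its members are hypothetical; the real class keeps the actual input queue, whereas here the fetch itself is elided to a comment):

```cpp
#include <atomic>
#include <barrier>
#include <cstdint>
#include <thread>
#include <vector>

// N workers plus the main thread meet at a barrier, race over a shared counter
// of queued inputs, then meet again before the main thread may mutate anything.
class WorkerPool {
    static constexpr int kWorkers{4};
    std::barrier<> m_barrier{kWorkers + 1}; // workers + main thread
    std::atomic<uint32_t> m_head{0};
    uint32_t m_work_count{0}; // 0 means "no work": workers exit their loop
    std::vector<std::jthread> m_threads;

    void WorkerLoop()
    {
        while (true) {
            m_barrier.arrive_and_wait();   // wait for StartFetching() or the destructor
            if (m_work_count == 0) return; // empty queue: time to shut down
            while (m_head.fetch_add(1, std::memory_order_relaxed) < m_work_count) {
                // fetch the input at the claimed index here
            }
            m_barrier.arrive_and_wait();   // wait for StopFetching() on the main thread
        }
    }

public:
    WorkerPool()
    {
        for (int i = 0; i < kWorkers; ++i) m_threads.emplace_back([this] { WorkerLoop(); });
    }

    void StartFetching(uint32_t num_inputs)
    {
        m_head.store(0, std::memory_order_relaxed);
        m_work_count = num_inputs;
        m_barrier.arrive_and_wait(); // release the workers
    }

    void StopFetching()
    {
        m_barrier.arrive_and_wait(); // rejoin the workers once the queue is drained
    }

    ~WorkerPool()
    {
        m_work_count = 0;
        m_barrier.arrive_and_wait(); // final rendezvous with no work: workers exit
    }
};
```

Each block costs two barrier phases (start and stop), and the barrier's synchronization is what makes the plain m_work_count write visible to the workers.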
Force-pushed from 396f784 to 13d32ed (compare)
Parts of this PR are isolated in independent smaller PRs to ease review:
This PR parallelizes fetching all input prevouts of a block during block connection, achieving over 3x faster IBD performance in some scenarios [1][2][3][4][5].
Problem
Currently, when fetching inputs in ConnectBlock, each input is fetched from the cache sequentially. A cache miss requires a round trip to the disk database to fetch the outpoint and insert it into the cache. Since the database is read-only during ConnectBlock, we can fetch all inputs of a block in parallel on multiple threads while connecting.
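For contrast, a stand-in sketch of that sequential baseline (illustrative types only; the real path goes through CCoinsViewCache and the leveldb-backed base view):

```cpp
#include <map>
#include <vector>

struct Coin { int value; };

struct SequentialCache {
    std::map<int, Coin> cacheCoins; // in-memory cache
    std::map<int, Coin> disk_db;    // stand-in for the on-disk coins database

    // Each miss pays a full "disk" round trip before the next input is even looked at.
    const Coin& FetchCoin(int outpoint)
    {
        auto it = cacheCoins.find(outpoint);
        if (it == cacheCoins.end()) {
            const Coin from_disk = disk_db.at(outpoint); // blocking round trip
            it = cacheCoins.emplace(outpoint, from_disk).first;
        }
        return it->second;
    }
};

void ConnectBlockInputsSequentially(SequentialCache& cache, const std::vector<int>& prevouts)
{
    for (int out : prevouts) (void)cache.FetchCoin(out); // one input at a time
}
```

Solution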
We introduce a new CoinsViewCacheAsync, a CCoinsViewCache subclass that manages worker threads to fetch block inputs in parallel. The block is passed to the CoinsViewCacheAsync view before entering ConnectBlock, which kicks off the worker threads to begin fetching the inputs. The view is then passed to ConnectBlock since it provides the same API as CCoinsViewCache. The cache returns fetched coins as they become available via the overridden FetchCoin method. If not available yet, the main thread also fetches coins as it waits.
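A rough shape of that control flow, under simplified assumptions (stand-in types; StartFetching/StopFetching and FetchCoin mirror the names used in this PR, everything else is illustrative):

```cpp
#include <vector>

// The async view derives from the normal cache, so ConnectBlock can consume it
// through the base-class interface unchanged.
struct Block { std::vector<int> prevouts; };

struct CoinsCache {
    virtual ~CoinsCache() = default;
    virtual int FetchCoin(int outpoint) { return outpoint; } // synchronous placeholder
};

struct AsyncCoinsCache : CoinsCache {
    void StartFetching(const Block&) { /* queue every prevout, wake the workers */ }
    void StopFetching() { /* drain the queue, park the workers */ }
    int FetchCoin(int outpoint) override
    {
        // Return the coin if a worker already fetched it; otherwise the main
        // thread fetches it itself while it waits.
        return outpoint;
    }
};

bool ConnectBlock(const Block& block, CoinsCache& view)
{
    for (int prevout : block.prevouts) (void)view.FetchCoin(prevout);
    return true;
}

bool ConnectBlockParallel(const Block& block, AsyncCoinsCache& async_view)
{
    async_view.StartFetching(block);                 // kick off workers before ConnectBlock
    const bool ok = ConnectBlock(block, async_view); // same CCoinsViewCache-style API
    async_view.StopFetching();                       // park workers before any mutation/flush
    return ok;
}
```

Implementation Details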
The CoinsViewCacheAsync implements a lock-free MPSC (Multiple Producer, Single Consumer) queue design:

- Worker threads race to claim inputs from the queue and fetch the coins in parallel.
- The main thread emplaces each fetched Coin into the cacheCoins map as they become available.
- Teardown happens through the Reset method, which ensures all threads are parked before proceeding.
Safety and Correctness

- CoinsViewCacheAsync works on a block that has not been fully validated, but it does not interfere with or modify any of the validation during ConnectBlock.
- Inputs pulled through for an invalid block are not added to the CoinsTip() cache.

Performance
Benchmarks show over 3x faster IBD performance in a cloud environment with network-connected storage [1], 3x faster IBD on an M4 Mac, and up to 46% faster with directly connected storage [2][3][4][5]. The parallelization of expensive disk lookups provides a significant speedup.
Flamegraphs show how the execution is changed.
Credits
Inspired by this comment.
Resolves #34121.
Footnotes
1. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3678847806
2. https://github.com/bitcoin/bitcoin/pull/31132#pullrequestreview-3515011880
3. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3463894000
4. https://github.com/bitcoin/bitcoin/pull/31132#pullrequestreview-3347436866
5. https://github.com/bitcoin/bitcoin/pull/31132#issuecomment-3488760623