(RFC) kernel: Replace leveldb-based BlockTreeDB with flat-file based store #32427
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/32427.

Reviews: see the guideline for information on the review process. If your review is incorrectly listed, please copy-paste

Conflicts: Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Approach ACK. The codebase changes seem surprisingly small for this proposal. |
Thanks @TheCharlatan for this proposal and for making the code changes!
Please find below my review comments, suggestions, and some clarifying questions:
1. How will concurrent reads (and potentially writes) be handled with the flat file format?
2. Even if no deletion occurs, file corruption or partial writes can happen. Are you planning to use mmap or memory buffering?
3. What is the corruption recovery strategy? While the write-ahead log is mentioned as future work, providing even a minimal rollback/recovery mechanism in the initial version would make this stronger.
4. What is the migration path? Will there be tooling or a migration process from existing leveldb-based data?
5. Data integrity guarantees: are checksums or hash-based verifications being added per entry or per file?
6. Consider including a pluggable interface that allows falling back to LevelDB for testing or backwards compatibility.
7. Write contention and corruption risks: while flat files avoid LevelDB's process-level locking, concurrent writes require a mechanism (e.g., file locks such as flock()) to prevent race conditions.
8. Portability and cross-platform edge cases: ensure the file-locking mechanisms (e.g., fcntl on Unix, LockFileEx on Windows) are robust.
9. How about handling large files? Test edge cases like file sizes approaching OS limits (e.g., 2+ GB on 32-bit systems). What would happen in such cases?
10. Along the same lines as point 5, a corruption detection mechanism could also be implemented to catch corruption very early in the cycle.
The flat-file approach is a reasonable trade-off given the simplicity of the block tree storage requirements.
However, there are significant challenges and risks involved with this approach, as highlighted in the ten comments above (concurrency, corruption/integrity, performance, migration, corruption detection, portability, backwards compatibility, etc.). |
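On the data-integrity points (5 and 10), here is a minimal sketch of what a per-entry checksum frame could look like. The CRC32C implementation below is a plain software version for illustration only (Bitcoin Core vendors an optimized crc32c library); the framing layout is a hypothetical example, not the PR's actual format.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Software CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
// Illustration only; a real implementation would use an optimized library.
uint32_t Crc32c(const uint8_t* data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Frame an entry as [payload][4-byte little-endian checksum] so that
// corruption and partial writes can be detected when loading the file.
std::vector<uint8_t> FrameEntry(const std::vector<uint8_t>& payload)
{
    std::vector<uint8_t> out = payload;
    const uint32_t crc = Crc32c(payload.data(), payload.size());
    for (int i = 0; i < 4; ++i) out.push_back(static_cast<uint8_t>(crc >> (8 * i)));
    return out;
}

// Returns true and extracts the payload only if the checksum matches.
bool VerifyEntry(const std::vector<uint8_t>& framed, std::vector<uint8_t>& payload_out)
{
    if (framed.size() < 4) return false;
    uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
        stored |= static_cast<uint32_t>(framed[framed.size() - 4 + i]) << (8 * i);
    }
    payload_out.assign(framed.begin(), framed.end() - 4);
    return Crc32c(payload_out.data(), payload_out.size()) == stored;
}
```

A torn write that truncates or flips bytes in an entry then fails verification on load, which is the "detect corruption very early" behavior asked for above.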
|
Concept ACK
A concrete example is index building in electrs / esplora / etc. Electrs does this today by:
I started on a PoC for Electrs to use libbitcoinkernel for index building to demonstrate how this could be done much more efficiently and saw promising results. However, the requirement that Bitcoin Core be shut down before Electrs could process the block files made this approach clunky. I'll revive this PoC as a means of testing this PR and hopefully provide some use case motivated feedback on the approach. |
|
Have you considered simply having one block per file? Typical blk files are 130 MB, so for "modern" blocks it would increase the file count by roughly 50x. But is that actually a problem? It's a lot simpler if we can just have |
I'm not sure what you are suggesting here. Are you suggesting we create ~900k files and then have some subdivision within those files into 10k groups where each of those has a single |
Yes. If we're going to redesign block storage, it seems good to ask why we can't just let the file system handle things.
One block per file. We can still have a single file for the block index, which would contain the header (not just the hash) and validation state. Block files themselves would be in a predictable location, so we wouldn't need an index for that.
Do you mean compared to the alternative of having a single file for all blocks? I would imagine that would create I/O problems, since the operating system wouldn't know which part of the big file changed. And it can't defragment it. Having one file per block makes pruning marginally easier than now, since you don't have to worry about keeping nearby blocks in the same file. One downside of what I'm suggesting is that the headers would either be stored redundantly (in the block file as well as in the index), or anyone parsing the block files has to prepend the header themselves. |
|
How worried are we about file corruption here? I thought the main reason we use leveldb and sqlite databases in places like this, where we don't need indexing, is that they support atomic updates: you can pull the power cord at any time and on the next reboot you will see some consistent view of the data, even if it's not the latest data.

I didn't look too closely at the implementation here, but it seems like it is updating data in the files in place, so if writes are interrupted, data could be corrupted, and it's not clear if there are even checksums in place that would detect this. Maybe this is not an issue for the PR, but it would be good to make clear what types of corruption BlockTreeStore can and can't detect, and what types of corruption it can recover from. If it can do simple things to detect corruption, like adding checksums, or to prevent it, like writing to temporary files and renaming them in place, those could be good to consider. If this PR does introduce some increased risk of corruption, maybe that is worth it for the reasons listed in the description.

I also think another alternative could be to use sqlite for this, since that would not necessarily introduce a new dependency, and we already have ReadKey/WriteKey/EraseKey/HasKey wrapper functions for sqlite that might make it an easy replacement for leveldb. |
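The "write to a temporary file and rename it in place" pattern mentioned above can be sketched as follows. This is a simplified illustration, not the PR's code: on POSIX, rename() within one filesystem is atomic, so readers see either the old or the new contents, never a torn write (a production version would also fsync the file and its directory, which std::filesystem does not expose).

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Atomically replace `target` with `contents`: write everything to a
// sibling temporary file first, then rename it over the target. If the
// process dies mid-write, only the temporary file is left corrupted.
bool AtomicWriteFile(const fs::path& target, const std::string& contents)
{
    fs::path tmp = target;
    tmp += ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out << contents;
        if (!out.flush()) return false;
    }
    std::error_code ec;
    fs::rename(tmp, target, ec); // atomic replacement on POSIX filesystems
    return !ec;
}
```

This trades write amplification (the whole file is rewritten) for crash consistency, which is why it mainly suits small files such as the block index metadata discussed here.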
Yes, that is what I meant. We never change block files, so that is not a problem. I'm also not sure how real this problem actually is. A bunch of databases just maintain one big file and have good performance doing so. I'm still not sure what the benefit of what you propose would be. Either way, I think this is a bit out of scope, since while this change implements a database migration, it does not require a reindex, which a change to the block file format would. Improving pruning behavior is also not the goal here. |
I was hoping to provoke a discussion about this as I alluded to in the PR description - thanks for providing your thoughts on this. I think the proof of work and integrity checks done on loading the index already provide fairly solid guarantees on load, but agree that we should do better. I have also talked to some other people about it offline, and there seems to be some appetite for improving corruption resistance. It is my understanding that the feature for reindexing the block tree was added as a salvaging option, because leveldb does not provide strong anti-corruption guarantees, but has sprawled a bit since. Removing the need to provide code for reindexing the block tree would be a nice simplification of validation code in my eyes. I think adding a checksum for the entries and writing from a log file could be fairly simple to implement and provide strong guarantees. I'm open to suggestions here.
This has also been brought up by some others. While I'm still not sure that we should be introducing a new validation dependency, maybe it would be good to implement it and open the change as an alternative draft / RFC pull request in the meantime? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A full -reindex can be necessary for two reasons:
- corruption in the block tree db
- corruption in the blk files.
In my personal experience of running a node on crappy hardware a long time ago, it was usually 2. that would happen (I knew that because the reindex wouldn't scan all block files but abort with an error somewhere, and switch to IBD from peers). My suspicion is that while 1. may have been the dominant reason in the early years, 2. may be just as important today.
However, if that was the case, changing the block tree db format wouldn't allow us to get rid of -reindex, even if the new format would never corrupt.
But we prune blocks, and they may not all be at the start of the big file.
Even on a spinning disk? That's where I tend to keep my
Compared to the current situation where we bundle a bunch of, but not all, blocks in one file, it just seems simpler to have one file per block. In the "corruption in the blk files" example above it also makes recovery really easy: just load the block files one by one, hash them, redownload if the hash doesn't match. No need to update any index. |
|
re: TheCharlatan #32427 (comment)

Thanks for clarifying the situation with leveldb. I just assumed based on its design that it would support atomic updates pretty robustly, but if it has corruption problems of its own then it doesn't sound like we would lose much by switching to something simpler. I still do think using sqlite could be a nice solution, because data consistency issues can be a significant source of pain for users and developers, and with sqlite we basically just don't need to think about those issues. But I also understand not wanting to require sqlite as a kernel dependency.

Another thing about this PR (as of fabd3ab) is that it doesn't seem to actually remove much blockstorage code, and the BlockTreeDB class remains intact, I guess because migration code depends on it. An idea that might improve this could be to make BlockTreeStore methods pure virtual and have FlatFile and LevelDB subclasses implementing them. This could organize the code more cleanly by letting the FlatFile and LevelDB implementations live side by side outside of blockstorage, instead of one being inside and one being outside. It could also let kernel applications provide alternate backends, and allow things like differential fuzz testing. (This was also the exact approach used to replace bdb with sqlite in the wallet.)

re: Sjors #32427 (comment)

FWIW I also think using individual block files could be great (assuming a sharded directory structure like .git/objects to avoid having too many files per directory). That idea is mostly tangential to this PR though, I think? Possibly I am missing some connections. |
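The pure-virtual interface idea suggested above could look roughly like this. All names here are hypothetical, not the actual Bitcoin Core API; the in-memory backend stands in for the leveldb and flat-file implementations that would each derive from the same interface.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical abstract store: the flat-file and leveldb backends
// would both implement this, enabling side-by-side code organization,
// alternate backends from kernel applications, and differential fuzzing.
class BlockTreeStoreInterface
{
public:
    virtual ~BlockTreeStoreInterface() = default;
    virtual bool Write(const std::string& key, const std::string& value) = 0;
    virtual std::optional<std::string> Read(const std::string& key) const = 0;
};

// Trivial stand-in backend for illustration; real subclasses would be
// e.g. FlatFileBlockTreeStore and LevelDBBlockTreeStore.
class InMemoryBlockTreeStore : public BlockTreeStoreInterface
{
    std::map<std::string, std::string> m_data;

public:
    bool Write(const std::string& key, const std::string& value) override
    {
        m_data[key] = value;
        return true;
    }
    std::optional<std::string> Read(const std::string& key) const override
    {
        const auto it = m_data.find(key);
        if (it == m_data.end()) return std::nullopt;
        return it->second;
    }
};
```

Callers would hold a `std::unique_ptr<BlockTreeStoreInterface>`, so swapping the backend (or fuzzing two backends against each other) needs no changes at the call sites, which mirrors how the bdb-to-sqlite wallet migration was structured.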
That is interesting; I don't think I've ever run into blk file corruption. I agree with you that if that is something we need to be able to salvage from, the reindex logic would have to remain for it. If that is the case, though, shouldn't we also be cleaning up leftover block data? Otherwise it seems like the user could end up with hundreds of GBs of unusable block data. |
|
I didn't have time to review this in detail, nor to form detailed concept/approach feedback, but I ran a few reindexes to see if it affects performance, because somebody was referring to this as an optimization and I wanted to understand if that's indeed the case.

Details:

```shell
COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
STOP_HEIGHT=888888; DBCACHE=4500; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"
```

```
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  Time (abs ≡): 27076.605 s [User: 32171.870 s, System: 1311.182 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
  Time (abs ≡): 27034.197 s [User: 32220.994 s, System: 1286.553 s]

Relative speed comparison
  1.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  1.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=888888 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=4500 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
```

Which indicates there's no measurable speed difference.

Details:

```shell
COMMITS="14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c fabd3ab615a7c718f37a60298a125864edb6106b"; \
STOP_HEIGHT=1; DBCACHE=450; \
CC=gcc; CXX=g++; \
BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
(echo ""; for c in $COMMITS; do git fetch origin $c -q && git log -1 --pretty=format:'%h %s' $c || exit 1; done; echo "") && \
hyperfine \
  --sort 'command' \
  --runs 1 \
  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g; s/-$//' <<< "$COMMITS")-$STOP_HEIGHT-$DBCACHE-$CC.json" \
  --parameter-list COMMIT ${COMMITS// /,} \
  --prepare "killall bitcoind; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard; \
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF && cmake --build build -j$(nproc) --target bitcoind && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -dbcache=5000 -printtoconsole=0; sleep 10" \
  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP_HEIGHT -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=$DBCACHE"
```

```
Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  Time (abs ≡): 7718.677 s [User: 7368.404 s, System: 174.230 s]

Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
  Time (abs ≡): 7683.972 s [User: 7344.276 s, System: 165.120 s]

Relative speed comparison
  1.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = 14b8dfb2bd5e2ca2b7c0c9a7f7d50e1e60adf75c)
  1.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -reindex -blocksonly -connect=0 -printtoconsole=0 -dbcache=450 (COMMIT = fabd3ab615a7c718f37a60298a125864edb6106b)
```

Which also indicates there's no measurable speed difference. |
|
Concept ACK. Neat :)
Yeah, I think this is the heart of it. I'm onboard for a new impl outside of leveldb, but before getting too deep into the implementation itself we need to decide 2 main things:
The impl here with no slicing or atomicity attempts isn't very robust, but that's obviously fine for an RFC. |
|
I think a file structure of As for putting block and undo data in the same file, I'm unsure. Undo data to me feels more like a validation-level thing, while block data is more a storage-level thing. |
Re: One file per block

FWIW the idea of using one file per block gets me going too. :) It is slightly orthogonal, but it would be nice to avoid changing formats twice in short succession. My bikeshed color: since block hashes start with zeroes, maybe one could shard based off the last two bytes. The genesis block ends up in something like:
Using two-level-deep directories of 256 entries at each branch point means we start averaging 1000 files per leaf directory at block height 65'536'000 (a bit earlier due to re-orgs). (File system limitations: https://stackoverflow.com/a/466596). If having the height as the key is more useful than the hash, I prefer #32427 (comment).

Possible argument against

Spinning disks typically perform much better when sequentially accessed data is stored within the same file, so the current approach of multiple blocks per file may be more performant for some types of operations. I don't know if or how frequently we access the contents of blocks sequentially, though. |
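The sharding scheme described above can be sketched as a simple path function. The exact layout below (which hex digits map to which directory level, the `.dat` suffix) is an illustrative assumption, not something specified in the thread; the key property is that the *last* bytes of a block hash are effectively random, so files spread evenly across 256×256 leaf directories, much like `.git/objects`.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Map a hex-encoded block hash to a sharded path such as
// blocks/<last byte>/<second-to-last byte>/<hash>.dat.
// Sharding on the trailing bytes works because block hashes begin with
// zeroes (proof of work) but end in uniformly distributed bytes.
fs::path BlockFilePath(const fs::path& blocks_dir, const std::string& hash_hex)
{
    assert(hash_hex.size() == 64); // 32-byte hash, hex encoded
    const std::string outer = hash_hex.substr(62, 2); // last byte
    const std::string inner = hash_hex.substr(60, 2); // second-to-last byte
    return blocks_dir / outer / inner / (hash_hex + ".dat");
}
```

With this layout the genesis block hash `000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f` would land under `blocks/6f/e2/`.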
Just curious because the FAT32 file limit per-directory is so small, is there any scenario where a miner could DoS nodes with this format by also mining the last two bytes? I think no, because at tip the additional hash needed for two bytes would be prohibitively expensive, although I'm not sure if there is a birthday-problem-like advantage because an attacker doesn't necessarily need to target only one two-byte suffix. And for IBD, headers-first sync would prevent an attack where someone suffix-mines >65,000 blocks from genesis and tries to get nodes to download them. |
Force-pushed 00c5a87 to dba30d4
Force-pushed dba30d4 to a5a8bff
|
Rebased 00c5a87 -> a5a8bff (blocktreestore_6 -> blocktreestore_7, compare)
|
Force-pushed a5a8bff to 04d05a2
Force-pushed 04d05a2 to ef1f961
|
Rebased a5a8bff -> ef1f961 (blocktreestore_7 -> blocktreestore_8, compare) |
The BlockTreeStore introduces a new data format for storing block indexes and headers on disk. The class is very similar to the existing CBlockTreeDB, which stores the same data in a leveldb database. Unlike CBlockTreeDB, the data stored through the BlockTreeStore is directly serialized and written to flat .dat files.

The storage schema introduced is simple. It relies on the assumption that no entry is ever deleted and that no duplicate entries are written. These assumptions hold for the current users of CBlockTreeDB. In order to efficiently update a CBlockIndex entry in the store, a new field is added to the class that tracks its position in the file. New serialization wrappers are added for both the CBlockIndex and CBlockFileInfo classes to avoid serializing integers as VARINT. Using VARINT encoding would make updating these fields impossible, since changing them might overwrite existing entries in the file.

The new store supports atomic writes by using a write-ahead log. Boolean flags are persisted through the (non-)existence of certain files. Data integrity is verified through the use of crc32c checksums on each data entry.

This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.

Also make flags based on file existence, instead of complicated boolean fields. This makes the operations atomic.
This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.
This hooks up the newly introduced BlockTreeStore class to the actual codebase. It also adds a migration function to migrate old leveldb block indexes to the new format on startup. The migration first reads from leveldb (blocks/index), and writes it to a BlockTreeStore in a separate migration directory (blocks/migration). Once done, the original directory (blocks/index) is deleted and the migration directory renamed to the original name. This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.
These are no longer needed after the migration to the new BlockTreeStore. The cache for the block tree db is also no longer needed, so grant what has been freed up to the coins db. This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.
Adds constants for pre-allocating the file size of the header storage file in the BlockTreeStore. The chosen constants leave a bit of extra space beyond the actual requirement. They may be updated on every release, though it is also not a strict requirement to do so. This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.
This is not called by anything anymore, so just remove it. This commit is part of a series to replace the leveldb database currently used for storing block indexes and headers with a flat file storage. This is motivated by the kernel library, where the usage of leveldb is a limiting factor to its future use cases. It also offers better performance and has a smaller on-disk footprint, though this is mostly negligible in the grand scheme of things.
Force-pushed from 4dcb6dc to a2d5f75
Rebased ef1f961 -> a2d5f75 (blocktreestore_8 -> blocktreestore_9, compare)
@sedited The PR title still has RFC on it. This seems to have a lot of C's, so is the status still in RFC or are you convinced on the approach and concept, and this is ready for review? |
I would appreciate some further conceptual review, particularly around whether a different approach, like a different data structure altogether, would be more compelling, and whether the tradeoffs here are worth it. To summarize my thoughts so far:
Strong Concept ACK. Being able to access the block tree from multiple processes would be a significant improvement. Since the scope of rolling our own logic seems limited enough as demonstrated in this PR, I think reducing kernel's dependencies (such as e.g. on sqlite) is a good choice.
A couple of approach questions / thoughts:
- The `blockfiles.dat` file is pretty small. I wonder if it would make sense to skip the whole `m_dirty_fileinfo` business and just rewrite the whole file on every `WriteBlockIndexDB` call, simplifying the code at the cost of increased (but I think manageable) write overhead?
- If the log structure isn't going to change frequently, we could make the logic less generic and e.g. hardcode that there's going to be a single `LAST_BLOCK` at the beginning and a single `HEADER_DATA_END` at the end? Especially in combination with the previous suggestion, I think that might simplify the code quite a bit at the cost of reduced (but perhaps unnecessary) flexibility?
    // Cleanup a potentially previously failed migration
    fs::remove_all(migration_dir);
    LogInfo(" Writing data back to migration directory, reindexing: %b, pruned: %b", reindexing, pruned_block_files);
    auto block_tree_store{std::make_unique<kernel::BlockTreeStore>(migration_dir, m_opts.chainparams, m_opts.wipe_block_tree_data)};
nit: I think passing `m_opts.block_tree_db_params.wipe_data` here is meaningless, `migration_dir` is empty anyway?
    BOOST_CHECK(std::filesystem::exists(in_memory_test_directory.m_directory / "blocks"));
    BOOST_CHECK(!std::filesystem::exists(in_memory_test_directory.m_directory / "blocks" / "index"));
    BOOST_CHECK(std::filesystem::exists(in_memory_test_directory.m_directory / "blocks" / "index"));
The block_tree_db_in_memory parameter should be cleaned up here as well:
git diff on a2d5f75
diff --git a/src/test/kernel/test_kernel.cpp b/src/test/kernel/test_kernel.cpp
index 9f4d569f0f..7df33d79a1 100644
--- a/src/test/kernel/test_kernel.cpp
+++ b/src/test/kernel/test_kernel.cpp
@@ -655,7 +655,6 @@ BOOST_AUTO_TEST_CASE(btck_chainman_tests)
std::unique_ptr<ChainMan> create_chainman(TestDirectory& test_directory,
bool reindex,
bool wipe_chainstate,
- bool block_tree_db_in_memory,
bool chainstate_db_in_memory,
Context& context)
{
@@ -679,7 +678,7 @@ void chainman_reindex_test(TestDirectory& test_directory)
{
auto notifications{std::make_shared<TestKernelNotifications>()};
auto context{create_context(notifications, ChainType::MAINNET)};
- auto chainman{create_chainman(test_directory, true, false, false, false, context)};
+ auto chainman{create_chainman(test_directory, true, false, false, context)};
std::vector<std::string> import_files;
BOOST_CHECK(chainman->ImportBlocks(import_files));
@@ -722,7 +721,7 @@ void chainman_reindex_chainstate_test(TestDirectory& test_directory)
{
auto notifications{std::make_shared<TestKernelNotifications>()};
auto context{create_context(notifications, ChainType::MAINNET)};
- auto chainman{create_chainman(test_directory, false, true, false, false, context)};
+ auto chainman{create_chainman(test_directory, false, true, false, context)};
std::vector<std::string> import_files;
import_files.push_back((test_directory.m_directory / "blocks" / "blk00000.dat").string());
@@ -734,7 +733,7 @@ void chainman_mainnet_validation_test(TestDirectory& test_directory)
auto notifications{std::make_shared<TestKernelNotifications>()};
auto validation_interface{std::make_shared<TestValidationInterface>()};
auto context{create_context(notifications, ChainType::MAINNET, validation_interface)};
- auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+ auto chainman{create_chainman(test_directory, false, false, false, context)};
// mainnet block 1
auto raw_block = hex_string_to_byte_vec("010000006fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d6190000000000982051fd1e4ba744bbbe680e1fee14677ba1a3c3540bf7b1cdb606e857233e0e61bc6649ffff001d01e362990101000000010000000000000000000000000000000000000000000000000000000000000000ffffffff0704ffff001d0104ffffffff0100f2052a0100000043410496b538e853519c726a2c91e61ec11600ae1390813a627c66fb8be7947be63c52da7589379515d4e0a604f8141781e62294721166bf621e73a82cbf2342c858eeac00000000");
@@ -815,7 +814,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_in_memory_tests)
auto notifications{std::make_shared<TestKernelNotifications>()};
auto context{create_context(notifications, ChainType::REGTEST)};
- auto chainman{create_chainman(in_memory_test_directory, false, false, true, true, context)};
+ auto chainman{create_chainman(in_memory_test_directory, false, false, true, context)};
for (auto& raw_block : REGTEST_BLOCK_DATA) {
Block block{hex_string_to_byte_vec(raw_block)};
@@ -844,7 +843,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_regtest_tests)
const size_t mid{REGTEST_BLOCK_DATA.size() / 2};
{
- auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+ auto chainman{create_chainman(test_directory, false, false, false, context)};
for (size_t i{0}; i < mid; i++) {
Block block{hex_string_to_byte_vec(REGTEST_BLOCK_DATA[i])};
bool new_block{false};
@@ -853,7 +852,7 @@ BOOST_AUTO_TEST_CASE(btck_chainman_regtest_tests)
}
}
- auto chainman{create_chainman(test_directory, false, false, false, false, context)};
+ auto chainman{create_chainman(test_directory, false, false, false, context)};
for (size_t i{mid}; i < REGTEST_BLOCK_DATA.size(); i++) {
Block block{hex_string_to_byte_vec(REGTEST_BLOCK_DATA[i])};
    /** Minimum free space (in GB) needed for data directory when pruned; Does not include prune target*/
    uint64_t AssumedChainStateSize() const { return m_assumed_chain_state_size; }
    /** Minimum free space (in MiB) needed for header store **/
    uint64_t AssumedHeaderStoreSize() const { return m_assumed_header_store_size; }
Headers are fixed size, and we should know the number of headers from pre-sync before writing any to disk, so I think we can avoid this hardcoded variable by initializing the file to 0 MiB (or some minimal size) and then exposing a `void BlockTreeStore::AllocateHeaderStore(size_t header_count)` that can be called when pre-sync is finished?
That's a nice suggestion, will see what that works out to.
    BlockfileType BlockfileTypeForHeight(int height);

    std::unique_ptr<kernel::BlockTreeStore> CreateAndMigrateBlockTree();
What's your view on how long we should keep the migration functionality alive? If we want to keep this for ~perpetuity, it might make sense to carve out the migration logic into a separate binary (that gets invoked on node startup), but I think that's probably overkill, right? If we keep it until most users have migrated (say, 5 versions), we could just force the odd remaining user to -reindex on startup?
Related note: I think the BlockTreeDB class docstring should be updated to reflect that this class is now only used for migration purposes, and should not be used for anything else.
I am in favour of making this a separate binary. I think keeping the migration as painless as possible for some time into the future is desirable. If we just have a binary that takes care of that and never changes, that seems like a good way to do it to me.
Opening this PR mostly to get concept/approach feedback.
This is motivated by the kernel library, where the internal usage of leveldb is a limiting factor to its future use cases. Specifically it is not possible to share leveldb databases between two processes. A notable use-case for the kernel library is accessing and analyzing existing block data. Currently this can only be done by first shutting down the node writing this data. Moving away from leveldb opens the door towards doing this in parallel. A flat file based approach was chosen, since the requirements for persistence here are fairly simple (no deletion, constant-size entries). The change also offers better performance by making node startup faster, and has a smaller on-disk footprint, though this is negligible in the grand scheme of things.
The BlockTreeStore introduces a new data format for storing block indexes and headers on disk. The class is very similar to the existing CBlockTreeDB, which stores the same data in a leveldb database. Unlike CBlockTreeDB, the data stored through the BlockTreeStore is directly serialized and written to flat .dat files. The storage schema introduced is simple. It relies on the assumption that no entry is ever deleted and that no duplicate entries are written. These assumptions hold for the current users of CBlockTreeDB.
A write-ahead log and boolean flags used as file-existence checks ensure write atomicity. Every data entry is also given a crc32c checksum to detect data corruption.
The change also opens the door towards getting rid of having to reindex the block tree due to corruption (though switching from pruned -> un-pruned still requires extra logic). This would simplify a lot of kernel/validation code, as can be seen in this commit. The implementation here should be fairly robust against corruption (no parallel read/writes, no background compaction).
An alternative to this pull request, that could allow the same kernel feature, would be closing and opening the leveldb database only when reading and writing. This might incur a (negligible) performance penalty, but more importantly requires careful consideration of how to handle any contentions when opening, which might have complex side effects due to our current locking mode. It would also be possible to introduce an existing database with the required features for just the block tree, but that would introduce reliance on a new dependency and come with its own tradeoffs. For these reasons I chose this approach.