optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD #31539
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks: for details see https://corecheck.dev/bitcoin/bitcoin/pulls/31539.
Reviews: see the guideline for information on the review process.
Conflicts: no conflicts as of last run.
6853b27 to 4ef1507 (force-pushed)
4ef1507 to 1d99994 (force-pushed)
…k` to reduce serialization calls
223081e scripted-diff: rename block and undo functions for consistency (Lőrinc)
baaa3b2 refactor,blocks: remove costly asserts and modernize affected logs (Lőrinc)
fa39f27 refactor,blocks: deduplicate block's serialized size calculations (Lőrinc)
dfb2f9d refactor,blocks: inline `WriteBlockToDisk` (Lőrinc)
42bc491 refactor,blocks: inline `UndoWriteToDisk` (Lőrinc)
86b85bb bench: add SaveBlockBench (Lőrinc)
34f9a01 refactor,bench: rename bench/readblock.cpp to bench/readwriteblock.cpp (Lőrinc)

Pull request description:

`UndoWriteToDisk` and `WriteBlockToDisk` were delegating a subset of their functionality to single-use methods that didn't optimally capture a meaningful chunk of the algorithm, resulting in calculating things twice (serialized size, header size). This change inlines the awkward methods (asserting that all previous behavior was retained) and, in separate commits, makes the usages less confusing. Besides making the methods slightly more intuitive, the refactorings reduce duplicate calculations as well. The speed difference is insignificant for now (~0.5% for the new `SaveBlockToDiskBench`), but it is a cleanup for follow-ups such as #31539.

ACKs for top commit:
  ryanofsky: Code review ACK 223081e. Since last review, "Save" was renamed to "Write", uint32_t references were dropped, some log statements and comments were improved as suggested, and a lot of tweaks were made to commits and commit messages which should make this easier to review.
  hodlinator: ACK 223081e
  TheCharlatan: ACK 223081e
  andrewtoth: ACK 223081e

Tree-SHA512: 951bc8ad3504c510988afd95c561e3e259c6212bd14f6536fe56e8eb5bf5c35c32a368bbdb1d5aea1acc473d7e5bd9cdcde02008a148b05af1f955e413062d5c
c2fa31f to 0caf94a (force-pushed)
0caf94a to b5eeb50 (force-pushed)
Not sure what the failure means:
...
The unit tests are different from the functional tests. I guess the failure is due to a GHA VM corruption or OOM?
b5eeb50 to 05ff5ee (force-pushed)
I'll push a few more drafts to find out why CI is crashing - bear with me.
05ff5ee to 19543d9 (force-pushed)
The unit tests are not failing. This can be seen in the output ( |
My recommendation would be to measure the asan CI max memory usage before and after this change. Maybe it is not largely different, but just the last drop that tips it into an OOM?
You may be right, though I don't yet see how this memory-saving option keeps crashing after the rebase (it was passing before), while the memory-hogging alternative is passing. I suspected a stuck file or process or an infinite loop...
I don't yet know how to run that on a Mac (and my Linux boxes are running IBDs for now).
c2640e5 to 75e85be (force-pushed)
The Obfuscation (XOR) operations are currently done byte-by-byte during serialization; buffering the reads will enable batching the obfuscation operations later (not yet done here). Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).

Buffered reading was done by first exhausting the buffer, then, if more data is still needed, reading directly into the destination (to avoid copying into the buffer when we have huge destination Spans, such as the many-megabyte vectors in later blocks), and lastly refilling the buffer completely (see the sketch after the benchmark numbers).

Running the `ReadBlockBench` benchmarks with different buffer sizes indicated that 16 KiB is the optimal buffer size (roughly 25% faster than master).

Testing was done by randomly reading data from the same file with `AutoFile` and `BufferedReadOnlyFile` and making sure that the exact same data is read. `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
|        2,288,264.16 |              437.01 |    0.2% |     11.00 | `ReadBlockBench` |

After:

|               ns/op |                op/s |    err% |     total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
|        1,847,259.64 |              541.34 |    0.2% |     11.03 | `ReadBlockBench` |
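To make the read path concrete, here is a minimal, self-contained sketch of the strategy described in this commit message. All names are hypothetical; the PR's actual `BufferedReadOnlyFile` wraps `AutoFile` and also handles the obfuscation, which is omitted here:

```cpp
// Sketch only (hypothetical names, C++20): exhaust the buffer first, let
// oversized remainders bypass the buffer, then refill the buffer completely.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <span>
#include <stdexcept>
#include <vector>

class BufferedReadOnlyFileSketch
{
    std::FILE* m_file;                 // underlying file handle
    std::vector<std::byte> m_buf;      // e.g. 16 KiB, the size that benchmarked best
    size_t m_buf_pos{0}, m_buf_end{0}; // valid buffered bytes are [m_buf_pos, m_buf_end)

public:
    explicit BufferedReadOnlyFileSketch(std::FILE* file, size_t buf_size = 16 << 10)
        : m_file{file}, m_buf(buf_size) {}

    void read(std::span<std::byte> dst)
    {
        // 1) Exhaust whatever is already buffered.
        const size_t from_buf{std::min(dst.size(), m_buf_end - m_buf_pos)};
        std::copy_n(m_buf.data() + m_buf_pos, from_buf, dst.data());
        m_buf_pos += from_buf;
        dst = dst.subspan(from_buf);
        if (dst.empty()) return;

        // 2) Oversized remainders bypass the buffer: read straight into the
        //    destination to avoid an extra copy for multi-megabyte spans.
        if (dst.size() >= m_buf.size()) {
            if (std::fread(dst.data(), 1, dst.size(), m_file) != dst.size()) {
                throw std::runtime_error{"read failed"};
            }
            return;
        }

        // 3) Otherwise refill the buffer completely and serve the rest from it.
        m_buf_pos = 0;
        m_buf_end = std::fread(m_buf.data(), 1, m_buf.size(), m_file);
        if (m_buf_end < dst.size()) throw std::runtime_error{"read failed"};
        std::copy_n(m_buf.data(), dst.size(), dst.data());
        m_buf_pos = dst.size();
    }
};
```

The oversized-remainder branch is what keeps large block reads from being copied twice through the 16 KiB buffer.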
…eBlock`

Similarly to the serialization reads, buffered writes will enable batched xor calculations - especially since currently we need to copy the write input's Span to do the obfuscation on it; batching will enable doing the xor on the internal buffer instead.

All write operations are delegated to `AutoFile`. Xor key offsets are also calculated based on where we are in the underlying file. To avoid the buffered write of 4096 in `AutoFile.write`, we're disabling obfuscation for the underlying `m_dest` and doing it ourselves for the whole buffer (instead of byte-by-byte) before writing it to file. We can't obfuscate the write's src directly since it would mutate the method's input (for very big writes it would avoid copying), but we can fill our own buffer and xor all of that safely (see the sketch after the benchmark numbers).

Running the `SaveBlockBench` benchmarks with different buffer sizes indicated that 1 MiB was the optimal buffer size. `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
|        5,260,934.43 |              190.08 |    1.2% |     11.08 | `SaveBlockBench` |

After:

|               ns/op |                op/s |    err% |     total | benchmark |
|--------------------:|--------------------:|--------:|----------:|:----------|
|        1,804,208.61 |              554.26 |    1.4% |     10.89 | `SaveBlockBench` |
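As an illustration only (hypothetical names; the actual writer delegates to `AutoFile` and takes its key offsets from the real file position), a sketch of the buffer-then-obfuscate-then-write ordering described above:

```cpp
// Sketch only (hypothetical names, C++20): bytes accumulate in a local buffer,
// the XOR obfuscation is applied to that buffer in one pass, and only then is
// the buffer written to the underlying (non-obfuscating) file.
// Callers are expected to call Flush() once at the end.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <span>
#include <stdexcept>
#include <vector>

class BufferedWriterSketch
{
    std::FILE* m_dest;            // underlying file; obfuscation is disabled there
    std::vector<std::byte> m_key; // rolling XOR key (empty means no obfuscation)
    std::vector<std::byte> m_buf; // e.g. 1 MiB, the size that benchmarked best
    size_t m_buf_pos{0};          // bytes currently pending in m_buf
    size_t m_file_offset{0};      // absolute file position, keeps the key aligned

public:
    BufferedWriterSketch(std::FILE* dest, std::vector<std::byte> key, size_t buf_size = 1 << 20)
        : m_dest{dest}, m_key{std::move(key)}, m_buf(buf_size) {}

    void Flush()
    {
        if (m_buf_pos == 0) return;
        // Batched obfuscation on our own buffer: safe because the caller's
        // input span is never mutated.
        for (size_t i{0}; !m_key.empty() && i < m_buf_pos; ++i) {
            m_buf[i] ^= m_key[(m_file_offset + i) % m_key.size()];
        }
        if (std::fwrite(m_buf.data(), 1, m_buf_pos, m_dest) != m_buf_pos) {
            throw std::runtime_error{"write failed"};
        }
        m_file_offset += m_buf_pos;
        m_buf_pos = 0;
    }

    void write(std::span<const std::byte> src)
    {
        while (!src.empty()) {
            const size_t chunk{std::min(src.size(), m_buf.size() - m_buf_pos)};
            std::copy_n(src.data(), chunk, m_buf.data() + m_buf_pos);
            m_buf_pos += chunk;
            src = src.subspan(chunk);
            if (m_buf_pos == m_buf.size()) Flush(); // buffer full: obfuscate and write out
        }
    }
};
```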
75e85be to b1b55a2 (force-pushed)
…and `SaveBlock`" This reverts commit b1b55a2.
It may be sufficient to just copy-paste the cxx flags into a normal build. In the future a cmake preset could be added, see also #31199 (comment).
It looks like the problem was with …
Given that there is an alternative PR to this (reading the whole block into memory instead of smaller chunks, which we're already doing in other places, so it shouldn't be controversial), I'll close this PR. If the memory allocations come up in that, I'll consider reopening.
An alternative implementation of #31551, where the block is read into memory in chunks, resulting in similar gains; it is slightly more complicated, but uses less memory.
This is a precursor to #31144.
Currently, obfuscation operations are performed byte-by-byte during serialization. Buffering the reads allows batching these operations (implemented in #31144) and improves file access efficiency by reducing fread calls and associated locking overhead. Testing with various buffer sizes showed that 16 KiB provides the best performance.
For writes, buffering enables batched obfuscations directly on the internal buffer, avoiding the need to copy input spans for obfuscation. Xor key offsets are calculated based on the file position, and the batched obfuscation is applied before writing to disk.
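As a small illustration of the key-offset point above (hypothetical helper; the PR applies this inside its buffered writer), the batched XOR only reproduces the previous byte-by-byte behaviour if each byte is combined with the key byte selected by the absolute file position:

```cpp
// Hypothetical helper: XOR `data` in place with a rolling `key`, starting at
// the key byte that corresponds to `file_offset`. Obfuscating a whole buffer
// this way yields the same bytes as obfuscating one byte at a time while
// advancing through the file.
#include <cstddef>
#include <span>

void XorWithOffset(std::span<std::byte> data, std::span<const std::byte> key, size_t file_offset)
{
    if (key.empty()) return;
    for (size_t i{0}; i < data.size(); ++i) {
        data[i] ^= key[(file_offset + i) % key.size()];
    }
}
```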
`BufferedFile` was avoided for both reads and writes due to its unrelated operations and potential deprecation.

Two full IBD runs against master for 870k blocks (seeded from real nodes) indicate a 6% total speedup.
Details
Microbenchmarks show a ~23%/290% speedup (the follow-up XOR batching improves this further):

Before: `ReadBlockBench`, `SaveBlockBench`
After "optimization: Buffer serialization reads in `UndoRead` and `ReadBlock`" (23% faster read): `ReadBlockBench`
After "optimization: Buffer serialization writes in `UndoWrite` and `WriteBlock`" (290% faster write): `SaveBlockBench`