Skip to content

Conversation

@l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Dec 19, 2024

An alternative implementation of #31551, where the block is read into memory in chunks, resulting in similar gains, but is slightly more complicated, but uses less memory.


This is a precursor to #31144.

Currently, obfuscation operations are performed byte-by-byte during serialization. Buffering the reads allows batching these operations (implemented in #31144) and improves file access efficiency by reducing fread calls and associated locking overhead. Testing with various buffer sizes showed that 16 KiB provides the best performance.

For writes, buffering enables batched obfuscations directly on the internal buffer, avoiding the need to copy input spans for obfuscation. Xor key offsets are calculated based on the file position, and the batched obfuscation is applied before writing to disk.

BufferedFile was avoided for both reads and writes due to its unrelated operations and potential deprecation.


2 full IBD runs against master for 870k blocks (seeded from real nodes) indicates a 6% total speedup.

Details
hyperfine --runs 2 --export-json /mnt/my_storage/ibd-xor-buffered.json --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,6853b2740851befffa3ca0b24d94212ed1e48d66 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0'

Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
  Time (mean ± σ):     40216.674 s ± 113.132 s    [User: 51496.289 s, System: 3541.340 s]
  Range (min … max):   40136.678 s … 40296.671 s    2 runs

Benchmark 2: COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
  Time (mean ± σ):     37964.015 s ± 624.115 s    [User: 49086.037 s, System: 3375.072 s]
  Range (min … max):   37522.699 s … 38405.331 s    2 runs

Summary
  COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0 ran
    1.06 ± 0.02 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0

Microbenchmarks show a ~23%/290% speedup (the followup XOR batching improves this further):

cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ReadBlockBench|SaveBlockBench' -min-time=10000

Before:

ns/op op/s err% total benchmark
2,288,264.16 437.01 0.2% 11.00 ReadBlockBench
5,260,934.43 190.08 1.2% 11.08 SaveBlockBench

After optimization: Buffer serialization reads in UndoRead and ReadBlock (23% faster read):

ns/op op/s err% total benchmark
1,847,259.64 541.34 0.2% 11.03 ReadBlockBench

After optimization: Buffer serialization writes in UndoWrite and WriteBlock (290% faster write):

ns/op op/s err% total benchmark
1,804,208.61 554.26 1.4% 10.89 SaveBlockBench

@DrahtBot
Copy link
Contributor

DrahtBot commented Dec 19, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31539.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

@l0rinc l0rinc changed the title optimization: buffer reads/writes in [undo]blocks optimization: buffer reads/writes in [undo]block [de]serialization Dec 19, 2024
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from 6853b27 to 4ef1507 Compare December 21, 2024 10:08
@l0rinc l0rinc changed the title optimization: buffer reads/writes in [undo]block [de]serialization optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization Dec 21, 2024
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from 4ef1507 to 1d99994 Compare December 21, 2024 14:53
@l0rinc l0rinc changed the title optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization Dec 21, 2024
@l0rinc l0rinc changed the title optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD Dec 21, 2024
@bitcoin bitcoin deleted a comment from sebm123 Jan 4, 2025
ryanofsky added a commit that referenced this pull request Jan 22, 2025
…k` to reduce serialization calls

223081e scripted-diff: rename block and undo functions for consistency (Lőrinc)
baaa3b2 refactor,blocks: remove costly asserts and modernize affected logs (Lőrinc)
fa39f27 refactor,blocks: deduplicate block's serialized size calculations (Lőrinc)
dfb2f9d refactor,blocks: inline `WriteBlockToDisk` (Lőrinc)
42bc491 refactor,blocks: inline `UndoWriteToDisk` (Lőrinc)
86b85bb bench: add SaveBlockBench (Lőrinc)
34f9a01 refactor,bench: rename bench/readblock.cpp to bench/readwriteblock.cpp (Lőrinc)

Pull request description:

  `UndoWriteToDisk` and `WriteBlockToDisk` were delegating a subset of their functionality to single-use methods that didn't optimally capture a meaningful chunk of the algorithm, resulting in calculating things twice (serialized size, header size).
  This change inlines the awkward methods (asserting that all previous behavior was retained), and in separate commits makes the usages less confusing.
  Besides making the methods slightly more intuitive, the refactorings reduce duplicate calculations as well.

  The speed difference is insignificant for now (~0.5% for the new `SaveBlockToDiskBench`), but are a cleanup for follow-ups such as #31539

ACKs for top commit:
  ryanofsky:
    Code review ACK 223081e. Since last review, "Save" was renamed to "Write", uint32_t references were dropped, some log statements and comments were improved as suggested, and a lot of tweaks made to commits and commit messages which should make this easier to review.
  hodlinator:
    ACK 223081e
  TheCharlatan:
    ACK 223081e
  andrewtoth:
    ACK 223081e

Tree-SHA512: 951bc8ad3504c510988afd95c561e3e259c6212bd14f6536fe56e8eb5bf5c35c32a368bbdb1d5aea1acc473d7e5bd9cdcde02008a148b05af1f955e413062d5c
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch 2 times, most recently from c2fa31f to 0caf94a Compare January 22, 2025 20:15
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from 0caf94a to b5eeb50 Compare January 22, 2025 22:53
@l0rinc l0rinc marked this pull request as ready for review January 22, 2025 22:53
@l0rinc
Copy link
Contributor Author

l0rinc commented Jan 23, 2025

Not sure what the failure means:

100% tests passed, 0 tests failed out of 138

...

Error: Process completed with exit code 143.

@maflcko
Copy link
Member

maflcko commented Jan 23, 2025

Not sure what the failure means:

100% tests passed, 0 tests failed out of 138

...

Error: Process completed with exit code 143.

The unit tests are different from the functional tests. I guess the failure is due to a GHA VM corruption or OOM?

@l0rinc l0rinc marked this pull request as draft January 23, 2025 10:41
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from b5eeb50 to 05ff5ee Compare January 23, 2025 15:48
@l0rinc
Copy link
Contributor Author

l0rinc commented Jan 23, 2025

I'll push a few more drafts to find out why CI is crashing - bear with me.

@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from 05ff5ee to 19543d9 Compare January 23, 2025 17:49
@maflcko
Copy link
Member

maflcko commented Jan 23, 2025

The unit tests are not failing. This can be seen in the output (100% tests passed, 0 tests failed out of 138). So skipping them should have no effect.

@maflcko
Copy link
Member

maflcko commented Jan 23, 2025

My recommendation would be to measure the asan CI max memory usage before and after this change. Maybe it is not largely different, but just the last drip to tip an OOM?

@l0rinc
Copy link
Contributor Author

l0rinc commented Jan 23, 2025

You may be right, though I don't yet see how this memory saving option keeps crashing after the rebase (it was passing before), while the memory hogging alternative is passing.
bitcoin-dev-tools#123 (comment) shows the before/after memory to be the same.

I suspected a stuck file or process or infinite loop...

asan CI max memory usage

I don't yet know how to run that on a Mac (and my linux boxes are running IBDs for now).
I'll push a few more rounds here to try to understand which part is causing the problem - and why only for ASan + LSan + UBSan + integer, if they won't reveal anything, I'll investigate how to replicate the CI behavior locally.

@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from c2640e5 to 75e85be Compare January 23, 2025 20:52
The Obfuscation (XOR) operations are currently done byte-by-byte during serialization, buffering the reads will enable batching the obfuscation operations later (not yet done here).

Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).

Buffered reading was done by first exhausting the buffer, and if more data is still needed, reading directly into the destination (to avoid copying into the buffer when we have huge destination Spans, such as the many megabyte vectors in later blocks), and lastly refilling the buffer completely.

Running the `ReadBlockBench` benchmarks with different buffer sizes indicated that 16 KiB is the optimal buffer size (roughly 25% faster than master).

Testing was done by randomly reading data from the same file with `AutoFile` and `BufferedReadOnlyFile` and making sure that the exact same data is read.

`BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        2,288,264.16 |              437.01 |    0.2% |     11.00 | `ReadBlockBench`

After:
|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        1,847,259.64 |              541.34 |    0.2% |     11.03 | `ReadBlockBench`
…eBlock`

Similarly to the serialization reads, buffered writes will enable batched xor calculations - especially since currently we need to copy the write inputs Span to do the obfuscation on it, batching will enable doing the xor on the internal buffer instead.

All write operations are delegated to `AutoFile`.
Xor key offsets are also calculated based on where we are in the underlying file.
To avoid the buffered write of 4096 in `AutoFile.write`, we're disabling obfuscation for the underlying `m_dest` and doing it ourselves for the whole buffer (instead of byte-by-byte) before writing it to file.
We can't obfuscate the write's src directly since it would mutate the method's input (for very big writes it would avoid copying), but we can fill our own buffer and xor all of that safely.

Running the `SaveBlockBench` benchmarks with different buffer sizes indicated that 1 MiB was the optimal buffer size.

`BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        5,260,934.43 |              190.08 |    1.2% |     11.08 | `SaveBlockBench`

After:
|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        1,804,208.61 |              554.26 |    1.4% |     10.89 | `SaveBlockBench`
@l0rinc l0rinc force-pushed the l0rinc/buffered-block-read-write branch from 75e85be to b1b55a2 Compare January 23, 2025 21:56
@maflcko
Copy link
Member

maflcko commented Jan 24, 2025

I'll investigate how to replicate the CI behavior locally

It may be sufficient to just copy-paste the cxx flags into a normal build. In the future a cmake preset could be added, see also #31199 (comment)

@l0rinc
Copy link
Contributor Author

l0rinc commented Jan 24, 2025

It looks like the problem was with BufferedWriteOnlyFile - if needed, I will investigate further.

Given that there is an alternative PR to this (reading the whole block into memory (instead of smaller chunks) - which we're already doing in other places, so shouldn't be controversial), I'll close this PR. If the memory allocations come up in that, I'll consider reopening.

@l0rinc l0rinc closed this Jan 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants