refactor: optimize: avoid allocations in script & policy verification #33645
Conversation
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33645.

Reviews
See the guideline for information on the review process. If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
- Reusing vector capacity across loop iterations makes sense
- No functional or policy logic changes introduced
- Code refactoring (range-based loops, variable renames) improves readability

Here is my bench on macOS M2 Max:
Before:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 134,805.18 | 7,418.11 | 0.7% | 3.30 | AssembleBlock |
| 338.44 | 2,954,694.93 | 0.5% | 3.29 | CCoinsCaching |
After:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 134,100.45 | 7,457.10 | 0.1% | 3.29 | AssembleBlock |
| 286.00 | 3,496,467.59 | 0.3% | 3.29 | CCoinsCaching |
Not so sure about the AssembleBlock bench though. Doesn't it measure block assembly after admission, and therefore not time the policy checks you changed?
Thanks for the review & benchmarks. I'm positive about the fact that the …
I don't really see any speedup for `AssembleBlock`:
Change: gcc=-0.71%, clang=-0.82%

`CCoinsCaching` does show one:
Change: gcc=+18.87%, clang=+11.79%

Benchmark

```sh
for compiler in gcc clang; do \
  if [ "$compiler" = "gcc" ]; then CC=gcc; CXX=g++; COMP_VER=$(gcc -dumpfullversion); \
  else CC=clang; CXX=clang++; COMP_VER=$(clang -dumpversion); fi && \
  echo "> Compiler: $compiler $COMP_VER" && \
  for commit in 64a7c7cbb975cb3c3f25a3f784779f32cd95ebaa aa4c5b81ce686a160a883394f00a38604d81ccdd 7ef44fe6876dcf69f043df82a944e36a8a787b16 5d4b008728e13e923cd9d9315620b486c92b225b; do \
    git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
    rm -rf build && \
    cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=$CC -DCMAKE_CXX_COMPILER=$CXX >/dev/null 2>&1 && \
    cmake --build build -j$(nproc) >/dev/null 2>&1 && \
    for i in 1 2; do \
      build/bin/bench_bitcoin -filter='AssembleBlock|CCoinsCaching' -min-time=5000; \
    done; \
  done; \
done
```

Measurements
64a7c7c Merge #33558: ci: Use native platform for win-cross task
aa4c5b81ce refactor: use range-based iteration
7ef44fe687 refactor: rename script stack
5d4b008728 optimize: reuse containers across iterations
64a7c7c Merge #33558: ci: Use native platform for win-cross task
aa4c5b81ce refactor: use range-based iteration
7ef44fe687 refactor: rename script stack
5d4b008728 optimize: reuse containers across iterations
But the code seems more complicated than before - is there a way to retain the speedup without complicating the code (i.e. minimizing the diff)?
It does call code on the policy path, but only during setup, not during the timed region:

```cpp
bench.run([&] {
    PrepareBlock(test_setup->m_node, options);
});
```

Don't want to be nitpicking here, but just want to point out that if a bench is presented as evidence of a performance claim (faster/slower/same), it should measure the code which was changed. Fine if we use it as a sanity-check, just wanted to mention that.
It is more complicated. I would:

Keep:
- …

Discard to minimize diff (these changes don't give a (relevant) performance gain):
- …

Optional: add a comment/hint to clarify that moving …

This way we minimize the diff, avoiding making the code more complicated than necessary, while taking advantage of the performance gain. Happy to hear your thoughts.
Agreed: it is not in the hot path (or the most likely path).

Agreed for …

Agreed: it is not in the hot path (or the most likely path).
Actually it can reallocate on every capacity exhaustion, i.e. on every growing resize.
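For illustration (a minimal standalone sketch, not code from this PR), counting reallocations by watching `data()` change as a vector grows one element at a time:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<unsigned char> grown;
    std::size_t reallocations{0};
    const unsigned char* last_data{nullptr};
    for (int i = 0; i < 1000; ++i) {
        grown.push_back(0);
        if (grown.data() != last_data) { // the buffer moved: a reallocation happened
            ++reallocations;
            last_data = grown.data();
        }
    }
    // Growth is geometric, so the count is roughly logarithmic in the final size.
    std::cout << "reallocations while growing: " << reallocations << '\n';

    std::vector<unsigned char> reserved;
    reserved.reserve(1000); // one allocation up front; no reallocation below this size
}
```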
force-pushed from 5d4b008 to cedfff1
I've removed some of the less impactful changes, considering @cedwies's suggestions. It now shows a 9% improvement on CCoinsCaching on my end (up from 7.7% previously).
That's not the case, at least not with the latest push:
This was my initial intuition, but it actually goes against your previous point.
You are referring to iteration-to-iteration coupling? Yes, the iterations depend on each other; however, in practice we still allocate less than before, so there should not be a slowdown.

Are you suggesting something like this?

```cpp
witnessprogram.reserve(40); // segwit witness program length <= 40 bytes
```

I would not use …
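The reuse pattern in question, as a minimal standalone sketch (the names `FillProgram` and `program` are placeholders, not code from this PR): hoisting the vector out of the loop lets `clear()` reset the size while keeping the allocated capacity, so later iterations reuse the same buffer.

```cpp
#include <cstdint>
#include <vector>

// Placeholder for any per-item routine that appends bytes into a caller buffer.
static void FillProgram(uint8_t seed, std::vector<uint8_t>& out)
{
    for (int i = 0; i < 32; ++i) out.push_back(static_cast<uint8_t>(seed + i));
}

int main()
{
    std::vector<uint8_t> program; // hoisted: allocated once, reused below
    for (uint8_t seed = 0; seed < 100; ++seed) {
        program.clear();          // size -> 0, capacity retained: no reallocation
        FillProgram(seed, program);
        // Each iteration starts from an empty buffer; only spare capacity is
        // shared between iterations, never contents.
    }
}
```

The last comment is also the answer to the coupling concern: `clear()` guarantees each iteration sees an empty vector, so the dependency between iterations is limited to the reused allocation.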
You're right, I'll use your commits with small tweaks and add you as co-author for now.
force-pushed from cedfff1 to 55574ee
I've found other unnecessary related memory allocations. Instead of using @l0rinc's approach of modifying the already existing methods (adding branch complexity), I opted for defining new methods. While this increases redundancy between similar methods, it produces a smaller diff and cleaner separate paths. Performance has further improved on my end, showing +15% for …
I deliberately avoided duplication; not sure why you think that's simpler.
force-pushed from 8bde358 to e9db221
force-pushed from e9db221 to 6b6dcee
I've noticed that the benchmark code:

```cpp
static void CCoinsCaching(benchmark::Bench& bench)
{
    ECC_Context ecc_context{};
    FillableSigningProvider keystore;
    CCoinsView coinsDummy;
    CCoinsViewCache coins(&coinsDummy);
    std::vector<CMutableTransaction> dummyTransactions =
        SetupDummyInputs(keystore, coins, {11 * COIN, 50 * COIN, 21 * COIN, 22 * COIN});

    CMutableTransaction t1;
    t1.vin.resize(3);
    t1.vin[0].prevout.hash = dummyTransactions[0].GetHash();
    t1.vin[0].prevout.n = 1;
    t1.vin[0].scriptSig << std::vector<unsigned char>(65, 0);
    t1.vin[1].prevout.hash = dummyTransactions[1].GetHash();
    t1.vin[1].prevout.n = 0;
    t1.vin[1].scriptSig << std::vector<unsigned char>(65, 0) << std::vector<unsigned char>(33, 4);
    t1.vin[2].prevout.hash = dummyTransactions[1].GetHash();
    t1.vin[2].prevout.n = 1;
    t1.vin[2].scriptSig << std::vector<unsigned char>(65, 0) << std::vector<unsigned char>(33, 4);
    t1.vout.resize(2);
    t1.vout[0].nValue = 90 * COIN;
    t1.vout[0].scriptPubKey << OP_1;

    // Benchmark.
    const CTransaction tx_1(t1);
    bench.run([&] {
        bool success{AreInputsStandard(tx_1, coins)};
        assert(success);
    });
}
```
force-pushed from 510b0f2 to 2537dba
Please only push after you're actually done; everyone gets a notification for each push, and people will simply unsubscribe from this thread otherwise...
I strongly suggest splitting this draft into smaller, more focused PRs (or separate drafts), each exploring one idea at a time. The current draft mixes several independent changes (policy loop optimizations, Solver/IsWitnessProgram API changes, and stylistic updates), which makes it harder to benchmark and review effectively. I'm happy to follow along and review, but it would be great to have clear, targeted benchmarks for future iterations (ones that measure the actual code paths being changed), so that we can make solid, data-driven decisions about which changes are worth merging. For example, regarding:

I think whether it's on the hot path or not is less important than having data to show a measurable gain. The benchmarks did not show a clear performance benefit. The performance benefit (AFAIK) came from hoisting …
force-pushed from 2537dba to 996acda
As suggested, I've removed some commits to keep this PR simple. I'll open a follow-up PR with the more complex but related commits that change the API. As for the benchmarks, there currently is no benchmark that realistically measures the impacted methods, and I'm afraid I won't be able to code one; we'd need to measure …
```cpp
int witnessversion = 0;
std::vector<unsigned char> witnessprogram;
witnessprogram.clear();
```
Would be better to have IsWitnessProgram() populate a span rather than a vector if we're trying to minimise allocations, as far as I can see.
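A rough sketch of what that could look like (hypothetical `ScriptView` type and signature, not the actual `CScript::IsWitnessProgram` API), following the same matching rules as the existing method:

```cpp
#include <cstddef>
#include <cstdint>
#include <span>

struct ScriptView {
    std::span<const uint8_t> bytes;

    // A witness program is one version opcode followed by a single 2..40-byte push.
    bool IsWitnessProgram(int& version, std::span<const uint8_t>& program) const
    {
        if (bytes.size() < 4 || bytes.size() > 42) return false;
        if (bytes[0] != 0x00 && (bytes[0] < 0x51 || bytes[0] > 0x60)) return false; // OP_0 or OP_1..OP_16
        if (std::size_t{bytes[1]} + 2 != bytes.size()) return false; // push must cover the remainder
        version = bytes[0] == 0x00 ? 0 : bytes[0] - 0x50;
        program = bytes.subspan(2); // a view into the script's own bytes: no allocation, no copy
        return true;
    }
};
```

The trade-off: the returned span must not outlive the script's storage, a lifetime constraint the copy-into-vector API does not have.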
Here the vector is totally useless: it's required by the IsWitnessProgram() API but not actually needed. I've made more significant changes on this private branch, for another PR: https://github.com/Raimo33/bitcoinknots/commits/optimize-tx-policy-verification-2/
What about the other commits, @l0rinc?
There are some interesting zero-cost optimizations. Maybe not worth a PR alone, but they could be included as small refactorings in bigger PRs.
I have removed my nack here, see #33757 (comment)
I'm not sure about the focus for the other commits; they seem like problems that you can solve instead of problems that need solving.
The resulting code isn't obviously cleaner and the dependencies are often more complicated now (reusing vectors, for example). I know that was already in place, but most of our work is untangling code; I'm more enthusiastic about changes that are both cleaner and faster, which I don't think is the case here.
But I have retracted my objection so that others are free to review and opine on it.
Make the `vSolutionsRet` parameter of `Solver()` a nullable pointer, allowing callers that only need the return value to avoid allocating and populating the solutions vector. Benchmark results show a 25.0% performance improvement (4,240,283 to 5,300,923 operations per second) for callers that don't need solutions. This approach addresses the performance concern raised in bitcoin#33645 more fundamentally than the vector-reuse optimization, while simplifying the call sites at the same time. Co-authored-by: Raimo33 <[email protected]>
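A simplified sketch of that nullable-pointer pattern (stand-in enum and matching logic, not the real `Solver()`):

```cpp
#include <vector>

enum class TxoutType { NONSTANDARD, WITNESS_V0_KEYHASH };

// Type-only callers pass nothing and pay no allocation; callers that need the
// extracted data pass a pointer to their own vector.
TxoutType Solver(const std::vector<unsigned char>& script,
                 std::vector<std::vector<unsigned char>>* solutions_ret = nullptr)
{
    // Stand-in for the real pattern matching: OP_0 followed by a 20-byte push.
    if (script.size() == 22 && script[0] == 0x00 && script[1] == 0x14) {
        if (solutions_ret) {
            // Populate only when the caller actually asked for the solutions.
            solutions_ret->emplace_back(script.begin() + 2, script.end());
        }
        return TxoutType::WITNESS_V0_KEYHASH;
    }
    return TxoutType::NONSTANDARD;
}

int main()
{
    std::vector<unsigned char> p2wpkh(22, 0x00);
    p2wpkh[1] = 0x14;
    (void)Solver(p2wpkh);        // classification only: no solutions vector built
    std::vector<std::vector<unsigned char>> sols;
    (void)Solver(p2wpkh, &sols); // same call site shape when the data is needed
}
```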
Currently, some policy- and script-related methods inefficiently allocate/reallocate containers where it is completely unnecessary.
This PR aims to optimize policy verification by reducing redundant heap allocations, without losing performance even in worst-case scenarios, effectively reducing the overall memory footprint.