Faster Input Deduplication Algorithm #14387
Conversation
src/bench/duplicate_inputs.cpp (Outdated)
nit: some trailing white space here?
Also, I removed all comments about the linter failure, because they are just distracting from the actual pull request.
Can you post "typical" case benchmark comparisons?
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Conflicts: Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
@JeremyRubin Impressive speedup! What is the risk-reward trade-off we're facing with this change? More specifically: what risks do you see associated with this change to consensus critical code? Does the change in which DoS error gets reported for transactions which have both duplicates and null inputs have any consequences or impose any risks?
src/consensus/tx_verify.cpp (Outdated)
Why 8?
It's my favorite number!
It's to get the expected work below one for the worst case as shown in the analysis.
We could add a 9th hash (we have more bits in the hash computed) but every hash adds more memory accesses in our table.
We could also remove a hash or two and the EV would be less than 10, which is probably acceptable too. At about 4 or 5 hashes is when it blows up a bit more to an "unacceptable point" (EV 50 to 500 comparisons).
Solve for x such that:
Sum over i = 0 to 24390 of [ i * (i*x / 2**21)**x ] < 1
It's also possible to modify the algorithm such that if a false positive is hit, you do the current set based algorithm up to the conflict. I'm not sure how to analyze that though, and the current code is sufficiently simple with low enough probability of expensive scan that we don't care that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More precisely, it's one of two choices (9 hashes or 8) with 21 bits with expected work less than 1.
Extracting 9 hashes from three 64-bit integers is a bit more complex code-wise, but doable.
sorted((sum(i*(float(i)*x / 2.0**y)**x for i in xrange(24930)) if (x*y <= 192) else ('Inf', 0, 0), y, x) for y in xrange(1,22) for x in xrange(1,20))[:10]
[(0.10374566662377155, 21, 9), (0.4157347268068221, 21, 8), (1.9074647424172138, 21, 7), (10.226961125517702, 21, 6), (53.11778131137103, 20, 9), (65.8563129753341, 21, 5), (106.42809006254646, 20, 8), (244.15548702940336, 20, 7), (529.481130078109, 21, 4), (654.5255120331329, 20, 6)]
Another option would be to increase the number of hashes to 16 and then use a 20-bit table, requiring a 320-bit hash. This makes the expected work about 7 comparisons in the worst case, but makes the table half as large, which reduces the constant bloat.
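For readers who want to check these numbers, here is a small standalone sketch (my own, not from the PR) that evaluates the same expected-work bound in C++ for a 2^21-slot table, mirroring the Python one-liner above; it should roughly reproduce the 21-bit entries in that list.

```cpp
// Sketch: evaluate the expected-work bound discussed above,
//   sum over i in [0, 24930) of i * (i*k / 2^21)^k,
// for several hash counts k, mirroring the Python one-liner.
#include <cmath>
#include <cstdio>

int main()
{
    const int MAX_INPUTS = 24930;        // worst-case input count used in the Python above
    const double TABLE_SLOTS = 1 << 21;  // 2^21-entry bit table
    for (int k = 4; k <= 9; ++k) {       // number of hash probes per input
        double expected_work = 0.0;
        for (int i = 0; i < MAX_INPUTS; ++i) {
            expected_work += i * std::pow(i * k / TABLE_SLOTS, k);
        }
        std::printf("k=%d: expected comparisons ~ %g\n", k, expected_work);
    }
    return 0;
}
```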
Ok cool. Thanks for the analysis. 8 seems like a pretty reasonable choice. I left a comment above, but was wondering how this bloom filter compares to native unordered set implementations in the stdlib.
Anyone measured
src/consensus/tx_verify.cpp (Outdated)
How does your implementation compare to using std::unordered_set?
See #14397 for a version that just does the obvious thing.
For std::unordered_set, I'm clocking much worse performance for DeserializeAndCheckBlockTest and 2x worse performance for DuplicateInputs.
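For context, a minimal sketch of the two "obvious" variants being compared in this exchange. This is not the code in either PR; the function names are mine, and SaltedOutpointHasher is assumed to be the salted COutPoint hasher from coins.h (util/hasher.h in newer trees).

```cpp
// Hedged sketch (not the PR's code): the shape of the set-based duplicate
// check being compared against here, plus the std::unordered_set variant
// that benchmarked worse per the comment above.
#include <set>
#include <unordered_set>
#include <coins.h>                   // SaltedOutpointHasher
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsSet(const CTransaction& tx)
{
    std::set<COutPoint> seen;
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true; // prevout already spent by this tx
    }
    return false;
}

bool HasDuplicateInputsUnorderedSet(const CTransaction& tx)
{
    std::unordered_set<COutPoint, SaltedOutpointHasher> seen;
    seen.reserve(tx.vin.size()); // one allocation up front; still slower per the numbers above
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true;
    }
    return false;
}
```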
@leishman I tried a few alternatives:
- Master: 13.6 ms
- Faster duplicate input check in CheckTransaction (alternative to #14387) #14397: 6.3 ms
- Using a sorted vector with SipHash'ed prevouts: 3.7 ms
- This PR: 2.7 ms
I like sorting the siphash'd prevouts -- I'm guessing you then do the expensive check if that collides? Or are you tracking the pointers when you sort too?
@JeremyRubin Yeah, just delay the expensive check until after the cheap check fails. I haven't PR'ed that because I only have PoC, and I don't want to overload reviewers with a series of PRs without even knowing if we want to increase complexity here at all.
For the expensive check, you can still do it in O(n) per colliding entry FYI, which is less expensive than doing the full O(n log n) expensive check given that we don't expect colliding entries without a duplicate.
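A hedged sketch of what the "sorted vector with SipHash'ed prevouts" approach could look like, assuming fixed salts and a simple fall-back to the exact set-based check on a 64-bit collision; this is not sipa's actual proof of concept, and the fallback here is the full check rather than the per-colliding-entry O(n) variant described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>
#include <hash.h>                    // SipHashUint256Extra (crypto/siphash.h in newer trees)
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsSortedHash(const CTransaction& tx)
{
    // Cheap pass: 64-bit SipHash of each prevout, sorted, scanned for equal neighbours.
    std::vector<uint64_t> hashes;
    hashes.reserve(tx.vin.size());
    for (const auto& txin : tx.vin) {
        hashes.push_back(SipHashUint256Extra(0x1337, 0xcafe, txin.prevout.hash, txin.prevout.n));
    }
    std::sort(hashes.begin(), hashes.end());
    if (std::adjacent_find(hashes.begin(), hashes.end()) == hashes.end()) {
        return false; // no 64-bit collision, so no duplicate prevout is possible
    }
    // Rare path: a hash collision. Fall back to the exact set-based check.
    std::set<COutPoint> seen;
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true;
    }
    return false;
}
```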
My immediate reaction is that this seems very complex compared to a naive std::set comparison! This also pulls our SipHash implementation into consensus-critical code, which seems like a big price to pay for a performance win. I lean pretty strongly towards concept NACK.

@JeremyRubin - your PR description talks about what this PR does, but not why. This makes block propagation faster, but do we have an understanding of how much these milliseconds matter? Is there a way we can determine whether the increased complexity introduced is a reasonable price to pay for the performance win?

Also, +1 to @MarcoFalke's comment here: #14387 (comment). Can reviewers please not start nitting code before there's been a concept discussion.
Goal was to minimize the performance regression caused by the CVE fix. Understand this is sensitive code for that reason.

This code is also generally theoretically useful for several other contexts because it is O(N). An adapted version (different parameters) could be used to check for duplicate inputs across a large number of txns (e.g., a mempool syncing context).

It's actually not thaaat complicated; it's basically just a bloom filter. The complexity is also mostly in the performance; the correctness is somewhat easy to check.

I don't know if the performance win is worth it. I'll leave that for others to determine. Just putting it out there.
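To make the "basically just a bloom filter" description concrete, here is a minimal sketch of the scheme as described in this thread: a 2^21-bit table, 8 probes per prevout, and an exact scan over earlier inputs whenever the filter reports a possible repeat. It is not the PR's code: the salts are arbitrary, the table is re-zeroed per call, and it re-hashes per probe instead of extracting all probe indices from three 64-bit SipHash outputs as the PR does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <hash.h>                    // SipHashUint256Extra (crypto/siphash.h in newer trees)
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsFiltered(const CTransaction& tx)
{
    constexpr uint64_t TABLE_BITS = uint64_t{1} << 21; // table size in bits (21-bit probe indices)
    constexpr int NUM_HASHES = 8;                      // probes per prevout, per the analysis above
    std::vector<uint64_t> table(TABLE_BITS / 64, 0);

    for (size_t i = 0; i < tx.vin.size(); ++i) {
        const COutPoint& out = tx.vin[i].prevout;
        bool maybe_seen = true;
        for (int j = 0; j < NUM_HASHES; ++j) {
            const uint64_t idx = SipHashUint256Extra(0xdead + j, 0xbeef, out.hash, out.n) & (TABLE_BITS - 1);
            if (!((table[idx >> 6] >> (idx & 63)) & 1)) maybe_seen = false;
            table[idx >> 6] |= uint64_t{1} << (idx & 63); // shift count is always < 64
        }
        if (maybe_seen) {
            // Filter hit: a real duplicate or a false positive. This O(i) scan is
            // the "expected work" the analysis above bounds for the worst case.
            for (size_t k = 0; k < i; ++k) {
                if (tx.vin[k].prevout == out) return true;
            }
        }
    }
    return false;
}
```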
@JeremyRubin It is sufficiently complicated to introduce undefined behaviour in consensus critical code without any of the reviewers noticing .-) I'm afraid the code as it is currently formulated will trigger undefined behaviour due to shift exponents being too large.
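For readers unfamiliar with this class of bug, a hypothetical illustration of shift-exponent UB and its usual fix; this is not the PR's actual expression.

```cpp
// Hypothetical illustration only (not the PR's code). Bounds checks on the
// table index are omitted for brevity.
#include <cstdint>

void SetBitWrong(uint64_t* table, uint64_t h)
{
    // The literal 1 is a 32-bit int, so the shift happens at 32-bit width;
    // any shift count >= 32 here is undefined behaviour.
    table[h >> 6] |= 1 << (h % 64);
}

void SetBitRight(uint64_t* table, uint64_t h)
{
    // Widen the operand first and mask the count so it is always in [0, 63].
    table[h >> 6] |= uint64_t{1} << (h & 63);
}
```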
@practicalswift I think I fixed that -- can you confirm? (and also a copy-paste error on which bit was being set :( )
@practicalswift No, I'm afraid the undefined behaviour is still present. Check this code:
I feel like this is too much review work vs the gain. |
Force-pushed from 4ec7597 to 1f24773.
@gmaxwell @sipa please re-review the updated version (should you have time). @practicalswift I think I have eliminated the UB; if not, please let me know.

In this version I have kept the complexity limited in scope to validation.cpp. Performance-wise, this version is actually a bit better in the worst case (DuplicateInputs) compared to using the filter per-transaction, and better in the average case (DeserializeAndCheckBlockTest) compared to master. The simpler put-everything-in-a-vector, sort, then find-duplicates algorithm could be used here too.

The major benefit of this approach (as amended) is that we not only detect duplicate inputs per transaction, but across the entire block at the same time. This guarantees we won't see an in-block double spend in ConnectBlock and CheckTxInputs. This might enable us to parallelize checking that inputs exist (against a cache that tracks at what index an output created in that block was created).
One can't do that without losing the ability to cache the check as part of the wtxid validity caching, as the simpler check could be. I am failing to see the argument for a gain here that even justifies reviewing the change at all. |
@gmaxwell That's not accurate -- by fixing the salt for the hash (which should be safe -- I will consider this more closely), you could store three uint64_t's per input in the wtxid validity cache and then re-insert those on the forward pass of the algorithm. To be clear, this just saves sip-hashing at the expense of memory, but you can just keep a table of precomputed sip-hashes for the inputs in the top of the mempool if it's actually an issue. Once you have the hashes, the check itself is very inexpensive... At the very least, such a cache could be configurable.

By checking (without inserting) that all of the outputs (not txids) created in a transaction are not already in the table as we scan, we ensure that none of the inputs spent are created later in the block. This can also be done with an additional backwards pass using a new table tracking only TXIDs, with no hashing (for tx in txs.reverse(): insert_table(txid); for input in tx.inputs(): check_table(input.coutpoint.hash)). The overall expected extra work for either of these approaches is around 2x, and with the current parameters this is reasonable.

With this check completed, it would be possible to apply all the transactions in a block out of order. Without removing any checks or adding parallelization, this should make much of the code after CheckBlock (e.g., ConnectBlock) less fragile, because we never reach it for a block which has out-of-longchain-order transactions (which would otherwise cause us to have to abort partially applied transactions). I wanted to get just the duplicates checking reviewed and accepted first, and then work on these other projects in the future.

@instagibbs with this current version, it seems to be slightly worse (1.5% median to median) on DeserializeAndCheckBlockTest. I'm unaware if this is a representative sample of blocks or if this tells you anything about the larger performance of a block being validated with respect to things like number of transactions, caches, etc., so I'm reticent to give too much weight to this one in any case. If you have ideas for how we can write a better benchmark to test this for future work, let's chat about it.

DeserializeAndCheckBlockTest, 5, 160, 12.2812, 0.0153002, 0.0154073, 0.0153456

Many thanks to all the reviewers who have spent their time reviewing this PR so far; I appreciate that you took the time to review this contribution to Bitcoin.
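The backwards pass sketched in the parenthetical pseudocode above could look roughly like the following; this sketch uses a plain std::set instead of the filter, the function name is mine, and it is not code from this PR.

```cpp
// Sketch of the backwards pass described above (not code from this PR):
// walking the block in reverse while recording txids lets us flag any input
// that spends an output created later in the block (or in the same tx).
#include <set>
#include <primitives/block.h>  // CBlock, CTransactionRef
#include <uint256.h>

bool HasOutOfOrderSpend(const CBlock& block)
{
    std::set<uint256> later_txids; // txids of this tx and every tx after it in the block
    for (auto it = block.vtx.rbegin(); it != block.vtx.rend(); ++it) {
        later_txids.insert((*it)->GetHash());
        for (const auto& txin : (*it)->vin) {
            // The coinbase's null prevout hash never matches a txid, so no special case.
            if (later_txids.count(txin.prevout.hash)) return true;
        }
    }
    return false;
}
```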
Needs rebase
… case
e4eee7d Add Benchmark to test input de-duplication worst case (Jeremy Rubin)

Pull request description: Because there are now 2 PRs referencing this benchmark commit, we may as well add it independently, as it is worth landing the benchmark even if neither patch is accepted. bitcoin#14397 bitcoin#14387

Tree-SHA512: 4d947323c02297b0d8f5871f9e7cc42488c0e1792a8b10dc174a25f4dd53da8146fd276949a5dbacf4083f0c6a7235cb6f21a8bc35caa499bc2508f8a048b987
@jnewbery nothing in this PR is in the block propagation typical case anymore (not since we finally implemented the full BIP152 HB mode forwarding), so the belief that this speeds up block propagation is largely mistaken.

"Goal was to minimize the performance regression caused by the CVE fix." -- there wasn't one, or at least not an interesting one. The speedup to duplicate checking was later superseded by changes to allow blocks to be propagated without validating anything more than hash consistency.
@gmaxwell see #14837 which supersedes this PR. In my opinion, you are incorrect that CheckBlock is not latency sensitive. Certainly there are a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template). This also has a non-negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.
Closing in favour of #14837.
When it was on the critical path of propagation, the small delays involved were cumulative across the whole network. As a result, even though any one delay was not particularly interesting, the sum total could be. Without that effect, you only get the single one-shot delay. The effect of a one-shot half-millisecond delay on mining is negligible, and effort spent considering that optimization could be better spent on things like getting testblockvalidity out of the critical path -- which would be an order of magnitude larger speedup, likely simpler to review and verify, and would also further moot the newly proposed benefit.
That logic is spurious. Microbenchmarks are microbenchmarks. If it had an impact on reindexing/bootstrap it could be measured. If it did make a big impact there that would be an argument in favor of it, but unless prior profiling was erroneous, that isn't possible.
@JeremyRand Regarding my comment about the UB above. The problem was the following: ... is problematic due to ... ... which can be contrasted to ...
This PR introduces a faster input deduplication algorithm.
In the first commit, I introduce a worst case duplicate input benchmark.
The existing code achieves the following performance.
DuplicateInputs, 5, 10, 0.710982, 0.0140756, 0.0143986, 0.0141852
In the next commit, I introduce a probabilistic checker which is much faster. Its analysis and use are documented extensively in the code comments. It achieves the following performance on the same benchmark.
DuplicateInputs, 5, 10, 0.166576, 0.00329314, 0.0034048, 0.0033221
This is about 4.3X faster.
Future work can further improve this by backgrounding the checks and hashing to another thread.
This does introduce one behavior change in which DoS error gets reported for transactions which have both duplicates and null inputs; the first error found will be the one reported.