Faster Input Deduplication Algorithm #14387
Conversation
src/bench/duplicate_inputs.cpp (Outdated)
nit: some trailing white space here?
Also, I removed all comments about the linter failure, because they are just distracting from the actual pull request.
Can you post "typical" case benchmark comparisons?
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. Conflicts: Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
@JeremyRubin Impressive speedup! What is the risk-reward trade-off we're facing with this change? More specifically: what risks do you see associated with this change to consensus critical code? Does the change in which DoS error gets reported for transactions which have both duplicates and null inputs have any consequences or impose any risks?
src/consensus/tx_verify.cpp (Outdated)
Why 8?
It's my favorite number!
It's to get the expected work below one for the worst case as shown in the analysis.
We could add a 9th hash (we have more bits in the hash computed) but every hash adds more memory accesses in our table.
We could also remove a hash or two and the EV would be less than 10, which is probably acceptable too. At about 4 or 5 hashes is when it blows up a bit more to an "unacceptable point" (EV 50 to 500 comparisons).
Solve for x such that:
Sum over i = 0 to 24390 of [ i * (i*x / 2**21)**x ] < 1
It's also possible to modify the algorithm such that if a false positive is hit, you do the current set based algorithm up to the conflict. I'm not sure how to analyze that though, and the current code is sufficiently simple with low enough probability of expensive scan that we don't care that much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More precisely, it's one of two choices (9 hashes or 8) with 21 bits with expected work less than 1.
Extracting 9 hashes from three 64-bit integers is a bit more complex code-wise, but doable.
sorted((sum(i*(float(i)*x / 2.0**y)**x for i in xrange(24930)) if (x*y <= 192) else ('Inf', 0, 0), y, x) for y in xrange(1,22) for x in xrange(1,20))[:10]
[(0.10374566662377155, 21, 9), (0.4157347268068221, 21, 8), (1.9074647424172138, 21, 7), (10.226961125517702, 21, 6), (53.11778131137103, 20, 9), (65.8563129753341, 21, 5), (106.42809006254646, 20, 8), (244.15548702940336, 20, 7), (529.481130078109, 21, 4), (654.5255120331329, 20, 6)]
Another option would be to increase the number of hashes to 16 and then use a 20-bit table, requiring a 320-bit hash. This makes the expected work about 7 comparisons in the worst case, but makes the table half as large, which reduces the constant bloat.
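For readers who want to check these numbers, here is a small standalone sketch (my own, not from the PR) that evaluates the same expected-work bound in C++ for a 2^21-slot table, mirroring the Python one-liner above; it should roughly reproduce the 21-bit entries in that list.

```cpp
// Sketch: evaluate the expected-work bound discussed above,
//   sum over i in [0, 24930) of i * (i*k / 2^21)^k,
// for several hash counts k, mirroring the Python one-liner.
#include <cmath>
#include <cstdio>

int main()
{
    const int MAX_INPUTS = 24930;        // worst-case input count used in the Python above
    const double TABLE_SLOTS = 1 << 21;  // 2^21-entry bit table
    for (int k = 4; k <= 9; ++k) {       // number of hash probes per input
        double expected_work = 0.0;
        for (int i = 0; i < MAX_INPUTS; ++i) {
            expected_work += i * std::pow(i * k / TABLE_SLOTS, k);
        }
        std::printf("k=%d: expected comparisons ~ %g\n", k, expected_work);
    }
    return 0;
}
```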
Ok cool. Thanks for the analysis. 8 seems like a pretty reasonable choice. I left a comment above, but was wondering how this bloom filter compares to native unordered set implementations in the stdlib.
Anyone measured
src/consensus/tx_verify.cpp (Outdated)
How does your implementation compare to using std::unordered_set?
See #14397 for a version that just does the obvious thing.
For std::unordered_set, I'm clocking much worse performance for DeserializeAndCheckBlockTest and 2x worse performance for DuplicateInputs.
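For context, a minimal sketch of the two "obvious" variants being compared in this exchange. This is not the code in either PR; the function names are mine, and SaltedOutpointHasher is assumed to be the salted COutPoint hasher from coins.h (util/hasher.h in newer trees).

```cpp
// Hedged sketch (not the PR's code): the shape of the set-based duplicate
// check being compared against here, plus the std::unordered_set variant
// that benchmarked worse per the comment above.
#include <set>
#include <unordered_set>
#include <coins.h>                   // SaltedOutpointHasher
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsSet(const CTransaction& tx)
{
    std::set<COutPoint> seen;
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true; // prevout already spent by this tx
    }
    return false;
}

bool HasDuplicateInputsUnorderedSet(const CTransaction& tx)
{
    std::unordered_set<COutPoint, SaltedOutpointHasher> seen;
    seen.reserve(tx.vin.size()); // one allocation up front; still slower per the numbers above
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true;
    }
    return false;
}
```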
@leishman I tried a few alternatives:
- Master: 13.6 ms
- Faster duplicate input check in CheckTransaction (alternative to #14387) #14397: 6.3 ms
- Using a sorted vector with SipHash'ed prevouts: 3.7 ms
- This PR: 2.7 ms
I like sorting the siphash'd prevouts -- I'm guessing you then do the expensive check if that collides? Or are you tracking the pointers when you sort too?
@JeremyRubin Yeah, just delay the expensive check until after the cheap check fails. I haven't PR'ed that because I only have PoC, and I don't want to overload reviewers with a series of PRs without even knowing if we want to increase complexity here at all.
For the expensive check, you can still do it in O(n) per colliding entry FYI, which is less expensive than doing the full O(n log n) expensive check given that we don't expect colliding entries without a duplicate.
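A hedged sketch of what the "sorted vector with SipHash'ed prevouts" approach could look like, assuming fixed salts and a simple fall-back to the exact set-based check on a 64-bit collision; this is not sipa's actual proof of concept, and the fallback here is the full check rather than the per-colliding-entry O(n) variant described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>
#include <hash.h>                    // SipHashUint256Extra (crypto/siphash.h in newer trees)
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsSortedHash(const CTransaction& tx)
{
    // Cheap pass: 64-bit SipHash of each prevout, sorted, scanned for equal neighbours.
    std::vector<uint64_t> hashes;
    hashes.reserve(tx.vin.size());
    for (const auto& txin : tx.vin) {
        hashes.push_back(SipHashUint256Extra(0x1337, 0xcafe, txin.prevout.hash, txin.prevout.n));
    }
    std::sort(hashes.begin(), hashes.end());
    if (std::adjacent_find(hashes.begin(), hashes.end()) == hashes.end()) {
        return false; // no 64-bit collision, so no duplicate prevout is possible
    }
    // Rare path: a hash collision. Fall back to the exact set-based check.
    std::set<COutPoint> seen;
    for (const auto& txin : tx.vin) {
        if (!seen.insert(txin.prevout).second) return true;
    }
    return false;
}
```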
My immediate reaction is that this seems very complex compared to a naive std::set comparison! This also pulls our SipHash implementation into consensus-critical code, which seems like a big price to pay for a performance win. I lean pretty strongly towards concept NACK.

@JeremyRubin - your PR description talks about what this PR does, but not why. This makes block propagation faster, but do we have an understanding of how much these milliseconds matter? Is there a way we can determine whether the increased complexity introduced is a reasonable price to pay for the performance win?

Also, +1 to @MarcoFalke's comment here: #14387 (comment). Can reviewers please not start nitting code before there's been a concept discussion.
Goal was to minimize the performance regression caused by the CVE fix. Understand this is sensitive code for that reason.

This code is also generally theoretically useful for several other contexts because it is O(N). An adapted version (different parameters) could be used to check for duplicate inputs across a large number of txns (e.g., a mempool syncing context).

It's actually not thaaat complicated; it's basically just a bloom filter. The complexity is also mostly in the performance; the correctness is somewhat easy to check.

I don't know if the performance win is worth it. I'll leave that for others to determine. Just putting it out there.
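To make the "basically just a bloom filter" description concrete, here is a minimal sketch of the scheme as described in this thread: a 2^21-bit table, 8 probes per prevout, and an exact scan over earlier inputs whenever the filter reports a possible repeat. It is not the PR's code: the salts are arbitrary, the table is re-zeroed per call, and it re-hashes per probe instead of extracting all probe indices from three 64-bit SipHash outputs as the PR does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>
#include <hash.h>                    // SipHashUint256Extra (crypto/siphash.h in newer trees)
#include <primitives/transaction.h>  // CTransaction, COutPoint

bool HasDuplicateInputsFiltered(const CTransaction& tx)
{
    constexpr uint64_t TABLE_BITS = uint64_t{1} << 21; // table size in bits (21-bit probe indices)
    constexpr int NUM_HASHES = 8;                      // probes per prevout, per the analysis above
    std::vector<uint64_t> table(TABLE_BITS / 64, 0);

    for (size_t i = 0; i < tx.vin.size(); ++i) {
        const COutPoint& out = tx.vin[i].prevout;
        bool maybe_seen = true;
        for (int j = 0; j < NUM_HASHES; ++j) {
            const uint64_t idx = SipHashUint256Extra(0xdead + j, 0xbeef, out.hash, out.n) & (TABLE_BITS - 1);
            if (!((table[idx >> 6] >> (idx & 63)) & 1)) maybe_seen = false;
            table[idx >> 6] |= uint64_t{1} << (idx & 63); // shift count is always < 64
        }
        if (maybe_seen) {
            // Filter hit: a real duplicate or a false positive. This O(i) scan is
            // the "expected work" the analysis above bounds for the worst case.
            for (size_t k = 0; k < i; ++k) {
                if (tx.vin[k].prevout == out) return true;
            }
        }
    }
    return false;
}
```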
@JeremyRubin It is sufficiently complicated to introduce undefined behaviour in consensus critical code without any of the reviewers noticing .-) I'm afraid the code as it is currently formulated will trigger undefined behaviour due to shift exponents being too large.
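For readers unfamiliar with this class of bug, a hypothetical illustration of shift-exponent UB and its usual fix; this is not the PR's actual expression.

```cpp
// Hypothetical illustration only (not the PR's code). Bounds checks on the
// table index are omitted for brevity.
#include <cstdint>

void SetBitWrong(uint64_t* table, uint64_t h)
{
    // The literal 1 is a 32-bit int, so the shift happens at 32-bit width;
    // any shift count >= 32 here is undefined behaviour.
    table[h >> 6] |= 1 << (h % 64);
}

void SetBitRight(uint64_t* table, uint64_t h)
{
    // Widen the operand first and mask the count so it is always in [0, 63].
    table[h >> 6] |= uint64_t{1} << (h & 63);
}
```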
@practicalswift I think I fixed that -- can you confirm? (and also a copy-paste error on which bit was being set :( )
@practicalswift No, I'm afraid the undefined behaviour is still present. Check this code:
I feel like this is too much review work vs the gain. |
Force-pushed from 4ec7597 to 1f24773.
@gmaxwell @sipa please re-review the updated version (should you have time). @practicalswift I think I have eliminated the UB; if not, please let me know.

In this version I have kept the complexity limited in scope to validation.cpp. Performance-wise, this version is actually a bit better in the worst case (DuplicateInputs) compared to using the filter per-transaction, and better in the average case (DeserializeAndCheckBlockTest) compared to master. The simpler put-everything-in-a-vector, sort, then find-duplicates algorithm could be used here too.

The major benefit of this approach (as amended) is that we not only detect duplicate inputs per transaction, but across the entire block at the same time. This guarantees we won't see an in-block double spend in ConnectBlock and CheckTxInputs. This might enable us to parallelize checking that inputs exist (against a cache that tracks at what index an output created in that block was created).
One can't do that without losing the ability to cache the check as part of the wtxid validity caching, as the simpler check could be. I am failing to see the argument for a gain here that even justifies reviewing the change at all. |
@gmaxwell That's not accurate -- by fixing the salt for the hash (which should be safe -- I will consider this more closely), you could store three uint64_t's per input in the wtxid validity cache and then re-insert those on the forward pass of the algorithm. To be clear, this just saves sip-hashing at the expense of memory, but you can just keep a table of precomputed sip-hashes for the inputs in the top of the mempool if it's actually an issue. Once you have the hashes, the check itself is very inexpensive... At the very least, such a cache could be configurable.

By checking (without inserting) that all of the outputs (not txids) created in a transaction are not already in the table as we scan, we ensure that none of the inputs spent are created later in the block. This can also be done with an additional backwards pass using a new table tracking only TXIDs, with no hashing (for tx in txs.reverse(): insert_table(txid); for input in tx.inputs(): check_table(input.coutpoint.hash)). The overall expected extra work for either of these approaches is around 2x, and with the current parameters this is reasonable.

With this check completed, it would be possible to apply all the transactions in a block out of order. Without removing any checks or adding parallelization, this should make much of the code after CheckBlock (e.g., ConnectBlock) less fragile, because we never reach it for a block which has out-of-longchain-order transactions (which would otherwise cause us to have to abort partially applied transactions). I wanted to get just the duplicates checking reviewed and accepted first, and then work on these other projects in the future.

@instagibbs with this current version, it seems to be slightly worse (1.5% median to median) on DeserializeAndCheckBlockTest. I'm unaware if this is a representative sample of blocks or if this tells you anything about the larger performance of a block being validated with respect to things like number of transactions, caches, etc., so I'm reticent to give too much weight to this one in any case. If you have ideas for how we can write a better benchmark to test this for future work, let's chat about it.

DeserializeAndCheckBlockTest, 5, 160, 12.2812, 0.0153002, 0.0154073, 0.0153456

Many thanks to all the reviewers who have spent their time reviewing this PR so far; I appreciate that you took the time to review this contribution to Bitcoin.
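The backwards pass sketched in the parenthetical pseudocode above could look roughly like the following; this sketch uses a plain std::set instead of the filter, the function name is mine, and it is not code from this PR.

```cpp
// Sketch of the backwards pass described above (not code from this PR):
// walking the block in reverse while recording txids lets us flag any input
// that spends an output created later in the block (or in the same tx).
#include <set>
#include <primitives/block.h>  // CBlock, CTransactionRef
#include <uint256.h>

bool HasOutOfOrderSpend(const CBlock& block)
{
    std::set<uint256> later_txids; // txids of this tx and every tx after it in the block
    for (auto it = block.vtx.rbegin(); it != block.vtx.rend(); ++it) {
        later_txids.insert((*it)->GetHash());
        for (const auto& txin : (*it)->vin) {
            // The coinbase's null prevout hash never matches a txid, so no special case.
            if (later_txids.count(txin.prevout.hash)) return true;
        }
    }
    return false;
}
```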
Needs rebase
… case
e4eee7d Add Benchmark to test input de-duplication worst case (Jeremy Rubin)

Pull request description: Because there are now 2 PRs referencing this benchmark commit, we may as well add it independently, as it is worth landing the benchmark even if neither patch is accepted. bitcoin#14397 bitcoin#14387

Tree-SHA512: 4d947323c02297b0d8f5871f9e7cc42488c0e1792a8b10dc174a25f4dd53da8146fd276949a5dbacf4083f0c6a7235cb6f21a8bc35caa499bc2508f8a048b987
@jnewbery nothing in this PR is in the block propagation typical case anymore (not since we finally implemented the full BIP152 HB mode forwarding), so the belief that this speeds up block propagation is largely mistaken.

"Goal was to minimize the performance regression caused by the CVE fix." -- there wasn't one, or at least not an interesting one. The speedup to duplicate checking was later superseded by changes to allow blocks to be propagated without validating anything more than hash consistency.
@gmaxwell see #14837 which supersedes this PR. In my opinion, you are incorrect that CheckBlock is not latency sensitive. Certainly there are a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template). This also has a non-negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.
Closing in favour of #14837.
When it was on the critical path of propagation, the small delays involved were cumulative across the whole network. As a result, even though any one delay was not particularly interesting, the sum total could be. Without that effect, you only get the single one-shot delay. The effect of a one-shot half-millisecond delay on mining is negligible, and effort spent considering that optimization could be better spent on things like getting testblockvalidity out of the critical path -- which would be an order of magnitude larger speedup, likely simpler to review and verify, and would also further moot the newly proposed benefit.
That logic is spurious. Microbenchmarks are microbenchmarks. If it had an impact on reindexing/bootstrap it could be measured. If it did make a big impact there that would be an argument in favor of it, but unless prior profiling was erroneous, that isn't possible.
@JeremyRand Regarding my comment about the UB above. The problem was the following: ... is problematic due to ... ... which can be contrasted to ...
This PR introduces a faster input deduplication algorithm.
In the first commit, I introduce a worst case duplicate input benchmark.
The existing code achieves the following performance.
DuplicateInputs, 5, 10, 0.710982, 0.0140756, 0.0143986, 0.0141852
In the next commit, I introduce a probabilistic checker which is much faster. Its analysis and use are documented extensively in the code comments. It achieves the following performance on the same benchmark.
DuplicateInputs, 5, 10, 0.166576, 0.00329314, 0.0034048, 0.0033221
This is about 4.3X faster.
Future work can further improve this by backgrounding the checks and hashing to another thread.
This does introduce one behavior change in which DoS error gets reported for transactions which have both duplicates and null inputs; the first error found will be the one reported.