Fix some concurrency issues in ActivateBestChain() #13023

skeees · 2018-04-18T20:09:34Z

Originally this PR was just to add tests around concurrency in block validation - those tests seem to have uncovered another bug in ActivateBestChain - this now fixes that bug and adds tests.

ActivateBestChain (invoked after a new block is validated) proceeds in steps - acquiring and releasing cs_main while incrementally disconnecting and connecting blocks to sync to the most work chain known (FindMostWorkChain()). Every time cs_main is released the result of FindMostWorkChain() can change - but currently that value is cached across acquisitions of cs_main and only refreshed when an invalid chain is explored. It needs to be refreshed every time cs_main is reacquired. The test added in 6094ce7 will occasionally fail without the commit fixing this issue 26bfdba

Original description below

After a bug discovered where UpdatedBlockTip() notifications could be triggered out of order (#12978), these unit tests check certain invariants about these signals.

The scheduler test asserts that a SingleThreadedSchedulerClient processes callbacks fully and sequentially.

The block validation test generates a random chain and calls ProcessNewBlock from multiple threads at random and in parallel. ValidationInterface callbacks verify that the ordering of BlockConnected BlockDisconnected and UpdatedBlockTip events occur as expected.

jamesob

Concept ACK, just a few nits. Nice test!

jamesob · 2018-04-18T20:11:28Z

src/Makefile.test.include

Looks like there's an alphabetic ordering here that may be good to preserve.

jamesob · 2018-04-18T20:13:05Z

src/test/scheduler_tests.cpp

should be required here

jamesob · 2018-04-18T20:23:58Z

src/test/validation_block_tests.cpp

Could be worth parameterizing Block with a bool make_invalid option to avoid the duplication here, but that's your call.

Extracted into FinalizeBlock - its probably worth refactoring BlockAssembler into a builder style class - would make unit tests much easier to write ... maybe one day

Yep, definitely agree.

jamesob · 2018-04-18T20:24:55Z

src/test/validation_block_tests.cpp

I know we don't have columnar limits in the styleguide, but this line's pretty long...

i agree with you - i originally had this on two lines - and then the linter put it all back on one

jamesob · 2018-04-18T20:25:49Z

src/test/validation_block_tests.cpp

Braces needed.

clang-format would put those in I believe

it doesn't! but i put them in now

Empact · 2018-04-18T21:37:45Z

src/test/validation_block_tests.cpp

Calling this ignored might be more straightforward

skeees · 2018-04-20T15:27:36Z

Updated to address reviewer comments and fix a bug that this test seems to have uncovered

ryanofsky

utACK 6094ce7. Fun tests!

ryanofsky · 2018-04-23T18:07:38Z

src/validation.cpp

-            if (pindexMostWork == nullptr) {
-                pindexMostWork = FindMostWorkChain();
-            }
+            pindexMostWork = FindMostWorkChain();


Might be good to add a comment here saying why it's important to recompute pindexMostWork each loop iteration (how sync could get stuck otherwise).

This might cause substantial slow-down during IBD or reindex, because currently FindMostWorkChain walks from the candidate tip to the fork point on the current chain to ensure none of the blocks are invalid. We may be able to optimize that, but we also might want to consider another solution.

ryanofsky · 2018-04-23T18:13:15Z

src/test/validation_block_tests.cpp

+    pubKey << i++ << OP_TRUE;
+
+    auto ptemplate = BlockAssembler(Params()).CreateNewBlock(pubKey, false);
+    CBlock* pblock = new CBlock(ptemplate->block);


I'd think this could be written more simply as:

auto block = make_shared<CBlock>(ptemplate->block); block->hashPrevBlock = ... return block;

It looks like pblock is leaked currently, or is this not the case?

sdaftuar · 2018-04-26T15:53:13Z

Thanks for opening this PR. I thought it would be helpful for other reviewers to clarify the (IMO significant) bug here, so that we can properly evaluate fixes.

If ActivateBestChain is invoked simultaneously in separate threads, then we can end up at a lower-work tip, and remain stuck until the next block comes in.

Suppose in thread 1, we have just been delivered a block that causes us to try to activate two blocks in ABC. We connect 1 of them in ABCStep, and then release cs_main before connecting the second.
In thread 2, suppose we have been delivered a 3rd block (say via rpc) that builds on the first two. It invokes ABC and gets to run after the first block has been connected in thread 1. It connects one block, releases cs_main, and then connects one more, and finishes.
When thread 1 gets to run again, the most work chain has advanced, but (before this PR) we don't refresh pindexMostWork (except when we find an invalid block). Consequently we would invoke ABCStep with a less work tip than our current tip(!). This would cause us to disconnect our tip and return.

Some travis failures have been observed due to this bug, as seen for instance here: https://travis-ci.org/bitcoin/bitcoin/jobs/370848272. The test that sometimes fails, rpc_deprecated.py, generates blocks on two nodes roughly simultaneously, so one of the nodes is generating blocks in an rpc thread while also processing blocks on the network thread, which I believe is enough to trigger this bug.

sdaftuar

Nice work finding this issue and good start on the new test. I haven't finished my review yet, but here's my initial feedback.

sdaftuar · 2018-04-26T16:13:33Z

src/validation.cpp

-            if (pindexMostWork == nullptr) {
-                pindexMostWork = FindMostWorkChain();
-            }
+            pindexMostWork = FindMostWorkChain();


This might cause substantial slow-down during IBD or reindex, because currently FindMostWorkChain walks from the candidate tip to the fork point on the current chain to ensure none of the blocks are invalid. We may be able to optimize that, but we also might want to consider another solution.

sdaftuar · 2018-04-26T16:28:47Z

src/test/validation_block_tests.cpp

+    auto pblock = Block(prev_hash);
+
+    // a second coinbase will make this block invalid
+    pblock->vtx.push_back(pblock->vtx[0]);


This method of constructing a bad block is not ideal, because this kind of invalidity is detected well before ConnectBlock. In order to get better code coverage, I think it would be better to generate different kinds of invalid blocks, to test failure at different points in the validation process.

(As it is, I believe this method of making an invalid block is detectable in CheckBlock (merkle-tree fails validation due to duplicate transactions). It would also fail in CheckBlock for having more than one coinbase, and it would fail in ContextualCheckBlock for having an invalid witness commitment.)

sdaftuar · 2018-04-27T15:22:24Z

I have a branch where I changed the test you wrote to produce invalid blocks that would be detected in ConnectBlock, and where I have an alternate fix for the bug you found here, as well as a fix for the bug I mentioned in #13092: https://github.com/sdaftuar/bitcoin/commits/2018-04-alternate-abc-fix.

For fixing the bug here, rather than invoke FindMostWorkChain on every loop iteration (which I expect to be very slow during reindex), I instead added a new test: if the tip has changed since the last loop iteration, or if pindexMostWork is no longer in setBlockIndexCandidates, then update pindexMostWork. (I believe that only the first criteria is actually necessary, but included the second as belt-and-suspenders.)

I compared this approach vs master for doing a reindex on my workstation, and saw < 3% slowdown (some slowdown was expected due to checking setBlockIndexCandidates each time).

skeees · 2018-04-27T16:29:15Z

Nice - thanks for this. I wonder if it might be a bit more readable to explore feasibility of caching results internally in findmostworkchain instead of doing it in the activatebestchain loop - the control flow there is getting harder and harder to follow. I'm happy to explore that.

sdaftuar · 2018-04-27T18:07:04Z

@TheBlueMatt and I discussed that idea offline (of rewriting FindMostWorkChain to use g_failed_blocks, our internal data structure for caching invalid blocks, to speed it up) but our initial reaction was that the review overhead to ensure correctness might be very high. I'm open to other ideas though (and if you think reworking FMWC is practical please feel free to take a shot at it).

skeees · 2018-05-04T15:57:25Z

@sdaftuar - pulled in your commits.

Its still hard for me to convince myself that the set of conditions you've defined are the only cases in which pindexMostWork needs to be refreshed. Maybe I'm just not familiar enough with those areas - I'll leave it to the people with more experience in that area to review.

Just wanted to throw out one alternate way to do determine when to refresh most work tip that's (imo) a bit easier to convince myself behaves correctly. Instead of caching pindexOldTip and using that as well as a couple of other things to determine whether to refresh - how about giving setBlockIndexCandidates a sequence number that gets incremented every time it gets updated. Then all ABC has to do is record the last tip candidate set version it saw (instead of the old tip) and use that to determine whether a refresh may be necessary. You could do this in ABC or even internally in FindMostWorkChain.

maflcko · 2018-05-04T18:04:03Z

Copy of travis output:

Running tests: validation_block_tests from test/validation_block_tests.cpp
Running 1 test case...
Test cases order is shuffled using seed: 1374948318
Entering test module "Bitcoin Test Suite"
test/validation_block_tests.cpp(24): Entering test suite "validation_block_tests"
test/validation_block_tests.cpp(125): Entering test case "processnewblock_signals_ordering"
test/validation_block_tests.cpp(160): info:  has passed
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: check 
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: check 
test/validation_block_tests.cpp(160): info:  has passed
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
make[3]: *** [test/validation_block_tests.cpp.test] Error 1

sdaftuar · 2018-05-04T19:11:25Z

From some quick web searching, it seems that BOOST_CHECK may not be thread safe, so invoking it in each thread is likely the problem here. Perhaps we should just assert() on the return value from ProcessNewBlock() (at line 160).

sdaftuar · 2018-05-04T19:42:47Z

I do occasionally see this failure when I run validation_block_tests in a tight loop (after fixing the BOOST_CHECK issue above):

test/validation_block_tests.cpp(38): error in "processnewblock_signals_ordering": check m_expected_tip == block->hashPrevBlock failed [0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206 != 0000000000000000000000000000000000000000000000000000000000000000]
Segmentation fault (core dumped)

Edit: perhaps this is also a threading issue though?

skeees · 2018-05-04T19:53:02Z

Hmmm thats the genesis block which connects through a different path I believe

skeees · 2018-05-04T20:07:54Z

@sdaftuar - thats the genesis block getting activated. Feels like it may be more of a test artefact (wasn't there a recent change that randomizes test case execution order) than a threading issue? I think solution for that is to just call PNB(genesis) at the very beginning.

fanquake · 2018-05-28T09:37:01Z

Backported in #13317

Technically, some internal datastructures may be in an inconsistent state if we do this, though there are no known bugs there. Still, for future safety, its much better to only unlock cs_main if we've made progress (not just tried a reorg which may make progress). Github-Pull: bitcoin#13023 Rebased-From: ecc3c4a

If multiple threads are invoking ActivateBestChain, it was possible to have them working towards different tips, and we could arrive at a less work tip than we should. Fix this by introducing a ChainState lock which must be held for the entire duration of ActivateBestChain to enforce exclusion in ABC. Github-Pull: bitcoin#13023 Rebased-From: a3ae8e6

After a recent bug discovered in callback ordering in MainSignals, this test checks invariants in ordering of BlockConnected / BlockDisconnected / UpdatedChainTip signals Github-Pull: bitcoin#13023 Rebased-From: dd435ad

…nt the memory model cbeaa91 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen) b296b42 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen) 9994d01 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen) Pull request description: As discussed in #13023 I've split this test out into a separate pr This test (and documentation update) makes explicit the guarantee (previously undefined, but implied by the 'SingleThreaded' in `SingleThreadedSchedulerClient()`) - that callbacks pushed to the `SingleThreadedSchedulerClient()` obey the single threaded model for memory and execution - specifically, the callbacks are executed fully and in order, and even in cases where a subsequent callback is executed by a different thread, sequential consistency of memory for all threads executing these callbacks is maintained. Maintaining memory consistency should make the api more developer friendly - especially for users of the validationinterface. To the extent that there are performance implications from this decision, these are not currently present in practice because all use of this scheduler happens on a single thread currently, furthermore the lock should guarantee consistency across callback executions even when callbacks are executed by multiple threads (as the test does). Tree-SHA512: 5d95a7682c402e5ad76b05bc9dfbca99ca64105f62ab9e78f6fc0f6ea8c5277aa399fbb94298e35cc677b0c2181ff17259584bb7ae230e38aa68b85ecbc22856

… document the memory model cbeaa91 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen) b296b42 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen) 9994d01 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen) Pull request description: As discussed in bitcoin#13023 I've split this test out into a separate pr This test (and documentation update) makes explicit the guarantee (previously undefined, but implied by the 'SingleThreaded' in `SingleThreadedSchedulerClient()`) - that callbacks pushed to the `SingleThreadedSchedulerClient()` obey the single threaded model for memory and execution - specifically, the callbacks are executed fully and in order, and even in cases where a subsequent callback is executed by a different thread, sequential consistency of memory for all threads executing these callbacks is maintained. Maintaining memory consistency should make the api more developer friendly - especially for users of the validationinterface. To the extent that there are performance implications from this decision, these are not currently present in practice because all use of this scheduler happens on a single thread currently, furthermore the lock should guarantee consistency across callback executions even when callbacks are executed by multiple threads (as the test does). Tree-SHA512: 5d95a7682c402e5ad76b05bc9dfbca99ca64105f62ab9e78f6fc0f6ea8c5277aa399fbb94298e35cc677b0c2181ff17259584bb7ae230e38aa68b85ecbc22856

…tChain + optimization 50dbec5 Add unit tests for signals generated by ProcessNewBlock() (furszy) f68251d Validation: rename one of the two instances using "bad-prevblk" to its correct description of "prevblk-not-found" (furszy) a51a755 Fix concurrency-related bugs in ActivateBestChain (Jesse Cohen) 198f435 Do not unlock cs_main in ABC unless we've actually made progress. (Matt Corallo) 8d15cf5 Optimize ActivateBestChain for long chains (Pieter Wuille) 8640be1 Fix fast-shutdown crash if genesis block was not loaded (Matt Corallo) ef24337 Hold cs_main while calling UpdatedBlockTip() and ui.NotifyBlockTip (Jesse Cohen) f9d2ab3 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen) cb9bb25 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen) 4ea2048 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen) Pull request description: Made some back ports and adaptations to validate further the work introduced in #2203 and #2209. Now that the scheduler thread is processing most of the chain events, the callbacks execution order must be verified to dispatch the events serially, ensuring a consistent memory state in each event processing invocation. There are some concurrency issues solved, as well a an optimization for larger chains for the ABC workflow. Each commit details them well. List: bitcoin#7917 (only fb8fad1) bitcoin#12988 bitcoin#13023 bitcoin#13247 bitcoin#13835 This is built on top of #2284 (and the reason that made me notice that problem). So, 2284 comes first, then this one. ACKs for top commit: Fuzzbawls: ACK 50dbec5 random-zebra: ACK 50dbec5 and merging... Tree-SHA512: 573144cc96d2caa49ed2f0887dc8c53b5aca0efd54b6a25626063ccb780da426f56b3c6b7491648e7abefc1960c82b1354a4ff2c51425bfe99adaa4394562608

Technically, some internal datastructures may be in an inconsistent state if we do this, though there are no known bugs there. Still, for future safety, its much better to only unlock cs_main if we've made progress (not just tried a reorg which may make progress). zcash: cherry picked from commit ecc3c4a zcash: bitcoin/bitcoin#13023

If multiple threads are invoking ActivateBestChain, it was possible to have them working towards different tips, and we could arrive at a less work tip than we should. Fix this by introducing a ChainState lock which must be held for the entire duration of ActivateBestChain to enforce exclusion in ABC. zcash: cherry picked from commit a3ae8e6 zcash: bitcoin/bitcoin#13023

Technically, some internal datastructures may be in an inconsistent state if we do this, though there are no known bugs there. Still, for future safety, its much better to only unlock cs_main if we've made progress (not just tried a reorg which may make progress). zcash: cherry picked from commit ecc3c4a zcash: bitcoin/bitcoin#13023

If multiple threads are invoking ActivateBestChain, it was possible to have them working towards different tips, and we could arrive at a less work tip than we should. Fix this by introducing a ChainState lock which must be held for the entire duration of ActivateBestChain to enforce exclusion in ABC. zcash: cherry picked from commit a3ae8e6 zcash: bitcoin/bitcoin#13023

dd435ad Add unit tests for signals generated by ProcessNewBlock() (Jesse Cohen) a3ae8e6 Fix concurrency-related bugs in ActivateBestChain (Jesse Cohen) ecc3c4a Do not unlock cs_main in ABC unless we've actually made progress. (Matt Corallo) Pull request description: Originally this PR was just to add tests around concurrency in block validation - those tests seem to have uncovered another bug in ActivateBestChain - this now fixes that bug and adds tests. ActivateBestChain (invoked after a new block is validated) proceeds in steps - acquiring and releasing cs_main while incrementally disconnecting and connecting blocks to sync to the most work chain known (FindMostWorkChain()). Every time cs_main is released the result of FindMostWorkChain() can change - but currently that value is cached across acquisitions of cs_main and only refreshed when an invalid chain is explored. It needs to be refreshed every time cs_main is reacquired. The test added in bitcoin@6094ce7 will occasionally fail without the commit fixing this issue bitcoin@26bfdba Original description below -- After a bug discovered where UpdatedBlockTip() notifications could be triggered out of order (bitcoin#12978), these unit tests check certain invariants about these signals. The scheduler test asserts that a SingleThreadedSchedulerClient processes callbacks fully and sequentially. The block validation test generates a random chain and calls ProcessNewBlock from multiple threads at random and in parallel. ValidationInterface callbacks verify that the ordering of BlockConnected BlockDisconnected and UpdatedBlockTip events occur as expected. Tree-SHA512: 4102423a03d2ea28580c7a70add8a6bdb22ef9e33b107c3aadef80d5af02644cdfaae516c44933924717599c81701e0b96fbf9cf38696e9e41372401a5ee1f3c

skeees changed the title ~~Add unit tests for signals generated by ProcessNewBlock~~ [WIP] Add unit tests for signals generated by ProcessNewBlock Apr 18, 2018

jamesob reviewed Apr 18, 2018

View reviewed changes

Empact reviewed Apr 18, 2018

View reviewed changes

src/test/validation_block_tests.cpp Outdated

Copy link

Contributor

Empact Apr 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this ignored might be more straightforward

fanquake added the Tests label Apr 18, 2018

Empact mentioned this pull request Apr 18, 2018

Validation Interface Logging #12994

Closed

skeees force-pushed the event-tests branch 2 times, most recently from 980f3ec to 8c09051 Compare April 20, 2018 15:23

skeees changed the title ~~[WIP] Add unit tests for signals generated by ProcessNewBlock~~ Always refresh most work chain when reacquiring cs_main in ActivateBestChain() Apr 20, 2018

skeees force-pushed the event-tests branch from 8c09051 to 6094ce7 Compare April 20, 2018 16:16

maflcko added Validation and removed Tests labels Apr 22, 2018

ryanofsky reviewed Apr 23, 2018

View reviewed changes

skeees mentioned this pull request Apr 23, 2018

[net] [validation] Call ProcessNewBlock() asynchronously #12934

Closed

sdaftuar mentioned this pull request Apr 26, 2018

ActivateBestChain concurrency issues #13092

Closed

sdaftuar suggested changes Apr 26, 2018

View reviewed changes

laanwj added this to the 0.16.1 milestone May 3, 2018

laanwj added the Needs backport label May 3, 2018

skeees force-pushed the event-tests branch from 6094ce7 to ab0cbc1 Compare May 4, 2018 15:44

skeees force-pushed the event-tests branch from ab0cbc1 to 0fc3d93 Compare May 4, 2018 19:51

fanquake removed the Needs backport label May 28, 2018

Sjors mentioned this pull request Aug 26, 2018

Track best-possible-headers (TheBlueMatt) #13937

Closed

HashUnlimited mentioned this pull request Sep 12, 2018

Add tests to SingleThreadedSchedulerClient() and document the memory model chaincoin/chaincoin#600

Merged

furszy mentioned this pull request Dec 27, 2019

Tracking purpose of backport btc#13023 PIVX-Project/PIVX#1230

Closed

LarryRuane mentioned this pull request Mar 23, 2021

Bitcoin 0.17 locking PRs zcash/zcash#5034

Open

furszy mentioned this pull request Apr 2, 2021

[Validation][Test] Fix concurrency issues in ActivateBestChain + optimization PIVX-Project/PIVX#2290

Merged

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021

Fix some concurrency issues in ActivateBestChain() #13023

Fix some concurrency issues in ActivateBestChain() #13023

Uh oh!

Conversation

skeees commented Apr 18, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Original description below

Uh oh!

jamesob left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

skeees commented Apr 20, 2018

Uh oh!

ryanofsky left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdaftuar commented Apr 26, 2018

Uh oh!

sdaftuar left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdaftuar commented Apr 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skeees commented Apr 27, 2018

Uh oh!

sdaftuar commented Apr 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skeees commented May 4, 2018

Uh oh!

maflcko commented May 4, 2018

Uh oh!

sdaftuar commented May 4, 2018

Uh oh!

sdaftuar commented May 4, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

skeees commented May 4, 2018

Uh oh!

skeees commented May 4, 2018

Uh oh!

fanquake commented May 28, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

skeees commented Apr 18, 2018 •

edited

Loading

sdaftuar commented Apr 27, 2018 •

edited

Loading

sdaftuar commented Apr 27, 2018 •

edited

Loading

sdaftuar commented May 4, 2018 •

edited

Loading