Skip to content

Conversation

@skeees
Copy link
Contributor

@skeees skeees commented Apr 18, 2018

Originally this PR was just to add tests around concurrency in block validation - those tests seem to have uncovered another bug in ActivateBestChain - this now fixes that bug and adds tests.

ActivateBestChain (invoked after a new block is validated) proceeds in steps - acquiring and releasing cs_main while incrementally disconnecting and connecting blocks to sync to the most work chain known (FindMostWorkChain()). Every time cs_main is released the result of FindMostWorkChain() can change - but currently that value is cached across acquisitions of cs_main and only refreshed when an invalid chain is explored. It needs to be refreshed every time cs_main is reacquired. The test added in 6094ce7 will occasionally fail without the commit fixing this issue 26bfdba

Original description below

After a bug discovered where UpdatedBlockTip() notifications could be triggered out of order (#12978), these unit tests check certain invariants about these signals.

The scheduler test asserts that a SingleThreadedSchedulerClient processes callbacks fully and sequentially.

The block validation test generates a random chain and calls ProcessNewBlock from multiple threads at random and in parallel. ValidationInterface callbacks verify that the ordering of BlockConnected BlockDisconnected and UpdatedBlockTip events occur as expected.

@skeees skeees changed the title Add unit tests for signals generated by ProcessNewBlock [WIP] Add unit tests for signals generated by ProcessNewBlock Apr 18, 2018
Copy link
Contributor

@jamesob jamesob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concept ACK, just a few nits. Nice test!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's an alphabetic ordering here that may be good to preserve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be required here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be worth parameterizing Block with a bool make_invalid option to avoid the duplication here, but that's your call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted into FinalizeBlock - its probably worth refactoring BlockAssembler into a builder style class - would make unit tests much easier to write ... maybe one day

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, definitely agree.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we don't have columnar limits in the styleguide, but this line's pretty long...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree with you - i originally had this on two lines - and then the linter put it all back on one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Braces needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format would put those in I believe

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't! but i put them in now

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling this ignored might be more straightforward

@fanquake fanquake added the Tests label Apr 18, 2018
@skeees skeees force-pushed the event-tests branch 2 times, most recently from 980f3ec to 8c09051 Compare April 20, 2018 15:23
@skeees skeees changed the title [WIP] Add unit tests for signals generated by ProcessNewBlock Always refresh most work chain when reacquiring cs_main in ActivateBestChain() Apr 20, 2018
@skeees
Copy link
Contributor Author

skeees commented Apr 20, 2018

Updated to address reviewer comments and fix a bug that this test seems to have uncovered

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utACK 6094ce7. Fun tests!

if (pindexMostWork == nullptr) {
pindexMostWork = FindMostWorkChain();
}
pindexMostWork = FindMostWorkChain();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to add a comment here saying why it's important to recompute pindexMostWork each loop iteration (how sync could get stuck otherwise).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might cause substantial slow-down during IBD or reindex, because currently FindMostWorkChain walks from the candidate tip to the fork point on the current chain to ensure none of the blocks are invalid. We may be able to optimize that, but we also might want to consider another solution.

pubKey << i++ << OP_TRUE;

auto ptemplate = BlockAssembler(Params()).CreateNewBlock(pubKey, false);
CBlock* pblock = new CBlock(ptemplate->block);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd think this could be written more simply as:

auto block = make_shared<CBlock>(ptemplate->block);
block->hashPrevBlock = ...
return block;

It looks like pblock is leaked currently, or is this not the case?

@sdaftuar
Copy link
Member

Thanks for opening this PR. I thought it would be helpful for other reviewers to clarify the (IMO significant) bug here, so that we can properly evaluate fixes.

If ActivateBestChain is invoked simultaneously in separate threads, then we can end up at a lower-work tip, and remain stuck until the next block comes in.

  • Suppose in thread 1, we have just been delivered a block that causes us to try to activate two blocks in ABC. We connect 1 of them in ABCStep, and then release cs_main before connecting the second.
  • In thread 2, suppose we have been delivered a 3rd block (say via rpc) that builds on the first two. It invokes ABC and gets to run after the first block has been connected in thread 1. It connects one block, releases cs_main, and then connects one more, and finishes.
  • When thread 1 gets to run again, the most work chain has advanced, but (before this PR) we don't refresh pindexMostWork (except when we find an invalid block). Consequently we would invoke ABCStep with a less work tip than our current tip(!). This would cause us to disconnect our tip and return.

Some travis failures have been observed due to this bug, as seen for instance here: https://travis-ci.org/bitcoin/bitcoin/jobs/370848272. The test that sometimes fails, rpc_deprecated.py, generates blocks on two nodes roughly simultaneously, so one of the nodes is generating blocks in an rpc thread while also processing blocks on the network thread, which I believe is enough to trigger this bug.

Copy link
Member

@sdaftuar sdaftuar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work finding this issue and good start on the new test. I haven't finished my review yet, but here's my initial feedback.

if (pindexMostWork == nullptr) {
pindexMostWork = FindMostWorkChain();
}
pindexMostWork = FindMostWorkChain();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might cause substantial slow-down during IBD or reindex, because currently FindMostWorkChain walks from the candidate tip to the fork point on the current chain to ensure none of the blocks are invalid. We may be able to optimize that, but we also might want to consider another solution.

auto pblock = Block(prev_hash);

// a second coinbase will make this block invalid
pblock->vtx.push_back(pblock->vtx[0]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method of constructing a bad block is not ideal, because this kind of invalidity is detected well before ConnectBlock. In order to get better code coverage, I think it would be better to generate different kinds of invalid blocks, to test failure at different points in the validation process.

(As it is, I believe this method of making an invalid block is detectable in CheckBlock (merkle-tree fails validation due to duplicate transactions). It would also fail in CheckBlock for having more than one coinbase, and it would fail in ContextualCheckBlock for having an invalid witness commitment.)

@sdaftuar
Copy link
Member

sdaftuar commented Apr 27, 2018

I have a branch where I changed the test you wrote to produce invalid blocks that would be detected in ConnectBlock, and where I have an alternate fix for the bug you found here, as well as a fix for the bug I mentioned in #13092: https://github.com/sdaftuar/bitcoin/commits/2018-04-alternate-abc-fix.

For fixing the bug here, rather than invoke FindMostWorkChain on every loop iteration (which I expect to be very slow during reindex), I instead added a new test: if the tip has changed since the last loop iteration, or if pindexMostWork is no longer in setBlockIndexCandidates, then update pindexMostWork. (I believe that only the first criteria is actually necessary, but included the second as belt-and-suspenders.)

I compared this approach vs master for doing a reindex on my workstation, and saw < 3% slowdown (some slowdown was expected due to checking setBlockIndexCandidates each time).

@skeees
Copy link
Contributor Author

skeees commented Apr 27, 2018

Nice - thanks for this. I wonder if it might be a bit more readable to explore feasibility of caching results internally in findmostworkchain instead of doing it in the activatebestchain loop - the control flow there is getting harder and harder to follow. I'm happy to explore that.

@sdaftuar
Copy link
Member

sdaftuar commented Apr 27, 2018

@TheBlueMatt and I discussed that idea offline (of rewriting FindMostWorkChain to use g_failed_blocks, our internal data structure for caching invalid blocks, to speed it up) but our initial reaction was that the review overhead to ensure correctness might be very high. I'm open to other ideas though (and if you think reworking FMWC is practical please feel free to take a shot at it).

@laanwj laanwj added this to the 0.16.1 milestone May 3, 2018
@skeees
Copy link
Contributor Author

skeees commented May 4, 2018

@sdaftuar - pulled in your commits.

Its still hard for me to convince myself that the set of conditions you've defined are the only cases in which pindexMostWork needs to be refreshed. Maybe I'm just not familiar enough with those areas - I'll leave it to the people with more experience in that area to review.

Just wanted to throw out one alternate way to do determine when to refresh most work tip that's (imo) a bit easier to convince myself behaves correctly. Instead of caching pindexOldTip and using that as well as a couple of other things to determine whether to refresh - how about giving setBlockIndexCandidates a sequence number that gets incremented every time it gets updated. Then all ABC has to do is record the last tip candidate set version it saw (instead of the old tip) and use that to determine whether a refresh may be necessary. You could do this in ABC or even internally in FindMostWorkChain.

@maflcko
Copy link
Member

maflcko commented May 4, 2018

Copy of travis output:

Running tests: validation_block_tests from test/validation_block_tests.cpp
Running 1 test case...
Test cases order is shuffled using seed: 1374948318
Entering test module "Bitcoin Test Suite"
test/validation_block_tests.cpp(24): Entering test suite "validation_block_tests"
test/validation_block_tests.cpp(125): Entering test case "processnewblock_signals_ordering"
test/validation_block_tests.cpp(160): info:  has passed
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: check 
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
test/validation_block_tests.cpp(160): info: check 
test/validation_block_tests.cpp(160): info:  has passed
test/validation_block_tests.cpp(160): info: ProcessNewBlock(Params(), block, true, &ignored)
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_S_create
make[3]: *** [test/validation_block_tests.cpp.test] Error 1

@sdaftuar
Copy link
Member

sdaftuar commented May 4, 2018

From some quick web searching, it seems that BOOST_CHECK may not be thread safe, so invoking it in each thread is likely the problem here. Perhaps we should just assert() on the return value from ProcessNewBlock() (at line 160).

@sdaftuar
Copy link
Member

sdaftuar commented May 4, 2018

I do occasionally see this failure when I run validation_block_tests in a tight loop (after fixing the BOOST_CHECK issue above):

test/validation_block_tests.cpp(38): error in "processnewblock_signals_ordering": check m_expected_tip == block->hashPrevBlock failed [0f9188f13cb7b2c71f2a335e3a4fc328bf5beb436012afca590b1a11466e2206 != 0000000000000000000000000000000000000000000000000000000000000000]
Segmentation fault (core dumped)

Edit: perhaps this is also a threading issue though?

@skeees
Copy link
Contributor Author

skeees commented May 4, 2018

Hmmm thats the genesis block which connects through a different path I believe

@skeees
Copy link
Contributor Author

skeees commented May 4, 2018

@sdaftuar - thats the genesis block getting activated. Feels like it may be more of a test artefact (wasn't there a recent change that randomizes test case execution order) than a threading issue? I think solution for that is to just call PNB(genesis) at the very beginning.

@fanquake
Copy link
Member

Backported in #13317

HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request Jun 29, 2018
Technically, some internal datastructures may be in an inconsistent
state if we do this, though there are no known bugs there. Still,
for future safety, its much better to only unlock cs_main if we've
made progress (not just tried a reorg which may make progress).

Github-Pull: bitcoin#13023
Rebased-From: ecc3c4a
HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request Jun 29, 2018
If multiple threads are invoking ActivateBestChain, it was possible to have
them working towards different tips, and we could arrive at a less work tip
than we should.  Fix this by introducing a ChainState lock which must
be held for the entire duration of ActivateBestChain to enforce
exclusion in ABC.

Github-Pull: bitcoin#13023
Rebased-From: a3ae8e6
HashUnlimited pushed a commit to chaincoin/chaincoin that referenced this pull request Jun 29, 2018
After a recent bug discovered in callback ordering in MainSignals,
this test checks invariants in ordering of
BlockConnected / BlockDisconnected / UpdatedChainTip signals

Github-Pull: bitcoin#13023
Rebased-From: dd435ad
maflcko pushed a commit that referenced this pull request Aug 1, 2018
…nt the memory model

cbeaa91 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen)
b296b42 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen)
9994d01 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen)

Pull request description:

  As discussed in #13023 I've split this test out into a separate pr

  This test (and documentation update) makes explicit the guarantee (previously undefined, but implied by the 'SingleThreaded' in `SingleThreadedSchedulerClient()`) - that callbacks pushed to the `SingleThreadedSchedulerClient()` obey the single threaded model for memory and execution - specifically, the callbacks are executed fully and in order, and even in cases where a subsequent callback is executed by a different thread, sequential consistency of memory for all threads executing these callbacks is maintained.

  Maintaining memory consistency should make the api more developer friendly - especially for users of the validationinterface. To the extent that there are performance implications from this decision, these are not currently present in practice because all use of this scheduler happens on a single thread currently, furthermore the lock should guarantee consistency across callback executions even when callbacks are executed by multiple threads (as the test does).

Tree-SHA512: 5d95a7682c402e5ad76b05bc9dfbca99ca64105f62ab9e78f6fc0f6ea8c5277aa399fbb94298e35cc677b0c2181ff17259584bb7ae230e38aa68b85ecbc22856
PastaPastaPasta pushed a commit to PastaPastaPasta/dash that referenced this pull request Jul 17, 2020
… document the memory model

cbeaa91 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen)
b296b42 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen)
9994d01 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen)

Pull request description:

  As discussed in bitcoin#13023 I've split this test out into a separate pr

  This test (and documentation update) makes explicit the guarantee (previously undefined, but implied by the 'SingleThreaded' in `SingleThreadedSchedulerClient()`) - that callbacks pushed to the `SingleThreadedSchedulerClient()` obey the single threaded model for memory and execution - specifically, the callbacks are executed fully and in order, and even in cases where a subsequent callback is executed by a different thread, sequential consistency of memory for all threads executing these callbacks is maintained.

  Maintaining memory consistency should make the api more developer friendly - especially for users of the validationinterface. To the extent that there are performance implications from this decision, these are not currently present in practice because all use of this scheduler happens on a single thread currently, furthermore the lock should guarantee consistency across callback executions even when callbacks are executed by multiple threads (as the test does).

Tree-SHA512: 5d95a7682c402e5ad76b05bc9dfbca99ca64105f62ab9e78f6fc0f6ea8c5277aa399fbb94298e35cc677b0c2181ff17259584bb7ae230e38aa68b85ecbc22856
random-zebra added a commit to PIVX-Project/PIVX that referenced this pull request Apr 4, 2021
…tChain + optimization

50dbec5 Add unit tests for signals generated by ProcessNewBlock() (furszy)
f68251d Validation: rename one of the two instances using "bad-prevblk" to its correct description of "prevblk-not-found" (furszy)
a51a755 Fix concurrency-related bugs in ActivateBestChain (Jesse Cohen)
198f435 Do not unlock cs_main in ABC unless we've actually made progress. (Matt Corallo)
8d15cf5 Optimize ActivateBestChain for long chains (Pieter Wuille)
8640be1 Fix fast-shutdown crash if genesis block was not loaded (Matt Corallo)
ef24337 Hold cs_main while calling UpdatedBlockTip() and ui.NotifyBlockTip (Jesse Cohen)
f9d2ab3 Update ValidationInterface() documentation to explicitly specify threading and memory model (Jesse Cohen)
cb9bb25 Update documentation for SingleThreadedSchedulerClient() to specify the memory model (Jesse Cohen)
4ea2048 Add Unit Test for SingleThreadedSchedulerClient (Jesse Cohen)

Pull request description:

  Made some back ports and adaptations to validate further the work introduced in #2203 and #2209.
  Now that the scheduler thread is processing most of the chain events, the callbacks execution order must be verified to dispatch the events serially, ensuring a consistent memory state in each event processing invocation.
  There are some concurrency issues solved, as well a an optimization for larger chains for the ABC workflow. Each commit details them well.

  List:
  bitcoin#7917 (only fb8fad1)
  bitcoin#12988
  bitcoin#13023
  bitcoin#13247
  bitcoin#13835

  This is built on top of #2284 (and the reason that made me notice that problem). So, 2284 comes first, then this one.

ACKs for top commit:
  Fuzzbawls:
    ACK 50dbec5
  random-zebra:
    ACK 50dbec5 and merging...

Tree-SHA512: 573144cc96d2caa49ed2f0887dc8c53b5aca0efd54b6a25626063ccb780da426f56b3c6b7491648e7abefc1960c82b1354a4ff2c51425bfe99adaa4394562608
LarryRuane pushed a commit to LarryRuane/zcash that referenced this pull request Apr 26, 2021
Technically, some internal datastructures may be in an inconsistent
state if we do this, though there are no known bugs there. Still,
for future safety, its much better to only unlock cs_main if we've
made progress (not just tried a reorg which may make progress).

zcash: cherry picked from commit ecc3c4a
zcash: bitcoin/bitcoin#13023
LarryRuane pushed a commit to LarryRuane/zcash that referenced this pull request Apr 26, 2021
If multiple threads are invoking ActivateBestChain, it was possible to have
them working towards different tips, and we could arrive at a less work tip
than we should.  Fix this by introducing a ChainState lock which must
be held for the entire duration of ActivateBestChain to enforce
exclusion in ABC.

zcash: cherry picked from commit a3ae8e6
zcash: bitcoin/bitcoin#13023
LarryRuane pushed a commit to LarryRuane/zcash that referenced this pull request May 27, 2021
Technically, some internal datastructures may be in an inconsistent
state if we do this, though there are no known bugs there. Still,
for future safety, its much better to only unlock cs_main if we've
made progress (not just tried a reorg which may make progress).

zcash: cherry picked from commit ecc3c4a
zcash: bitcoin/bitcoin#13023
LarryRuane pushed a commit to LarryRuane/zcash that referenced this pull request May 27, 2021
If multiple threads are invoking ActivateBestChain, it was possible to have
them working towards different tips, and we could arrive at a less work tip
than we should.  Fix this by introducing a ChainState lock which must
be held for the entire duration of ActivateBestChain to enforce
exclusion in ABC.

zcash: cherry picked from commit a3ae8e6
zcash: bitcoin/bitcoin#13023
UdjinM6 pushed a commit to UdjinM6/dash that referenced this pull request Jun 5, 2021
dd435ad Add unit tests for signals generated by ProcessNewBlock() (Jesse Cohen)
a3ae8e6 Fix concurrency-related bugs in ActivateBestChain (Jesse Cohen)
ecc3c4a Do not unlock cs_main in ABC unless we've actually made progress. (Matt Corallo)

Pull request description:

  Originally this PR was just to add tests around concurrency in block validation - those tests seem to have uncovered another bug in ActivateBestChain - this now fixes that bug and adds tests.

  ActivateBestChain (invoked after a new block is validated) proceeds in steps - acquiring and releasing cs_main while incrementally disconnecting and connecting blocks to sync to the most work chain known (FindMostWorkChain()). Every time cs_main is released the result of FindMostWorkChain() can change - but currently that value is cached across acquisitions of cs_main and only refreshed when an invalid chain is explored. It needs to be refreshed every time cs_main is reacquired. The test added in bitcoin@6094ce7 will occasionally fail without the commit fixing this issue bitcoin@26bfdba

  Original description below
  --

  After a bug discovered where UpdatedBlockTip() notifications could be triggered out of order (bitcoin#12978), these unit tests check certain invariants about these signals.

  The scheduler test asserts that a SingleThreadedSchedulerClient processes callbacks fully and sequentially.

  The block validation test generates a random chain and calls ProcessNewBlock from multiple threads at random and in parallel. ValidationInterface callbacks verify that the ordering of BlockConnected BlockDisconnected and UpdatedBlockTip events occur as expected.

Tree-SHA512: 4102423a03d2ea28580c7a70add8a6bdb22ef9e33b107c3aadef80d5af02644cdfaae516c44933924717599c81701e0b96fbf9cf38696e9e41372401a5ee1f3c
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants