Granular invalidateblock and RewindBlockIndex #15402

sipa · 2019-02-14T01:27:06Z

This PR makes a number of improvements to the InvalidateBlock (invalidateblock RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition:

They're made safely interruptible (bitcoind can be shutdown, and no progress in either will be lost, though if incomplete, invalidateblock won't continue after restart and will need to be called again)
The validation queue is prevented from overflowing (meaning invalidateblock on a very old block will not drive bitcoind OOM) (see Unbounded growth of scheduler queue #14289).
invalidateblock won't bother to move transactions back into the mempool after 10 blocks (optimization).

This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks.

I have manually tested the invalidateblock changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade).

src/test/txindex_tests.cpp

DrahtBot · 2019-02-14T04:44:19Z

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#15494 (rpc: Return whether the block was invalidated on invalidateblock by promag)
#15305 ([validation] Crash if disconnecting a block fails by sdaftuar)
#14193 (validation: Add missing mempool locks by MarcoFalke)
#13582 (Extract AppInitLoadBlockIndex from AppInitMain by Empact)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

…ing destructor fab6b07 test: txindex: interrupt threadGroup before calling destructor (MarcoFalke) Pull request description: Fixes the data races with the tread sanitizer such as * https://travis-ci.org/MarcoFalke/bitcoin/jobs/492330554 * bitcoin#15402 (comment) * ... Tree-SHA512: 40608c70d92a1dd68efc1d41eecc8e2fb7738508e21f91f0ad353adcceed60fa624f15bf72a5b69a9444157b261183abbe9fc4cc5dd8aebc1c49506b239e8e88

src/validation.cpp

src/rpc/blockchain.cpp

ryanofsky · 2019-02-22T16:57:14Z

src/validation.cpp

In commit "Release cs_main during RewindBlockIndex operation" (8994e394c0b4e76616c10acdd309d0225eabad9e)

Could you add some rationale in this comment saying why this is done here? Previous behavior of erasing in one place instead of two places seems simpler. Is this an optimization? Is it needed for correctness for some reason now that cs_main is released?

I've moved to to a separate step instead of lumping it together with the nHeight calculation loop.

There's no need for it to be done while holding the lock. The reason it's separate is because the disconnect/erase of active blocks should be done simultaneously (so that things get erased even when interrupted, and erasing can't be done before disconnecting), but for non-mainchain blocks we need a separate loop still.

I've added some comments to elaborate, but I'll leave it to you to comment further or marking it as resolved.

Sjors

Tested 3af503b on macOS 10.14.3 by calling invalidateblock 0000000000000000001c8018d9cb3b742ef25114f27563e3fc4a1902167f9893 (SegWit activation at height 481,824) from inside the QT console, without setting -dbcache.

On master memory usage exceeds 10 GB by block 558,000. It also required a force quit to abort.

With this PR memory stays well below 1 GB and I'm able to gracefully exit during this. It then marks the most recent block as invalid, so when you restart it doesn't suddenly start syncing to the tip again. For a user upgrading from pre-segwit, would they have to manually restart the process by calling reconsiderblock [...]?

Rewinding to SegWit activation took 7 hours on a one year old iMac. At this point it's probably faster to just wipe the chain and sync from scratch, but that's not for this PR. QT sync window won't update during this, but I can live with that (something to solve for the next soft fork...).

src/validation.cpp

Sjors · 2019-02-22T15:47:48Z

src/validation.cpp

This note seems out of place now; it used to be above if (IsWitnessEnabled(pindexIter->pprev, params.GetConsensus()) && !(pindexIter->nStatus & BLOCK_OPT_WITNESS) && !chainActive.Contains(pindexIter))

Note that the former 'else' branch in RewindBlockIndex is now dealt with more naturally inside the EraseBlockData call (by checking whether the parent needs to be re-added as candidate after deleting a child).

This lets us simplify the iteration to just walking back in the chain, rather than looping over all of mapBlockIndex.

Sjors

tACK ac64328

Note that QT doesn't update the block count (etc) as this unwinding happens, although it does update the mempool count. I can live with that for now.

src/validation.cpp

maflcko

utACK ac64328b96fbe7475d2b12a9a0c2ba4787267204

I left some feedback in earlier commits, but then it solved itself after reading the later commits. Review felt like reading a story that tells itself. Would review again 10/10

src/validation.cpp

promag · 2019-02-27T12:01:17Z

Tested:

concurrent invalidateblock calls
shutdown during invalidateblock

After these tests I felt that invalidateblock could return whether the invalidation finished or not, or other meaningful data (other PR).

However when testing concurrent invalidateblock and reconsiderblock, got the following assertion on start:

...
2019-02-27T11:53:54Z Checking all blk files are present...
2019-02-27T11:53:54Z Opening LevelDB in /Users/joao/Library/Application Support/Bitcoin/regtest/chainstate
2019-02-27T11:53:54Z Opened LevelDB successfully
2019-02-27T11:53:54Z Using obfuscation key for /Users/joao/Library/Application Support/Bitcoin/regtest/chainstate: 326eeb3c93ca8734
Assertion failed: (!setBlockIndexCandidates.empty()), function PruneBlockIndexCandidates, file validation.cpp, line 2534.

To reproduce apply the patch

--- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -2840,6 +2840,8 @@ bool CChainState::InvalidateBlock(CValidationState& state, const CChainParams& c
         to_mark_failed = invalid_walk_tip;
     }

+    MilliSleep(10000);
+
     {
         // Mark pindex itself as invalid, regardless of whether it was in the main chain or not.
         LOCK(cs_main);

Then:

invalidateblock <old block hash>
during the above sleep call reconsiderblock <previous tip>
restart bitcoind.

promag · 2019-02-27T12:19:10Z

Conceptually, should InvalidateBlock interrupt/error if it disconnects the same tip twice? Or should invalidateblock and reconsiderblock be exclusive? Something to think and probably tackle in a different PR.

promag · 2019-02-28T22:18:27Z

@sipa I did that, sleeps all over the place, and that case sounded like scratching the chain. I'll repeat the tests.

sipa · 2019-02-28T22:21:26Z

@promag An alternative is only sleeping when "disconnected==1" (which would be in the middle for any invalidate that's longer than 1 deep, but not sleep all the time).

It's possible that the resulting effect is a reconsider and invalidate that start fighting with eachother and disconnect/connect the same blocks over and over again. It's not too hard to detect this scenario inside invalidateblock and bail out, but I'd rather not overload the change here further with such edge cases.

Sjors

utACK 8d22041. Several improvements in 9bb32eb Release cs_main during InvalidateBlock iterations since my previous tACK. It's more readable, e.g thanks to the additional variable to_mark_failed.

In the event this PR needs further changes (hopefully not) consider splitting that commit; moving LOCK(cs_main); into the loop and adding if (ShutdownRequested()) break; vs. everything else.

I would also prefer dealing with edge cases with concurrent RPC calls in a later PR. We can point out in the release note that it's currently not 100% safe to do that.

Sjors · 2019-03-01T11:46:51Z

src/validation.cpp

-    setBlockIndexCandidates.erase(pindex);
-    m_failed_blocks.insert(pindex);
+    {
+        // Mark pindex (or the last disconnected block) as invalid, regardless of whether it was in the main chain or not.


Nit, move this comment below the interference check, replace "something is interfering" with "some other thread may have interfered before we reestablished the cs_main lock".

src/validation.cpp

gmaxwell · 2019-03-02T14:28:43Z

ACK

promag · 2019-03-03T16:36:43Z

src/validation.cpp

+        LOCK(cs_main);
+        if (chainActive.Contains(to_mark_failed)) {
+            // If the to-be-marked invalid block is in the active chain, something is interfering and we can't proceed.
+            return false;


Just noting that return value is not used.

src/validation.cpp

sipa · 2019-03-03T21:13:21Z

I added an extra commit that solves the situation where if you used invalidateblock, had a cache flush before it completed, then had your node crash, and then restarted with -checkblockindex would assert with a state corruption.

I've now tested a few more invalidateblock scenarios (with -checkblockindex and -checkmempool enabled), including interrupting it in the middle, killing/crashing it, and running reconsiderblock in parallel. They all seem to work fine. The RewindBlockIndex changes are untested still.

gmaxwell · 2019-03-04T17:51:08Z

re-ACK

sipa · 2019-03-06T21:57:37Z

I think this is ready.

519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: Same repo and branch like "Granular invalidateblock and RewindBlockIndex #15402 ", can be merged as is. This saves us all the cherry-picks and review of the backport cherry-picks. Tree-SHA512: 20c27c5f807c3d85e0072f9e2cdefad4ad7d329d6b26658a00844d5fcf0ed729059daf765e04e6382db2b5915117b15949cd4989d864917ab105c92e2e5e9986

519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: This PR makes a number of improvements to the InvalidateBlock (`invalidateblock` RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition: * They're made safely interruptible (`bitcoind` can be shutdown, and no progress in either will be lost, though if incomplete, `invalidateblock` won't continue after restart and will need to be called again) * The validation queue is prevented from overflowing (meaning `invalidateblock` on a very old block will not drive bitcoind OOM) (see #14289). * `invalidateblock` won't bother to move transactions back into the mempool after 10 blocks (optimization). This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks. I have manually tested the `invalidateblock` changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade). Tree-SHA512: 692e42758bd3d3efc2eb701984a8cb5db25fbeee32e7575df0183a00d0c2c30fdf72ce64c7625c32ad8c8bdc56313da72a7471658faeb0d39eefe39c4b8b8474

…indBlockIndex Summary: bitcoin/bitcoin@9ce9c37 --- This thing has been merged (D4803) and reverted (D4959) before. Seems to be working correctly now, with the lock to m_cs_chainstate in ActivateBestChain having a corresponding one in UnwindBlock. Concludes backport of Core [[bitcoin/bitcoin#15402 | PR15402]] Test Plan: cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ninja check check-functional also build without this patch, invalidate block #6 or similarly deep, see that bitcoind memory usage just keeps going up, even past 30% on my 16GB machine. build with this patch and do the same, memory usage stays steady at ~12% no matter how far you disconnect blocks. Reviewers: #bitcoin_abc, jasonbcox, deadalnix Reviewed By: #bitcoin_abc, jasonbcox, deadalnix Subscribers: jasonbcox Differential Revision: https://reviews.bitcoinabc.org/D7213

…ing destructor fab6b07 test: txindex: interrupt threadGroup before calling destructor (MarcoFalke) Pull request description: Fixes the data races with the tread sanitizer such as * https://travis-ci.org/MarcoFalke/bitcoin/jobs/492330554 * bitcoin#15402 (comment) * ... Tree-SHA512: 40608c70d92a1dd68efc1d41eecc8e2fb7738508e21f91f0ad353adcceed60fa624f15bf72a5b69a9444157b261183abbe9fc4cc5dd8aebc1c49506b239e8e88

519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: This PR makes a number of improvements to the InvalidateBlock (`invalidateblock` RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition: * They're made safely interruptible (`bitcoind` can be shutdown, and no progress in either will be lost, though if incomplete, `invalidateblock` won't continue after restart and will need to be called again) * The validation queue is prevented from overflowing (meaning `invalidateblock` on a very old block will not drive bitcoind OOM) (see bitcoin#14289). * `invalidateblock` won't bother to move transactions back into the mempool after 10 blocks (optimization). This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks. I have manually tested the `invalidateblock` changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade). Tree-SHA512: 692e42758bd3d3efc2eb701984a8cb5db25fbeee32e7575df0183a00d0c2c30fdf72ce64c7625c32ad8c8bdc56313da72a7471658faeb0d39eefe39c4b8b8474

fanquake added the Validation label Feb 14, 2019

sipa force-pushed the 201902_limitrewindinvalidate branch 3 times, most recently from d6ba031 to 01b0066 Compare February 14, 2019 02:08

maflcko reviewed Feb 14, 2019

View reviewed changes

src/test/txindex_tests.cpp Outdated Show resolved Hide resolved

laanwj added this to the 0.18.0 milestone Feb 14, 2019

maflcko mentioned this pull request Feb 14, 2019

test: txindex: interrupt threadGroup before calling destructor #15410

Merged

sipa force-pushed the 201902_limitrewindinvalidate branch 2 times, most recently from 3ef4214 to 3af503b Compare February 14, 2019 23:37

promag reviewed Feb 18, 2019

View reviewed changes

maflcko reviewed Feb 18, 2019

View reviewed changes

src/rpc/blockchain.cpp Outdated Show resolved Hide resolved

promag reviewed Feb 18, 2019

View reviewed changes

src/rpc/blockchain.cpp Outdated Show resolved Hide resolved

ryanofsky reviewed Feb 22, 2019

View reviewed changes

Sjors reviewed Feb 22, 2019

View reviewed changes

sipa added 2 commits February 24, 2019 12:38

Abstract EraseBlockData out of RewindBlockIndex

9d6dcc5

Note that the former 'else' branch in RewindBlockIndex is now dealt with more naturally inside the EraseBlockData call (by checking whether the parent needs to be re-added as candidate after deleting a child).

Move erasure of non-active blocks to a separate loop in RewindBlockIndex

32b2696

This lets us simplify the iteration to just walking back in the chain, rather than looping over all of mapBlockIndex.

sipa force-pushed the 201902_limitrewindinvalidate branch from 3af503b to 7bcac88 Compare February 24, 2019 21:31

Merge the disconnection and erasing loops in RewindBlockIndex

1d34287

sipa force-pushed the 201902_limitrewindinvalidate branch from 7bcac88 to d4ad413 Compare February 25, 2019 02:43

sipa added 4 commits February 24, 2019 18:55

Release cs_main during RewindBlockIndex operation

436f7d7

Call RewindBlockIndex without cs_main held

880ce7d

Make RewindBlockIndex interruptible

241b2c7

Call InvalidateBlock without cs_main held

9b1ff5c

sipa force-pushed the 201902_limitrewindinvalidate branch from d4ad413 to ac64328 Compare February 25, 2019 02:55

Sjors approved these changes Feb 25, 2019

View reviewed changes

src/validation.cpp Show resolved Hide resolved

maflcko approved these changes Feb 26, 2019

View reviewed changes

src/validation.cpp Show resolved Hide resolved

src/validation.cpp Show resolved Hide resolved

Sjors approved these changes Mar 1, 2019

View reviewed changes

promag reviewed Mar 3, 2019

View reviewed changes

Make last disconnected block BLOCK_FAILED_VALID, even when aborted

519b0bc

laanwj added the Needs backport label Mar 4, 2019

maflcko mentioned this pull request Mar 6, 2019

0.18: Granular invalidateblock and RewindBlockIndex #15552

Merged

maflcko removed the Needs backport label Mar 6, 2019

laanwj merged commit 519b0bc into bitcoin:master Mar 7, 2019

Empact mentioned this pull request Mar 9, 2019

Extract AppInitLoadBlockIndex from AppInitMain #13582

Closed

sdaftuar mentioned this pull request Jul 24, 2019

Assertion `setBlockIndexCandidates.count(pindex)' failed #16444

Closed

ariard mentioned this pull request Dec 16, 2019

Add ChainstateManager, remove BlockManager global #17737

Merged

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021

Granular invalidateblock and RewindBlockIndex #15402

Granular invalidateblock and RewindBlockIndex #15402

Uh oh!

Conversation

sipa commented Feb 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

DrahtBot commented Feb 14, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Conflicts

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ryanofsky Feb 22, 2019

Choose a reason for hiding this comment

Uh oh!

sipa Feb 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Sjors left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Sjors Feb 22, 2019

Choose a reason for hiding this comment

Uh oh!

Sjors left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

maflcko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

promag commented Feb 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

promag commented Feb 27, 2019

Uh oh!

promag commented Feb 28, 2019

Uh oh!

sipa commented Feb 28, 2019

Uh oh!

Sjors left a comment

Choose a reason for hiding this comment

Uh oh!

Sjors Mar 1, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gmaxwell commented Mar 2, 2019

Uh oh!

promag Mar 3, 2019

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sipa commented Mar 3, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmaxwell commented Mar 4, 2019

Uh oh!

sipa commented Mar 6, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

sipa commented Feb 14, 2019 •

edited

Loading

DrahtBot commented Feb 14, 2019 •

edited

Loading

sipa Feb 24, 2019 •

edited

Loading

promag commented Feb 27, 2019 •

edited

Loading

sipa commented Mar 3, 2019 •

edited

Loading