-
Notifications
You must be signed in to change notification settings - Fork 38.8k
Granular invalidateblock and RewindBlockIndex #15402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d6ba031 to
01b0066
Compare
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
3ef4214 to
3af503b
Compare
…ing destructor fab6b07 test: txindex: interrupt threadGroup before calling destructor (MarcoFalke) Pull request description: Fixes the data races with the tread sanitizer such as * https://travis-ci.org/MarcoFalke/bitcoin/jobs/492330554 * bitcoin#15402 (comment) * ... Tree-SHA512: 40608c70d92a1dd68efc1d41eecc8e2fb7738508e21f91f0ad353adcceed60fa624f15bf72a5b69a9444157b261183abbe9fc4cc5dd8aebc1c49506b239e8e88
src/validation.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "Release cs_main during RewindBlockIndex operation" (8994e394c0b4e76616c10acdd309d0225eabad9e)
Could you add some rationale in this comment saying why this is done here? Previous behavior of erasing in one place instead of two places seems simpler. Is this an optimization? Is it needed for correctness for some reason now that cs_main is released?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved to to a separate step instead of lumping it together with the nHeight calculation loop.
There's no need for it to be done while holding the lock. The reason it's separate is because the disconnect/erase of active blocks should be done simultaneously (so that things get erased even when interrupted, and erasing can't be done before disconnecting), but for non-mainchain blocks we need a separate loop still.
I've added some comments to elaborate, but I'll leave it to you to comment further or marking it as resolved.
Sjors
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested 3af503b on macOS 10.14.3 by calling invalidateblock 0000000000000000001c8018d9cb3b742ef25114f27563e3fc4a1902167f9893 (SegWit activation at height 481,824) from inside the QT console, without setting -dbcache.
On master memory usage exceeds 10 GB by block 558,000. It also required a force quit to abort.
With this PR memory stays well below 1 GB and I'm able to gracefully exit during this. It then marks the most recent block as invalid, so when you restart it doesn't suddenly start syncing to the tip again. For a user upgrading from pre-segwit, would they have to manually restart the process by calling reconsiderblock [...]?
Rewinding to SegWit activation took 7 hours on a one year old iMac. At this point it's probably faster to just wipe the chain and sync from scratch, but that's not for this PR. QT sync window won't update during this, but I can live with that (something to solve for the next soft fork...).
src/validation.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This note seems out of place now; it used to be above if (IsWitnessEnabled(pindexIter->pprev, params.GetConsensus()) && !(pindexIter->nStatus & BLOCK_OPT_WITNESS) && !chainActive.Contains(pindexIter))
Note that the former 'else' branch in RewindBlockIndex is now dealt with more naturally inside the EraseBlockData call (by checking whether the parent needs to be re-added as candidate after deleting a child).
This lets us simplify the iteration to just walking back in the chain, rather than looping over all of mapBlockIndex.
3af503b to
7bcac88
Compare
7bcac88 to
d4ad413
Compare
d4ad413 to
ac64328
Compare
Sjors
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maflcko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK ac64328b96fbe7475d2b12a9a0c2ba4787267204
I left some feedback in earlier commits, but then it solved itself after reading the later commits. Review felt like reading a story that tells itself. Would review again 10/10
|
Tested:
After these tests I felt that However when testing concurrent To reproduce apply the patch --- a/src/validation.cpp
+++ b/src/validation.cpp
@@ -2840,6 +2840,8 @@ bool CChainState::InvalidateBlock(CValidationState& state, const CChainParams& c
to_mark_failed = invalid_walk_tip;
}
+ MilliSleep(10000);
+
{
// Mark pindex itself as invalid, regardless of whether it was in the main chain or not.
LOCK(cs_main);Then:
|
|
Conceptually, should |
|
@sipa I did that, sleeps all over the place, and that case sounded like scratching the chain. I'll repeat the tests. |
|
@promag An alternative is only sleeping when "disconnected==1" (which would be in the middle for any invalidate that's longer than 1 deep, but not sleep all the time). It's possible that the resulting effect is a reconsider and invalidate that start fighting with eachother and disconnect/connect the same blocks over and over again. It's not too hard to detect this scenario inside invalidateblock and bail out, but I'd rather not overload the change here further with such edge cases. |
Sjors
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 8d22041. Several improvements in 9bb32eb Release cs_main during InvalidateBlock iterations since my previous tACK. It's more readable, e.g thanks to the additional variable to_mark_failed.
In the event this PR needs further changes (hopefully not) consider splitting that commit; moving LOCK(cs_main); into the loop and adding if (ShutdownRequested()) break; vs. everything else.
I would also prefer dealing with edge cases with concurrent RPC calls in a later PR. We can point out in the release note that it's currently not 100% safe to do that.
src/validation.cpp
Outdated
| setBlockIndexCandidates.erase(pindex); | ||
| m_failed_blocks.insert(pindex); | ||
| { | ||
| // Mark pindex (or the last disconnected block) as invalid, regardless of whether it was in the main chain or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, move this comment below the interference check, replace "something is interfering" with "some other thread may have interfered before we reestablished the cs_main lock".
|
ACK |
| LOCK(cs_main); | ||
| if (chainActive.Contains(to_mark_failed)) { | ||
| // If the to-be-marked invalid block is in the active chain, something is interfering and we can't proceed. | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting that return value is not used.
|
I added an extra commit that solves the situation where if you used I've now tested a few more |
|
re-ACK |
|
I think this is ready. |
519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: Same repo and branch like "Granular invalidateblock and RewindBlockIndex #15402 ", can be merged as is. This saves us all the cherry-picks and review of the backport cherry-picks. Tree-SHA512: 20c27c5f807c3d85e0072f9e2cdefad4ad7d329d6b26658a00844d5fcf0ed729059daf765e04e6382db2b5915117b15949cd4989d864917ab105c92e2e5e9986
519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: This PR makes a number of improvements to the InvalidateBlock (`invalidateblock` RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition: * They're made safely interruptible (`bitcoind` can be shutdown, and no progress in either will be lost, though if incomplete, `invalidateblock` won't continue after restart and will need to be called again) * The validation queue is prevented from overflowing (meaning `invalidateblock` on a very old block will not drive bitcoind OOM) (see #14289). * `invalidateblock` won't bother to move transactions back into the mempool after 10 blocks (optimization). This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks. I have manually tested the `invalidateblock` changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade). Tree-SHA512: 692e42758bd3d3efc2eb701984a8cb5db25fbeee32e7575df0183a00d0c2c30fdf72ce64c7625c32ad8c8bdc56313da72a7471658faeb0d39eefe39c4b8b8474
…indBlockIndex Summary: bitcoin/bitcoin@9ce9c37 --- This thing has been merged (D4803) and reverted (D4959) before. Seems to be working correctly now, with the lock to m_cs_chainstate in ActivateBestChain having a corresponding one in UnwindBlock. Concludes backport of Core [[bitcoin/bitcoin#15402 | PR15402]] Test Plan: cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ninja check check-functional also build without this patch, invalidate block #6 or similarly deep, see that bitcoind memory usage just keeps going up, even past 30% on my 16GB machine. build with this patch and do the same, memory usage stays steady at ~12% no matter how far you disconnect blocks. Reviewers: #bitcoin_abc, jasonbcox, deadalnix Reviewed By: #bitcoin_abc, jasonbcox, deadalnix Subscribers: jasonbcox Differential Revision: https://reviews.bitcoinabc.org/D7213
…ing destructor fab6b07 test: txindex: interrupt threadGroup before calling destructor (MarcoFalke) Pull request description: Fixes the data races with the tread sanitizer such as * https://travis-ci.org/MarcoFalke/bitcoin/jobs/492330554 * bitcoin#15402 (comment) * ... Tree-SHA512: 40608c70d92a1dd68efc1d41eecc8e2fb7738508e21f91f0ad353adcceed60fa624f15bf72a5b69a9444157b261183abbe9fc4cc5dd8aebc1c49506b239e8e88
…ing destructor fab6b07 test: txindex: interrupt threadGroup before calling destructor (MarcoFalke) Pull request description: Fixes the data races with the tread sanitizer such as * https://travis-ci.org/MarcoFalke/bitcoin/jobs/492330554 * bitcoin#15402 (comment) * ... Tree-SHA512: 40608c70d92a1dd68efc1d41eecc8e2fb7738508e21f91f0ad353adcceed60fa624f15bf72a5b69a9444157b261183abbe9fc4cc5dd8aebc1c49506b239e8e88
519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: This PR makes a number of improvements to the InvalidateBlock (`invalidateblock` RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition: * They're made safely interruptible (`bitcoind` can be shutdown, and no progress in either will be lost, though if incomplete, `invalidateblock` won't continue after restart and will need to be called again) * The validation queue is prevented from overflowing (meaning `invalidateblock` on a very old block will not drive bitcoind OOM) (see bitcoin#14289). * `invalidateblock` won't bother to move transactions back into the mempool after 10 blocks (optimization). This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks. I have manually tested the `invalidateblock` changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade). Tree-SHA512: 692e42758bd3d3efc2eb701984a8cb5db25fbeee32e7575df0183a00d0c2c30fdf72ce64c7625c32ad8c8bdc56313da72a7471658faeb0d39eefe39c4b8b8474
519b0bc Make last disconnected block BLOCK_FAILED_VALID, even when aborted (Pieter Wuille) 8d22041 Optimization: don't add txn back to mempool after 10 invalidates (Pieter Wuille) 9ce9c37 Prevent callback overruns in InvalidateBlock and RewindBlockIndex (Pieter Wuille) 9bb32eb Release cs_main during InvalidateBlock iterations (Pieter Wuille) 9b1ff5c Call InvalidateBlock without cs_main held (Pieter Wuille) 241b2c7 Make RewindBlockIndex interruptible (Pieter Wuille) 880ce7d Call RewindBlockIndex without cs_main held (Pieter Wuille) 436f7d7 Release cs_main during RewindBlockIndex operation (Pieter Wuille) 1d34287 Merge the disconnection and erasing loops in RewindBlockIndex (Pieter Wuille) 32b2696 Move erasure of non-active blocks to a separate loop in RewindBlockIndex (Pieter Wuille) 9d6dcc5 Abstract EraseBlockData out of RewindBlockIndex (Pieter Wuille) Pull request description: This PR makes a number of improvements to the InvalidateBlock (`invalidateblock` RPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition: * They're made safely interruptible (`bitcoind` can be shutdown, and no progress in either will be lost, though if incomplete, `invalidateblock` won't continue after restart and will need to be called again) * The validation queue is prevented from overflowing (meaning `invalidateblock` on a very old block will not drive bitcoind OOM) (see bitcoin#14289). * `invalidateblock` won't bother to move transactions back into the mempool after 10 blocks (optimization). This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks. I have manually tested the `invalidateblock` changes (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade). Tree-SHA512: 692e42758bd3d3efc2eb701984a8cb5db25fbeee32e7575df0183a00d0c2c30fdf72ce64c7625c32ad8c8bdc56313da72a7471658faeb0d39eefe39c4b8b8474

This PR makes a number of improvements to the InvalidateBlock (
invalidateblockRPC) and RewindBlockIndex functions, primarily around breaking up their long-term cs_main holding. In addition:bitcoindcan be shutdown, and no progress in either will be lost, though if incomplete,invalidateblockwon't continue after restart and will need to be called again)invalidateblockon a very old block will not drive bitcoind OOM) (see Unbounded growth of scheduler queue #14289).invalidateblockwon't bother to move transactions back into the mempool after 10 blocks (optimization).This is not an optimal solution, as we're relying on the scheduler call sites to make sure the scheduler doesn't overflow. Ideally, the scheduler would guarantee this directly, but that needs a few further changes (moving the signal emissions out of cs_main) to prevent deadlocks.
I have manually tested the
invalidateblockchanges (including interrupting, and running with -checkblockindex and -checkmempool), but haven't tried the rewinding (which is probably becoming increasingly unnecessary, as very few pre-0.13.1 nodes remain that would care to upgrade).