-
Notifications
You must be signed in to change notification settings - Fork 38.7k
validation: Improve error handling when VerifyDB dosn't finish successfully #25574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I have tested the PR the command |
|
In the future the RPC can probably be extended with a "warnings" and "error" field, but this patch lgtm. |
|
You mean changing the RPC return response prior adding objects for warning and error, It might require to or add new fields in the RPC? This might mean that changing the way return is responding in RPC? |
|
Note that this doesn't only affect the |
The above methods aborts, the code does produce assertion and thus abort, are you suggesting that it should rather identify for the user to simply ignore or warn the user that there is an issue with the verification of the chain, since we are starting up? I was thinking to simply ignore the assertion or probably by pass the assertion by catching it some way rather abort() and probably it will help for the better and cleaner user inspection. |
This PR reduces the number of checks that are being made when there is not sufficient ccache memory, so that the assertion can't hit anymore (assertions can't be ignored in general). |
|
Right, so it should be taken care by default or by input, as in this there is an input , so that means a user is always triggering the output. Now, this is my suggestion, that we should perhaps make this part of the default startup and should throw some debut output and continue with the startup process.? What do you think? @mzumsande |
Not sure I understand. |
|
I was under the assumption that you are talking about VerifyDB() handling the checks like the above on its own on the startup, rather the rpc doing the check ! |
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ReviewsSee the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
fanquake
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK b7adbca2061e5a88130f61e2744885ab97340d5c - Could be improved in follow ups. i.e if a user has explictly set -checklevel=3||4, it's probably better to abort on too-small dbcache/where we can't actually perform the checks, rather than log a warning.
I also agree that changing verifychain to return warnings/errors is also a better behaviour than returning true (and just logging) when you've actually skipped level 3 & 4 checks.
|
Happy to incorporate the suggestions from above here (no need for a follow-up). |
57f367a to
ad2a1f8
Compare
|
I expanded this PR to incorporate the feedback (and rebased):
Will adjust OP and title. |
ad2a1f8 to
a1c8316
Compare
pablomartin4btc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested ACK, I've managed to reproduce the issue before the fix as follows in the screenshots:
From the bitcoind execution, level 3 & 4 from cli (which crashes it)
From the client terminal, level 3 or 4 both return true (this one before it crashes)
After applying the fix, both levels return false:
And bitcoind doesn't crash anymore (capturing both levels):
|
Just FYI, I was unable to reproduce the problem with |
hernanmarino
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested ACK , everything running as expected.
|
ACK a1c83166f33d44a0909d8789d3a4bc8b2539842e |
maflcko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
review ACK a1c83166f33d44a0909d8789d3a4bc8b2539842e 💁
Show signature
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
review ACK a1c83166f33d44a0909d8789d3a4bc8b2539842e 💁
-----BEGIN PGP SIGNATURE-----
iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUh5hgv9Eu0s1Ko/vpJoWfVuakYVEU7HmFMAqGM+iJtFO+eWSciUcfyMdMY+Pi7L
rS76o5lIRufk5x6CE7PaeOlxpfbMbKy7AEj13ps4cJiXEtFw/QPW4IhfGxgbxp8e
z7GBk7R9UzRhDYz3DF3MNp5M4m3xvQ/fQb2VlQKtjSgQ65vD4c1oLLhp48bMj/+B
/QNVUOw0L6AsSI6EDjp3SWviDtfXyMxQiCpmkS1t+iM5rAuVXsJVfSUldTP5V6SP
QFa5yNoomUIYXVwAOs1dHwj2G8r15r/wzZH8ucLz2JbGYiTsGgJszUhmrF1SLJlD
cTGZnw4/CI9bbiRKQj+Cbp/bZiwwYqe972w7OBugWlqp8YM5xYAcrjVRFUHzn15B
CvW6+LGE1QMQZSPRdjkea0F48CtxJ43WD8udtDNtor4Ehh7Lwx5BjQ3lhIsWa3Gp
LejMpwMO2ceG8TQaibc80ARA/ahg2YQpqyoZW+KJDMzqgTvTKJaBn0fq07j1RzEu
UodOx8e6
=Jqoz
-----END PGP SIGNATURE-----
ryanofsky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 42b192f0cbf9c04da111145c921344b0881b3ce3 with some suggestions and bikeshedding (feel free to ignore)
src/node/chainstate.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "init, validation: Improve handling if VerifyDB() fails due to insufficient dbcache" (cf307244db97e7b9e29a3d0ff948a6fa7877b548)
true might be a safer default for libbitcoinkernel. Also the name of this option ChainstateLoadOptions::fail_on_insufficient_dbcache seems a little misleading because it seems like it is referring to insuffcient dbcache for loading, not insufficient_dbcache for verification. Would maybe change it to ChainstateLoadOptions::require_full_verification with a comment that setting the option to false can allow using a smaller dbcache size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to ChainstateLoadOptions::fail_on_insufficient_dbcache
also changed the default, although it's never used because the bool is set during init.
I don't understand the comment. require_full_verification is not an option the user can set, it's value is determined by -checkblocks and -checklevel. So if a user wants to run with an extremely low dbcache, they should change those parameters (or just keep the default so that init will only warn).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #25574 (comment)
Thanks for implementing the suggestion. This comment was not about bitcoind, but about code that uses libbitcoinkernel such as:
bitcoin/src/bitcoin-chainstate.cpp
Lines 93 to 96 in fe1b325
| node::ChainstateLoadOptions options; | |
| options.check_interrupt = [] { return false; }; | |
| auto [status, error] = node::LoadChainstate(chainman, cache_sizes, options); | |
| if (status != node::ChainstateLoadStatus::SUCCESS) { |
Code like this has it's own options and defaults, and is not aware of checkblocks or checklevel. It is just calling and using the LoadChainstate function directly. From the perspective of a LoadChainstate caller, I think fail_on_insufficient_dbcache could have been misleading because it sounds like it would allow loading with an insufficient dbcache, not just verifying with an insufficient dbcache. I do think in either case it wouldn't be bad for the option to have a code comment, so I suggested one below.
This does not change behavior. It is in preparation for special handling of the case where VerifyDB doesn't finish for various reasons, but doesn't fail.
…terrupted This means that the -verifydb RPC will now return false if it cannot finish due to the node being shutdown.
42b192f to
8f4753a
Compare
…cient dbcache The rpc command verifychain now fails if the dbcache was not sufficient to complete the verification at the specified level and depth. In the same situation, the VerifyDB check during Init will now fail (and lead to an early shutdown) if the user has explicitly specified -checkblocks or -checklevel but the check couldn't be executed because of the limited cache. If the user didn't change any of the two and is using the defaults, log a warning but don't prevent the node from starting up.
Now the verifychain RPC returns false if the checks didn't finish because the blocks requested to be queried have been pruned.
8f4753a to
0af16e7
Compare
|
Addressed comments by @MarcoFalke and @ryanofsky, thanks for the reviews! |
ryanofsky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code review ACK 0af16e7. Only small suggested changes since the last review, like renaming some of the enum values. I did leave more suggestions, but they are not very important and could be followups
src/node/chainstate.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #25574 (comment)
Thanks for implementing the suggestion. This comment was not about bitcoind, but about code that uses libbitcoinkernel such as:
bitcoin/src/bitcoin-chainstate.cpp
Lines 93 to 96 in fe1b325
| node::ChainstateLoadOptions options; | |
| options.check_interrupt = [] { return false; }; | |
| auto [status, error] = node::LoadChainstate(chainman, cache_sizes, options); | |
| if (status != node::ChainstateLoadStatus::SUCCESS) { |
Code like this has it's own options and defaults, and is not aware of checkblocks or checklevel. It is just calling and using the LoadChainstate function directly. From the perspective of a LoadChainstate caller, I think fail_on_insufficient_dbcache could have been misleading because it sounds like it would allow loading with an insufficient dbcache, not just verifying with an insufficient dbcache. I do think in either case it wouldn't be bad for the option to have a code comment, so I suggested one below.
| bool reindex{false}; | ||
| bool reindex_chainstate{false}; | ||
| bool prune{false}; | ||
| bool require_full_verification{true}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "init, validation: Improve handling if VerifyDB() fails due to insufficient dbcache" (0c7785b)
Might be worth adding a comment to explain why you would set the option. Maybe:
//! Setting require_full_verification to true will require all checks at
//! check_level (below) to succeed for loading to succeed. Setting it to
//! false will skip checks if cache is not big enough to run them, so may be
//! helpful for running with a small cache.| // check level 0: read from disk | ||
| if (!ReadBlockFromDisk(block, pindex, consensus_params)) { | ||
| return error("VerifyDB(): *** ReadBlockFromDisk failed at %d, hash=%s", pindex->nHeight, pindex->GetBlockHash().ToString()); | ||
| LogPrintf("Verification error: ReadBlockFromDisk failed at %d, hash=%s\n", pindex->nHeight, pindex->GetBlockHash().ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "validation: Change return value of VerifyDB to enum type" (6360b53)
Commit message says this does not change behavior, but this is changing log prints slightly replacing VerifyDB(): prefix with Verification error:. This is probably a good change, but could be clarified in the commit message.
| options.check_blocks); | ||
| switch (result) { | ||
| case VerifyDBResult::SUCCESS: | ||
| case VerifyDBResult::INTERRUPTED: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In commit "validation: return VerifyDBResult::INTERRUPTED if verification was interrupted" (d6f781f)
It would be less surprising to have VerifyDBResult::INTERRUPTED return ChainstateLoadStatus::INTERRUPTED than ChainstateLoadStatus::SUCCESS. Another potentially good side effect would be that it would avoid the caller printing a misleading load time if the load didn't really succeed:
Lines 1520 to 1524 in fe1b325
| std::tie(status, error) = catch_exceptions([&]{ return VerifyLoadedChainstate(chainman, options);}); | |
| if (status == node::ChainstateLoadStatus::SUCCESS) { | |
| fLoaded = true; | |
| LogPrintf(" block index %15dms\n", Ticks<std::chrono::milliseconds>(SteadyClock::now() - load_block_index_start_time)); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - I'll open a small follow-up to address this in a few days.
john-moffett
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 0af16e7
maflcko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm re-ACK 0af16e7 🎚
Show signature
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512
lgtm re-ACK 0af16e7134459e0820ab95d751093876c1ec4c6d 🎚
-----BEGIN PGP SIGNATURE-----
iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUgCEAwAl2HOYuqTfTHatbl1G6IC16Itmtb0chf3GhxESu4SIE7uNvMXIXVK2WG1
k5Sz9TeW2POSCwZnJtxdAzhNKwzN6Q79yJ0KL+6SOMxX7j4CaWun5b1kJf1xOlQx
gHBqBN1zLW65Wfh1L3qa4nzKiWwKdJJ2iUAUiatdJGLxElYIox98uTdq9Z3PuSSb
Xu1ivXy/EdXu+nQMeogF5V+g/yZAdSKGYSgS2ldG8D7k8cvY5BxEM/0b1UAjYOrE
gDKR7Ac7tG6sIzdfv+oY1vdjzRPbaATf2LoQiFgv0x0tPECKX+R19/g5UrU3DCZ0
1HSsj5rcuPJpy5YrL517v7K/XgUJlUhUsg6Xx7gn+jvu0qRrlhsFCSpv0XLzn+0J
aIKCXWLTV+2bLcMcktl/0+c/iFpRiFHaO8q1E4ivBttoGf3cK8Jv1CFup/+yyPaK
Ld8w3sZXEwuASxMEI03GbFKb1o4RMQojSkm9PaimfU3frbUFLTtUiHnw6X6SMjZ4
8KN4cKiS
=q5XI
-----END PGP SIGNATURE-----
| enum class VerifyDBResult { | ||
| SUCCESS, | ||
| CORRUPTED_BLOCK_DB, | ||
| INTERRUPTED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit in d6f781f (commit message), only if you retouch:
I think the rpc is called verifychain and not -verifydb?
|
ACK 0af16e7 |
…RUPTED when verification was interrupted. c5825e1 doc: add explanation for fail_on_insufficient_dbcache (Ryan Ofsky) 7dff7da init: Return more fitting ChainStateLoadStatus if verification was interrupted (Martin Zumsande) Pull request description: This addresses two outstanding comments by ryanofsky from #25574: * return `ChainstateLoadStatus::INTERRUPTED` instead of `ChainstateLoadStatus::SUCCESS` if verification was stopped by an interrupt. This would coincide with straightforward expectation, and it avoids a misleading [log entry](https://github.com/mzumsande/bitcoin/blob/c5825e14f8999a8c5f5121027af9e07ac51ab42e/src/init.cpp#L1526) in `init` for the block index load time (because that would include the verificiation, which didn't complete). It shouldn't affect node behavior otherwise because the shutdown signal would be caught in init anyway. In test, this would lead to an assert ([link](https://github.com/mzumsande/bitcoin/blob/c5825e14f8999a8c5f5121027af9e07ac51ab42e/src/test/util/setup_common.cpp#L230)), which also makes more sense because benign interrupts are not expected there during init. This can be tested by setting a large value for `-checkblocks`, interrupting the node during block verification and observing the log. bitcoin/bitcoin#25574 (comment) * add documentation for `require_full_verification` bitcoin/bitcoin#25574 (comment) ACKs for top commit: MarcoFalke: thanks lgtm ACK c5825e1 Tree-SHA512: ca1c71a1b046d30083337dd9ef6d52e66fa1ac8c4ecd807716e4aa6a894179a81df41caee916fa30997fd6e0b284412a3c8f2919d19c29d826fb580ffb89fd73
…hen verification was interrupted. c5825e1 doc: add explanation for fail_on_insufficient_dbcache (Ryan Ofsky) 7dff7da init: Return more fitting ChainStateLoadStatus if verification was interrupted (Martin Zumsande) Pull request description: This addresses two outstanding comments by ryanofsky from bitcoin#25574: * return `ChainstateLoadStatus::INTERRUPTED` instead of `ChainstateLoadStatus::SUCCESS` if verification was stopped by an interrupt. This would coincide with straightforward expectation, and it avoids a misleading [log entry](https://github.com/mzumsande/bitcoin/blob/c5825e14f8999a8c5f5121027af9e07ac51ab42e/src/init.cpp#L1526) in `init` for the block index load time (because that would include the verificiation, which didn't complete). It shouldn't affect node behavior otherwise because the shutdown signal would be caught in init anyway. In test, this would lead to an assert ([link](https://github.com/mzumsande/bitcoin/blob/c5825e14f8999a8c5f5121027af9e07ac51ab42e/src/test/util/setup_common.cpp#L230)), which also makes more sense because benign interrupts are not expected there during init. This can be tested by setting a large value for `-checkblocks`, interrupting the node during block verification and observing the log. bitcoin#25574 (comment) * add documentation for `require_full_verification` bitcoin#25574 (comment) ACKs for top commit: MarcoFalke: thanks lgtm ACK c5825e1 Tree-SHA512: ca1c71a1b046d30083337dd9ef6d52e66fa1ac8c4ecd807716e4aa6a894179a81df41caee916fa30997fd6e0b284412a3c8f2919d19c29d826fb580ffb89fd73
Summary: This does not change behavior. It is in preparation for special handling of the case where VerifyDB doesn't finish for various reasons, but doesn't fail. This is a backport of [[bitcoin/bitcoin#25574 | core#25574]] bitcoin/bitcoin@6360b53 Depends on D14925 Test Plan: `ninja all check-all` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D14926
…terrupted Summary: This means that the verifydb RPC will now return false if it cannot finish due to the node being shutdown. This also avoids a misleading block index loadtime log entry in init. This is a partial backport of [[bitcoin/bitcoin#25574 | core#25574]] and [[bitcoin/bitcoin#27157 | core#27157]] bitcoin/bitcoin@d6f781f bitcoin/bitcoin@7dff7da Depends on D14926 Test Plan: Co-authored-by: Ryan Ofsky <[email protected]> `src/bitcoind checkblocks` Press Ctrl-C, check that it still exits gracefully and does not print a success message for the block verification or block loading. Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D14927
…cient dbcache Summary: The rpc command verifychain now fails if the dbcache was not sufficient to complete the verification at the specified level and depth. In the same situation, the VerifyDB check during Init will now fail (and lead to an early shutdown) if the user has explicitly specified -checkblocks or -checklevel but the check couldn't be executed because of the limited cache. If the user didn't change any of the two and is using the defaults, log a warning but don't prevent the node from starting up. This is a partial backport of [[bitcoin/bitcoin#25574 | core#25574]] bitcoin/bitcoin@0c7785b Depends on D14927 Test Plan: ``` $ src/bitcoin-cli verifychain 4 10000 false ``` ``` $ src/bitcoind -dbcache=10 ... 2023-12-06T13:33:04Z Verifying last 10000 blocks at level 4 2023-12-06T13:33:04Z Verification progress: 0% 2023-12-06T13:33:04Z Verification progress: 10% 2023-12-06T13:33:04Z Verification progress: 20% 2023-12-06T13:33:04Z Verification progress: 30% 2023-12-06T13:33:04Z Verification progress: 40% 2023-12-06T13:33:04Z Verification progress: 50% 2023-12-06T13:33:04Z Skipped verification of level >=3 (insufficient database cache size). Consider increasing -dbcache. ... ``` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D14928
Summary: Now the verifychain RPC returns false if the checks didn't finish because the blocks requested to be queried have been pruned. This is a partial backport of [[bitcoin/bitcoin#25574 | core#25574]] bitcoin/bitcoin@57ef2a4 Depends on D14928 Test Plan: `ninja && ninja check-functional` Reviewers: #bitcoin_abc, Fabien Reviewed By: #bitcoin_abc, Fabien Differential Revision: https://reviews.bitcoinabc.org/D14929
…successfully Summary: This concludes backport of [[bitcoin/bitcoin#25574 | core#25574]] Test Plan: proof reading Reviewers: #bitcoin_abc, Fabien, bytesofman Reviewed By: #bitcoin_abc, Fabien, bytesofman Subscribers: bytesofman, Fabien Differential Revision: https://reviews.bitcoinabc.org/D14930



VerifyDB()can fail to complete due to insufficient dbcache at the level 3 checks. This PR improves the error handling in this case in the following ways:-verifychainnow returns false if the check can't be completed due to insufficient cache-checkblocksand-checklevelare taken and the check doesn't complete. However, if the user actively specifies one of these args, we return with an InitError if we can't complete the check.This PR also changes
-verifychainRPC to returnfalseif the verification didn't finish due to missing block data (pruning) or due to being interrupted by the node being shutdown.Previously, this PR also included a fix for a possible assert during verification - this was done in #27009 (now merged).