Skip to content

Conversation

@Sjors
Copy link
Member

@Sjors Sjors commented Nov 22, 2024

Ensure KernelNotifications m_tip_block is set even if no new block arrives.

Suggested in #31297 (comment)

@DrahtBot
Copy link
Contributor

DrahtBot commented Nov 22, 2024

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31346.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK ryanofsky, mzumsande, TheCharlatan

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #31393 (refactor: Move GuessVerificationProgress into ChainstateManager by maflcko)
  • #31325 (Make m_tip_block std::optional by Sjors)
  • #31283 (Add waitNext() to BlockTemplate interface by Sjors)
  • #31177 (rpc, logging: return "verificationprogress" of 1 when up to date by polespinasa)
  • #29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@maflcko
Copy link
Member

maflcko commented Nov 22, 2024

This is incomplete. You'll have to update the docs as well: "If the tip was not connected on
* startup, this will wait.", and "which may
//! be true even long after startup, until shutdown.". Also, in init.cpp you can remove the if-guard?

@Sjors Sjors force-pushed the 2024/11/init_m_tip_block branch from b6d5be1 to 57e15ce Compare November 22, 2024 14:18
@Sjors
Copy link
Member Author

Sjors commented Nov 22, 2024

I updated the waitTipChanged() and m_tip_block doc.

@maflcko which if-guard do you mean?

@ryanofsky
Copy link
Contributor

@maflcko which if-guard do you mean?

I think probably this one:

    // Wait for genesis block to be processed
    if (WITH_LOCK(chainman.GetMutex(), return chainman.ActiveTip() == nullptr)) {

Removing the if() might help verify that m_tip_block is set reliably

@maflcko
Copy link
Member

maflcko commented Nov 22, 2024

Also, the ZERO workaround could be removed completely, ensuring that any interfaces will only be spun up after that point in init?

@Sjors Sjors force-pushed the 2024/11/init_m_tip_block branch from 57e15ce to 5c45425 Compare November 22, 2024 14:42
@Sjors
Copy link
Member Author

Sjors commented Nov 22, 2024

I dropped the if guard.

ensuring that any interfaces will only be spun up after that point in init

It could, but I'm not sure if it's worth putting such a restriction on the interfaces.

Similarly we have startupnotify which happens at the end of init. If we also ensure that the interfaces are ready to listen before this notification, then an application launched with systemd (or similiar) can wait for this signal before trying to connect.

If on the other hand an external application is started at the same time, it would just have to keep trying to connect until we listen. Such a race would exist even if we opened the interface a bit earlier in the init process, though it would be shorter.

@ryanofsky
Copy link
Contributor

It could, but I'm not sure if it's worth putting such a restriction on the interfaces.

Tend to agree for now. Having waitTipChanged wait for a tip to be connected seems like a natural thing to implement and might avoid the need for clients to poll the node on startup. But I think this check could be dropped in BlockTemplate::waitNext in #31283 (maybe replaced with an assert there) since it should not be possible to get a blocktemplate reference before the tip is connected. After #31283 it might make sense to drop the waitTipChanged method entirely too.

@Sjors
Copy link
Member Author

Sjors commented Nov 22, 2024

Several reindex related tests are broken, investigating...

After #31283 it might make sense to drop the waitTipChanged method entirely too.

I think it's generically useful though. There are various types of applications that need to know when a new block is added to the chain (and have no need for a template).

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Debug: https://github.com/bitcoin/bitcoin/runs/33386413030

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

@Sjors
Copy link
Member Author

Sjors commented Nov 22, 2024

Specifically -reindex-chainstate is broken, easy to reproduce with e.g. testnet4.

It stops after the second block:

2024-11-22T16:44:14.338986Z Setting NODE_NETWORK on non-prune mode
2024-11-22T16:44:14.339014Z Wait for genesis block to be processed
2024-11-22T16:44:14.339023Z initload thread start
2024-11-22T16:44:14.341109Z UpdateTip: new best=00000000da84f2bafbbc53dee25a72ae507ff4914b867c565be350b0da8bf043 height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2024-05-03T23:11:00Z' progress=0.000001 cache=0.3MiB(0txo)
2024-11-22T16:44:14.341179Z UpdateTip: new best=0000000012982b6d5f621229286b880e909984df669c2afabb102ce311b13f28 height=1 version=0x20000000 log2_work=33.000022 tx=2 date='2024-05-06T13:08:48Z' progress=0.000002 cache=0.3MiB(1txo)

Putting the if-guard back "fixes" it. When I put a log statement before WAIT_LOCK and after the wait function returns, I can see that this lines are called regardless of the if-guard.

Turns out the only critical thin is that the WAIT_LOCK stuff lives in its own scope.

src/init.cpp Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that this will only ever actually wait in three cases:
1.) first startup with an empty datadir
2.) reindex
3.) reindex-chainstate

where the genesis block is being connected by the initload thread, whereas otherwise the m_tip_block_cv wait condition should be satisfied because m_tip_block should have been set earlier by the same thread. So removing the if guard is not necessary but just a matter of taste.

After this PR, it will be harder to understand how the folllowing code block refers to the genesis block, so maybe the comment could be expanded to mention some of the above.

Copy link
Member Author

@Sjors Sjors Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drafted a comment, but I think it's wrong so I haven't pushed it yet:

    /* 
     * Wait for genesis block to be processed. Typically kernel_notifications.m_tip_block
     * has already been set, except in three cases where the genesis block is
     * being connected by the initload thread:
     *
     * 1. first startup with an empty datadir
     * 2. reindex
     * 3. reindex-chainstate
     */

I don't think the initload thread ever sets kernel_notifications.m_tip_block. The thread only deals with indexes derived from BaseIndex, which doesn't include CBlockIndex.

I'm still confused under what circumstance it would actually be unset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's clearly doing something though, because the following breaks several functional tests:

--- a/src/init.cpp
+++ b/src/init.cpp
@@ -1792,13 +1792,14 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
     });
 
     // Wait for genesis block to be processed
     {
         WAIT_LOCK(kernel_notifications.m_tip_block_mutex, lock);
-        kernel_notifications.m_tip_block_cv.wait(lock, [&]() EXCLUSIVE_LOCKS_REQUIRED(kernel_notifications.m_tip_block_mutex) {
-            return !kernel_notifications.m_tip_block.IsNull() || ShutdownRequested(node);
-        });
+        Assert(!kernel_notifications.m_tip_block.IsNull());
+        // kernel_notifications.m_tip_block_cv.wait(lock, [&]() EXCLUSIVE_LOCKS_REQUIRED(kernel_notifications.m_tip_block_mutex) {
+        //     return !kernel_notifications.m_tip_block.IsNull() || ShutdownRequested(node);
+        // });
     }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some manual testing I indeed get all three conditions from your comment to hit that assert. I'm just confused why.

Copy link
Member Author

@Sjors Sjors Nov 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outside the context of a snapshot, we only call LoadChainTip() if:

if (!is_coinsview_empty(chainstate)) {
// LoadChainTip initializes the chain based on CoinsTip()'s best block
if (!chainstate->LoadChainTip()) {
return {ChainstateLoadStatus::FAILURE, _("Error initializing block database")};

That explains why these three scenarios are different. With an empty datadir or when we -reindex(-chainstate) the UTXO set is empty[0].

Still looking into where it is set.

[0] more accurately, doesn't point to a tip or we asked to wipe it. The name is misleading, it doesn't matter if by some magic every UTXO ends up spent.

auto is_coinsview_empty = [&](Chainstate* chainstate) EXCLUSIVE_LOCKS_REQUIRED(::cs_main) {
return options.wipe_chainstate_db || chainstate->CoinsTip().GetBestBlock().IsNull();
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, so is the scenario specifically the case when a node started with -reindex is stopped before the final ActivateBestChain call? If it is stopped during the final ActivateBestChain call it seems like it would be picking up work where it left off when ActivateBestChain was called again, due to LoadChainTip setting the previous tip.

we will call already ActivateBestChain early, when the best header is somewhere mid-chain, connect a first batch of blocks to the chainstate, then reindex the remaining block files, and then connect a second batch.

This is referring to the ActivateBestChain call in LoadExternalBlockFile call you linked to earlier? And removing this call would fix the weirdness and assumevalid issue? It does seem like it would be nice to remove this call. I'm not sure what the point of connecting the genesis block early but not connecting any other blocks is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, so is the scenario specifically the case when a node started with -reindex is stopped before the final ActivateBestChain call?

Yes.

This is referring to the ActivateBestChain call in LoadExternalBlockFile call you linked to earlier?

Yes!

And removing this call would fix the weirdness and assumevalid issue? It does seem like it would be nice to remove this call. I'm not sure what the point of connecting the genesis block early but not connecting any other blocks is.

I don't think we'd want to remove it completely, because in case of a -reindex it allows init to complete normal startup without having to wait for the import thread to finish reindexing block files (which can take a long time for mainnet). I think that @l0rinc is looking into possible fixes, but without having tested anything my first idea was to only call ActivateBestChain there it if the chain is empty, so it only doesn't get called in the restart scenario where we already loaded Genesis during LoadChainTip.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in my backlog, but I would be very happy if someone who already understands this fixes it instead :)
I just noticed it during Core Dev that my perf logs are full of script validations when trying out different scenarios and cancelling a reindex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in my backlog, but I would be very happy if someone who already understands this fixes it instead :)

Ok thanks, I'll try out my suggestion from above and will open a PR if it's as straightforward as I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sjors Sjors force-pushed the 2024/11/init_m_tip_block branch from 3317123 to a6ee3e1 Compare November 25, 2024 08:44
@Sjors
Copy link
Member Author

Sjors commented Nov 25, 2024

I added a check to ensure we're not the background chainstate when sending the notification.

@Sjors Sjors force-pushed the 2024/11/init_m_tip_block branch from a6ee3e1 to be78eaa Compare November 25, 2024 10:55
@Sjors
Copy link
Member Author

Sjors commented Nov 25, 2024

I added a comment based on #31346 (comment) to clarify under which circumstances m_tip_block is unset at this init stage.

I find it quite confusing that ActivateBestChain is called by ImportBlocks even if no blocks are imported, but haven't dared to refactor it.

@maflcko
Copy link
Member

maflcko commented Nov 25, 2024

I find it quite confusing that ActivateBestChain is called by ImportBlocks even if no blocks are imported, but haven't dared to refactor it.

It is explained in the code comment above the line that does it. It was added as a fix for #2239

Copy link
Contributor

@ryanofsky ryanofsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review ACK be78eaa9831be06bd94438d1f10ec8879b5c756d. Having LoadChainTip call blockTip makes a lot of sense and removes a confusing special case here and in followup PRs (#31297, #31283). The new comments are also really nice IMO.

Copy link
Contributor

@mzumsande mzumsande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just a request to change one comment, will A CK after.

Copy link
Contributor

@mzumsande mzumsande Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment change is not correct and should just be dropped: m_blockfiles_indexed is true (its default value) in case of an empty datadir, false in case of a reindex.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, then I'm confused about what causes case (1): #31346 (comment)

Copy link
Contributor

@mzumsande mzumsande Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With an empty datadir, we don't have the genesis connected at this point. But we don't enter the reindex if clause here (the comment only refers to that), we just skip this and the loadblock block and go to the ActivateBestChain call below which connects Genesis and allows the init thread to continue.

@Sjors