[WIP] rebase: Call ProcessNewBlock() asynchronously #18963

dongcarl · 2020-05-12T20:56:39Z

This is a (currently naive) rebase of #16323, which is a rework of #16175, which is a rework of #12934.
Built on top of #17479.

Currently validationinterface_tests/unregister_all_during_call fails.

Goals:

Add as much documentation as possible to aid with review
Split up commits as much as possible to aid with review

dongcarl · 2020-05-13T18:27:48Z

Since this PR touches critical parts of the codebase, I think it deserves a thorough review. I'm going to go through the original PR (#16323) commit by commit and write down my notes here: https://docs.google.com/document/d/1tduRmqcvhdl3FRkmdnj0f3_fknXuIZ57R8zdCGHNt6k/edit?usp=sharing

9fdf05d resolved some lock inversion warnings in denialofservice_tests, but left in a number of cs_main locks that are unnecessary (introducing lock inversion warnings in future changes).

Co-authored-by: Carl Dong <[email protected]>

This is a pure refactor commit. This commit enables the caller of ProcessNewBlock to access the final BlockValidationState passed around between CheckBlock(), AcceptBlock(), and BlockChecked() inside ProcessNewBlock(). This is useful because in a future commit, we will move the BlockChecked() call out of ProcessNewBlock(), and BlockChecked() still needs to be able to access the BlockValidationState. Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>
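A minimal sketch of the refactor described above, assuming simplified stand-in types: `SketchValidationState`, `SketchBlock`, and both function names are hypothetical and are not the real Bitcoin Core signatures. It only illustrates moving the state object from a local variable to a caller-supplied reference.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for BlockValidationState; not the real class.
struct SketchValidationState {
    bool invalid = false;
    std::string reject_reason;
    void Invalid(std::string reason) { invalid = true; reject_reason = std::move(reason); }
    bool IsValid() const { return !invalid; }
};

// Hypothetical block type carrying a single check result.
struct SketchBlock { bool has_valid_pow = true; };

// Before: the state is a local, so the caller only learns pass/fail.
bool ProcessNewBlockOld(const SketchBlock& block)
{
    SketchValidationState state;
    if (!block.has_valid_pow) state.Invalid("high-hash"); // illustrative reason string
    return state.IsValid();
}

// After: the caller owns the state and can inspect it once the call returns.
bool ProcessNewBlockNew(const SketchBlock& block, SketchValidationState& state)
{
    if (!block.has_valid_pow) state.Invalid("high-hash"); // illustrative reason string
    return state.IsValid();
}
```

With the second form, a later commit can move the `BlockChecked()` call to the caller, because the caller still holds the final state.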

This is a pure refactor commit. Since BlockChecked() doesn't actually depend on all of PeerLogicValidation but just PeerLogicValidation's CConnman, we can make a standalone, static function that simply has an extra CConnman parameter and have the non-static version call the static one. This also means that, in a future commit, when we move the BlockChecked() call out of ProcessNewBlock(), the caller of ProcessNewBlock() can call BlockChecked() directly even if they only have a CConnman. Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>
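The static/non-static split described above could look roughly like the following. `SketchConnman` and `SketchPeerLogic` are hypothetical stand-ins for `CConnman` and `PeerLogicValidation`, and the parameters are simplified for illustration.

```cpp
#include <vector>

// Hypothetical stand-in for CConnman: records which peers were penalized.
struct SketchConnman {
    std::vector<int> punished_peers;
};

class SketchPeerLogic {
public:
    explicit SketchPeerLogic(SketchConnman& connman) : m_connman(connman) {}

    // Static version: needs only a connection manager, so any caller that
    // holds one can invoke it without a SketchPeerLogic instance.
    static void BlockChecked(SketchConnman& connman, int peer_id, bool block_valid)
    {
        if (!block_valid) connman.punished_peers.push_back(peer_id);
    }

    // Member version: forwards to the static one with its own connman.
    void BlockChecked(int peer_id, bool block_valid)
    {
        BlockChecked(m_connman, peer_id, block_valid);
    }

private:
    SketchConnman& m_connman;
};
```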

…ProcessNewBlock

Net processing now passes a BlockValidationState object into ProcessNewBlock(). If CheckBlock() or AcceptBlock() fails, then PNB returns to net processing without calling the (asynchronous) BlockChecked Validation Interface method. net processing can use the invalid BlockValidationState returned to punish peers.

CheckBlock() and AcceptBlock() represent the DoS checks on a block (ie PoW and malleability). Net processing wants to know about those failed checks immediately and shouldn't have to wait on a callback. Other validation interface clients don't care about net processing submitting bogus malleated blocks to validation, so they don't need to be notified of BlockChecked.

Furthermore, if PNB returns a valid BlockValidationState, we never need to try to process (non-malleated) copies of the block from other peers. That makes it much easier to move the best chain activation logic to a background thread in future work.

Co-authored-by: John Newbery <[email protected]>
Co-authored-by: Carl Dong <[email protected]>
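To illustrate the control flow this commit message describes, here is a small sketch under stated assumptions: `SketchNode`, `SketchBlockResult`, and the penalty of 100 are all hypothetical, and the "callback" is reduced to a counter. The point is only that failed DoS checks are reported to the caller right away and never reach the asynchronous notification path.

```cpp
#include <map>

enum class SketchBlockResult { PASSED_DOS_CHECKS, FAILED_DOS_CHECKS };

struct SketchNode {
    std::map<int, int> misbehavior_score;  // peer id -> score (hypothetical)
    int queued_blockchecked_callbacks = 0; // stands in for async BlockChecked

    // Returns immediately on a failed DoS check, without queueing a callback.
    SketchBlockResult ProcessNewBlock(bool pow_ok, bool not_malleated)
    {
        if (!pow_ok || !not_malleated) return SketchBlockResult::FAILED_DOS_CHECKS;
        ++queued_blockchecked_callbacks;
        return SketchBlockResult::PASSED_DOS_CHECKS;
    }

    // Net processing can punish the peer as soon as the call returns.
    void OnBlockFromPeer(int peer_id, bool pow_ok, bool not_malleated)
    {
        if (ProcessNewBlock(pow_ok, not_malleated) == SketchBlockResult::FAILED_DOS_CHECKS) {
            misbehavior_score[peer_id] += 100; // hypothetical penalty value
        }
    }
};
```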

This is a pure refactor commit. Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>

Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>

dongcarl · 2020-05-18T21:49:35Z

Rebased on newer #17479. git-bisecting the validationinterface_tests/unregister_all_during_call says:

194935b1a2968b594a42cb880e30701dd2e2bc7c is the first bad commit

ryanofsky · 2020-05-18T22:35:42Z

Rebased on newer #17479. git-bisecting the validationinterface_tests/unregister_all_during_call says:
194935b1a2968b594a42cb880e30701dd2e2bc7c is the first bad commit

The test was relying on the fact that BlockChecked was synchronous, and that commit made it asynchronous. You can switch it to a different synchronous method:

--- a/src/test/validationinterface_tests.cpp
+++ b/src/test/validationinterface_tests.cpp
@@ -57,15 +57,13 @@ public:
     {
         if (m_on_destroy) m_on_destroy();
     }
-    void BlockChecked(const CBlock& block, const BlockValidationState& state) override
+    void NewPoWValidBlock(const CBlockIndex *pindex, const std::shared_ptr<const CBlock> &block) override
     {
         if (m_on_call) m_on_call();
     }
     static void Call()
     {
-        std::shared_ptr<const CBlock> block = std::make_shared<CBlock>();
-        BlockValidationState state;
-        GetMainSignals().BlockChecked(block, state);
+        GetMainSignals().NewPoWValidBlock(nullptr, std::make_shared<CBlock>());
     }
     std::function<void()> m_on_call;
     std::function<void()> m_on_destroy;

dongcarl · 2020-05-18T23:50:26Z

Now p2p_unrequested_blocks.py is failing. git-bisect says:

1d9f66ea37ac6f22f26013cd57ed75fc04a9481f is the first bad commit

As noted in the Google Doc:

Up to this point, it seems that splitting this commit into:

1. Changing the call semantics of ProcessNewBlock
2. Changing the return type to a std::future

might make it easier to review.

So I plan to split it up and track it down.

This prepares for making best-chain-activation and disk writes happen in a separate thread from the caller, even though all callsites currently block on the return value immediately.

… an immediate bool"

CNodeState was added for validation-state-tracking, and thus, logically, was protected by cs_main. However, as it has grown to include non-validation state (taking state from CNode), and as we've reduced cs_main usage for other unrelated things, CNodeState is left with lots of cs_main locking in net_processing. In order to ease transition to something new, this adds only a dummy CPeerState which is held as a reference for the duration of message processing. Note that moving things is somewhat tricky pre validation-thread as a consistent lockorder must be kept - we can't take a lock on the new cs_peerstate in anything that's called directly from validation.
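A rough sketch of the locking idea in this commit message, assuming two plain mutexes as stand-ins for `cs_main` and `cs_peerstate`. All names here are hypothetical and the state is reduced to a counter. The lock order shown (peer-state lock first, then the main lock) is an assumption drawn from the constraint that nothing called directly from validation may take the peer-state lock.

```cpp
#include <map>
#include <mutex>

std::mutex g_cs_main_sketch;      // stands in for cs_main
std::mutex g_cs_peerstate_sketch; // stands in for the new cs_peerstate

struct SketchPeerState { int blocks_in_flight = 0; };
std::map<int, SketchPeerState> g_peer_states; // guarded by g_cs_peerstate_sketch
int g_chain_height = 0;                       // guarded by g_cs_main_sketch

// Peer-only bookkeeping: no need to contend on the main lock.
int NoteBlockInFlight(int peer_id)
{
    std::lock_guard<std::mutex> peer_lock(g_cs_peerstate_sketch);
    return ++g_peer_states[peer_id].blocks_in_flight;
}

// When both locks are needed, always take them in the same order:
// peer-state lock first, then the main lock.
int BlocksInFlightIfSynced(int peer_id, int required_height)
{
    std::lock_guard<std::mutex> peer_lock(g_cs_peerstate_sketch);
    std::lock_guard<std::mutex> main_lock(g_cs_main_sketch);
    if (g_chain_height < required_height) return 0;
    return g_peer_states[peer_id].blocks_in_flight;
}
```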

dongcarl · 2020-05-19T19:37:54Z

Split up the first commit into:

cb17890 which changes the call semantics of ProcessNewBlock and deals with the fallout of that
7532cf6 which changes the return from a bool to a future of a bool and deals with the fallout of that
2dcdb53 which contains changes that are non-obvious to me to be correct
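For readers unfamiliar with the bool-to-future change, a minimal sketch follows. The function names are hypothetical and the validation work is reduced to a single flag; the real ProcessNewBlock signature is different. It shows a function that still does its work inline but hands the result back through a `std::future<bool>`, with a callsite that blocks on it immediately.

```cpp
#include <future>

// Hypothetical: the work still happens synchronously, but the result is
// delivered through a future so that it can later be produced elsewhere.
std::future<bool> ProcessNewBlockAsFuture(bool is_new_block)
{
    std::promise<bool> promise;
    std::future<bool> result = promise.get_future();
    promise.set_value(is_new_block); // value is already available on return
    return result;
}

// A callsite that keeps the old behavior by blocking on the result at once.
bool CallerBlocksImmediately(bool is_new_block)
{
    return ProcessNewBlockAsFuture(is_new_block).get();
}
```

Because callers only depend on the future, the producer side can later move to another thread without changing these callsites.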

I also botched the git-bisect before (didn't call make in a bash -c), new output:

$ git bisect run bash -c "make -j50 && python3 test/functional/p2p_unrequested_blocks.py"

...

7c77827558715180594f5881b6d02982504c4fad is the first bad commit
commit 7c77827558715180594f5881b6d02982504c4fad
Author: Matt Corallo <[email protected]>
Date:   Mon Jun 17 13:13:36 2019 -0400

    Move net_processing's ProcessNewBlock calls to resolve async.

    Essentially, our goal is to not process anything for the given peer
    until the block finishes processing (emulating the previous behavior)
    without actually blocking the ProcessMessages loops. Obviously, in
    most cases, we'll just go on to the next peer and immediately hit a
    cs_main lock, blocking us anyway, but this we can slowly improve
    that state over time by moving things from CNodeState to CPeerState.

 src/net_processing.cpp | 79 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 67 insertions(+), 12 deletions(-)
bisect run success

Essentially, our goal is to not process anything for the given peer until the block finishes processing (emulating the previous behavior) without actually blocking the ProcessMessages loops. Obviously, in most cases, we'll just go on to the next peer and immediately hit a cs_main lock, blocking us anyway, but we can slowly improve that state over time by moving things from CNodeState to CPeerState.
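One way to picture "skip this peer until its block is done, without blocking the loop" is a zero-timeout poll on the pending future. This is a sketch under assumptions: `SketchPeer` and `ProcessMessagesOnce` are hypothetical names, and the actual net_processing change may track the pending block differently.

```cpp
#include <chrono>
#include <future>

struct SketchPeer {
    std::future<bool> pending_block; // valid() only while a block is in flight
    int messages_processed = 0;
};

// Returns true if a message was processed for this peer, false if the peer
// was skipped because its block has not finished processing yet.
bool ProcessMessagesOnce(SketchPeer& peer)
{
    if (peer.pending_block.valid()) {
        // Zero-timeout poll: never blocks the message processing loop.
        if (peer.pending_block.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
            return false;
        }
        peer.pending_block.get(); // consume the result; valid() becomes false
    }
    ++peer.messages_processed;
    return true;
}
```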

Spawn a background thread at startup which validates each block as it comes in from ProcessNewBlock, taking advantage of the new std::future return value to keep tests simple (and the new net_processing handling of such values async already). This makes introducing subtle validationinterface deadlocks much harder as any locks held going into ProcessNewBlock do not interact with (in the form of lockorder restrictions) locks taken in validationinterface callbacks. Note that after this commit, feature_block and feature_assumevalid tests time out due to increased latency between block processing when those blocks do not represent a new best block. This will be resolved in the next commit.
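The background-thread design can be sketched as a single worker draining a queue of jobs, each paired with a promise. Everything here is a hypothetical simplification: `SketchValidationThread` is not a real class, blocks are reduced to a height, and "validation" is a comparison against the best height seen so far.

```cpp
#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>
#include <thread>
#include <utility>

class SketchValidationThread {
public:
    SketchValidationThread() : m_worker([this] { Run(); }) {}
    ~SketchValidationThread()
    {
        { std::lock_guard<std::mutex> lock(m_mutex); m_stop = true; }
        m_cv.notify_one();
        m_worker.join();
    }

    // Enqueue a block and return at once; the result arrives via the future.
    std::future<bool> ProcessNewBlock(int block_height)
    {
        std::promise<bool> promise;
        std::future<bool> result = promise.get_future();
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.emplace_back(block_height, std::move(promise));
        }
        m_cv.notify_one();
        return result;
    }

private:
    void Run()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while (true) {
            m_cv.wait(lock, [this] { return m_stop || !m_queue.empty(); });
            if (m_queue.empty()) return; // stop requested and queue drained
            auto job = std::move(m_queue.front());
            m_queue.pop_front();
            lock.unlock(); // the caller's locks never interact with this work
            const bool is_new_best = job.first > m_best_height;
            if (is_new_best) m_best_height = job.first;
            job.second.set_value(is_new_best);
            lock.lock();
        }
    }

    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::deque<std::pair<int, std::promise<bool>>> m_queue;
    bool m_stop = false;
    int m_best_height = 0; // touched only by the worker thread
    std::thread m_worker;  // declared last so the members above exist first
};
```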

This resolves the performance regression introduced in the previous commit by always waking the message processing thread after each block future resolves. Sadly, this is somewhat awkward - all other validationinterface callbacks represent an actual change to the global validation state, whereas this callback indicates only that a call which one validation "client" made has completed. After going back and forth for some time I didn't see a materially better way to resolve this issue, and luckily it's a rather simple change, but it's far from ideal. Note that because we absolutely do not want to ever block on a ProcessNewBlock-returned-future, the callback approach is critical.
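The wake-up mechanism can be pictured as a condition variable that the completion callback signals. This is a sketch only: `SketchMessageHandlerWaker` and its method names are hypothetical, and the real code wires the notification through the validation interface rather than a standalone struct.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

struct SketchMessageHandlerWaker {
    std::mutex mutex;
    std::condition_variable cv;
    bool wake = false;

    // Called when a block future resolves (hypothetical callback).
    void BlockProcessed()
    {
        { std::lock_guard<std::mutex> lock(mutex); wake = true; }
        cv.notify_one();
    }

    // Called by the message processing loop between passes. Returns true if
    // it was woken by a callback, false if the timeout simply expired.
    bool WaitForWork(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex);
        const bool woken = cv.wait_for(lock, timeout, [this] { return wake; });
        wake = false;
        return woken;
    }
};
```

Without the callback, the loop would only notice a finished block after its timeout expires, which matches the latency regression the commit message mentions.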

To keep the API the same (and for simplicity of clients, ie net_processing), this splits AcceptBlock into the do-I-want-this stage, the checking stage, and the writing stage. ProcessNewBlock calls the do-I-want-this and checking (ie malleability checking) stuff, and then dumps blocks that pass into the background thread. In the background, we re-test the do-I-want-this logic but skip the checking stuff, before writing the block to disk and activating the best chain.
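The three-stage split can be sketched as follows. All names (`SketchBlockStore`, `WantBlock`, `PassesChecks`, `ForegroundAccept`, `BackgroundWrite`) are hypothetical and the checks are reduced to flags; the sketch only shows why the do-I-want-this test runs twice while the malleability check runs once.

```cpp
#include <set>

struct SketchBlockStore { std::set<int> on_disk; };
struct SketchBlock { int id; bool requested; bool malleated; };

// Stage 1: do we want this block at all?
bool WantBlock(const SketchBlockStore& store, const SketchBlock& block)
{
    return block.requested && store.on_disk.count(block.id) == 0;
}

// Stage 2: the checking stage (stands in for malleability checks).
bool PassesChecks(const SketchBlock& block) { return !block.malleated; }

// Foreground (caller's thread): stages 1 and 2, then hand off.
bool ForegroundAccept(const SketchBlockStore& store, const SketchBlock& block)
{
    return WantBlock(store, block) && PassesChecks(block);
}

// Background thread: re-test stage 1 (another copy may have been written
// in the meantime), skip stage 2, then perform stage 3, the write.
bool BackgroundWrite(SketchBlockStore& store, const SketchBlock& block)
{
    if (!WantBlock(store, block)) return false;
    store.on_disk.insert(block.id);
    return true;
}
```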

As reject messages are required to go out in-order (ie before any further messages are processed), this sadly requires that we further delay re-enabling a peer after a block has been processed by waiting for current validationinterface callbacks to drain. This commit enables further reduction of cs_main in net_processing by allowing us to lock cs_peerstate before cs_main in BlockChecked (ie allows us to move things which are accessed in BlockChecked, including DoS state and rejects into CPeerState and out of CNodeState).

This technically resolves a race where entries are added to mapBlockSource before we know that they're non-malleated and then removed only after PNB returns, though in practice this wasn't an issue since all access to mapBlockSource already held cs_peerstate.

dongcarl · 2020-05-21T22:06:45Z

Got the unit tests and functional tests to pass!

ariard

Great to see progress on this. For now, just high-level remarks calling for better code comments. I will review John's PR again and then your linked notes to help make them easier to read. I think that's the right way to arrive at a clear picture of what the current validation model is, what these changes propose to do, and why. There is so much context that we can't expect all reviewers to dig through the PR history by themselves.

ariard · 2020-05-22T23:42:31Z

src/net_processing.cpp

+ * Maintain state about nodes, protected by our own lock. Historically we put all
+ * peer tracking state in CNodeState, however this results in significant cs_main
+ * contention. Thus, new state tracking should go here, and we should eventually
+ * move most (non-validation-specific) state here.


I think you should define the difference between validation state and non-validation state more clearly. Right now peer tracking state is spread across multiple classes, such as CNode with TxRelay, CNodeState, and now CPeerState. We should have a clear idea of what should go where, according to which thread uses it. You should also explain how CPeerState aims to reduce cs_main contention with regard to the new threading model.

ariard · 2020-05-22T23:46:53Z

src/net_processing.cpp

+    //! The hash of the block which is pending download.
+    uint256 pending_block_hash;
+    //! Once we've finished processing a block from this peer, we must still wait for
+    //! any related callbacks to fire (to ensure, specifically, that rejects go out


Does "rejects" here refer to reject messages? I don't think that makes sense anymore after #15437 and #17004. You could also lay out the expected callback sequence (as we do for TransactionRemovedFromMempool in src/validationinterface.h).

ariard · 2020-05-22T23:47:32Z

src/net_processing.cpp

 }

+/**
+ *  A block has been processed. Handle potential peer punishment and housekeeping.


Please define "housekeeping" more precisely.

ariard · 2020-05-22T23:48:14Z

src/net_processing.cpp

        } // Don't hold cs_main when we call into ProcessNewBlock
        if (fBlockRead) {
-            bool fNewBlock = false;
+            // BIP 152 permits peers to relay compact blocks after validating


I'm not sure that BIP152 low-bandwidth mode authorizes invalid header propagation; maybe be more precise here.

ariard · 2020-05-22T23:49:32Z

src/validation.cpp


-/** Store block on disk. If dbp is non-nullptr, the file is known to already reside on disk */
-bool CChainState::AcceptBlock(const std::shared_ptr<const CBlock>& pblock, BlockValidationState& state, const CChainParams& chainparams, CBlockIndex** ppindex, bool fRequested, const FlatFilePos* dbp, bool* fNewBlock)
+bool CChainState::ShouldMaybeWrite(CBlockIndex* pindex, bool fRequested)


Could you please add a comment on what the conditional write relies on?

ariard · 2020-05-22T23:50:28Z

src/validation.cpp

+
+        NotifyHeaderTip();
+
+        BlockValidationState state; // Only used to report errors, not invalidity - ignore it


You should point out where errors are distinguished from invalidity.
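For context on the error versus invalidity distinction, here is a simplified three-mode state object. `SketchValidationState` is a hypothetical stand-in; the real validation state class in Bitcoin Core may differ in names and details. "Invalid" means the data broke a rule, while "error" means something went wrong locally (for example a disk failure) and says nothing about the data.

```cpp
#include <string>
#include <utility>

class SketchValidationState {
    enum class Mode { Valid, Invalid, Error };
    Mode m_mode = Mode::Valid;
    std::string m_reason;

public:
    // The checked data violated a rule.
    bool Invalid(std::string reason)
    {
        if (m_mode != Mode::Error) m_mode = Mode::Invalid;
        m_reason = std::move(reason);
        return false;
    }
    // A local failure occurred; an error takes precedence over invalidity.
    bool Error(std::string reason)
    {
        m_mode = Mode::Error;
        m_reason = std::move(reason);
        return false;
    }
    bool IsValid() const { return m_mode == Mode::Valid; }
    bool IsInvalid() const { return m_mode == Mode::Invalid; }
    bool IsError() const { return m_mode == Mode::Error; }
    const std::string& Reason() const { return m_reason; }
};
```

A caller that only cares about local failures would check `IsError()` and ignore `IsInvalid()`, which is the kind of usage the quoted code comment hints at.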

dongcarl · 2021-04-20T15:45:21Z

Not planning to work on this anytime soon unfortunately.

jnewbery · 2021-04-20T16:30:40Z

🙁

Let me know if you pick this up again!

hebasto · 2023-05-17T08:07:13Z

From my research it follows that the "Move BlockChecked to a background thread" commit fixes a lock-order-inversion in the MessageHandler thread.

See #19303 (comment).

DrahtBot added Mining P2P RPC/REST/ZMQ Tests Validation labels May 12, 2020

TheBlueMatt and others added 7 commits May 14, 2020 09:02

[tests] Remove unnecessary cs_mains in denialofservice_tests

45bedbd

9fdf05d resolved some lock inversion warnings in denialofservice_tests, but left in a number of cs_main locks that are unnecessary (introducing lock inversion warnings in future changes).

[net processing] Deduplicate post-block-processing code

3446fd2

Co-authored-by: Carl Dong <[email protected]>

[validation] trivial: Rename state to dummy_state for clarity

c94ca98

This is a pure refactor commit. Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>

[test/rpc] Additional checks for dos_state validity

50048f4

Co-authored-by: John Newbery <[email protected]> Co-authored-by: Carl Dong <[email protected]>

ariard mentioned this pull request May 16, 2020

RFC: Introducing AltNet, a pluggable framework for alternative transports #18988

Closed

dongcarl force-pushed the 2020-05-async-pnb branch from 7efe6f9 to 178a758 Compare May 18, 2020 21:47

dongcarl and others added 4 commits May 19, 2020 14:45

Make ProcessNewBlock return fNewBlock instead of state validity

cb17890

Make ProcessNewBlock return a future instead of an immediate bool

7532cf6

This prepares for making best-chain-activation and disk writes happen in a separate thread from the caller, even though all callsites currently block on the return value immediately.

Non-obvious parts of "Make ProcessNewBlock return a future instead of…

2dcdb53

… an immediate bool"

dongcarl force-pushed the 2020-05-async-pnb branch from a2e7fd6 to 893c635 Compare May 19, 2020 19:29

TheBlueMatt added 6 commits May 21, 2020 18:02

validationinterface: Use NewPoWValidBlock as sync test instead.

d6b1729

dongcarl force-pushed the 2020-05-async-pnb branch from 893c635 to d6b1729 Compare May 21, 2020 22:05

ariard reviewed May 22, 2020

ariard mentioned this pull request May 26, 2020

Return BlockValidationState from ProcessNewBlock if CheckBlock/AcceptBlock fails #17479

Closed

dongcarl closed this Apr 20, 2021

fanquake added the Up for grabs label Apr 21, 2021

bitcoin locked as resolved and limited conversation to collaborators Aug 18, 2022

hebasto mentioned this pull request May 17, 2023

Replace all of the RecursiveMutex instances with the Mutex ones #19303

Open

35 tasks

bitcoin unlocked this conversation May 17, 2023

hebasto mentioned this pull request May 17, 2023

Avoid lock order inversion in Chainstate::ConnectTip function #27684

Closed

maflcko removed the Tests label Aug 8, 2023

bitcoin locked and limited conversation to collaborators Aug 7, 2024


9 participants
