
Conversation


@furszy furszy commented Oct 22, 2021

Long week.. a small summary of what was done here:

First, the last forking reason investigation
The issue was an out-of-sync vote count on proposals that were part of the last superblock budget payments, which ended up causing a rejection of the budget finalization with no follow-up re-synchronization of it.

This happened, once more, due to a chain of events in the current tier two sync model. This PR corrects and improves a good number of issues in the area, briefly described below, preventing the forking scenario from happening again during the next superblock.

  1. Budget finalizations, proposals and votes are only relayed once, at initial sync time. They are all sent together as part of the budget sync message response, and after sending it the remote peer invalidly blocks any subsequent budget sync request for the entire node life cycle (there is no cleanup function). So the remote peer will never respond to another budget sync request, no matter how many times the local node resets the process and retries it (a sketch of the fix direction follows this list).
    And, as the current initial synchronization receives the tier two data progressively, item by item, and as the node can discard budget data due to a lack of MN entries at the time the data is processed, a subsequent sync request can be needed after a certain amount of time.
    (Plus, the reason we never saw this behavior on testnet/regtest is that the check that blocks peer requests forever only applies on mainnet..).

  2. None of the tier two inv requests are removed from the waiting-for-response inv set and map when the item arrives.
    The set grows indefinitely for the entire node life cycle, and any subsequent inv request for a specific item is locally rejected (which is not uncommon in the current tier two sync model, as an item can be discarded due to a missing dependency; for example, a budget finalization that contains a proposal that hasn't arrived at the node yet. In this case, the node will reject the finalization, request the missing proposal, and then retry to obtain the finalization once the missing information needed to validate it properly has been received).

  3. The node only sends sync request messages to the same ordered list of nodes, up to a certain limit, even when those nodes may already have the request fulfilled and blocked (for a certain time) against follow-up responses.

  4. There is a race condition that skips the MN list sync request if single MN sync requests were sent and responded to before reaching the get-MN-list message dispatch function. For example, an arriving mnwinner that contains an unknown voter will trigger a single MN sync request; if that happens before the initial GETMNLIST sync request is reached (as both use the same message response), it can cause the sync process to skip the full MN list sync. The node then ends up syncing only via single-entry requests triggered by message processing rejections, and hardly ever obtains the full MN list.

  5. The node can mark peers as "request fulfilled" during the MN list sync process, and increase the attempt counter, even when it did not actually send the sync request message. This happens because the DsegUpdate function can return without doing its job.

  6. If the budget finalization fails for any sync issue after the initial sync time, then, aside from the fact that we cannot request a full sync because of issue (1), the node does not request a budget votes sync either, only a proposals sync.
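For illustration, a minimal sketch of the fix direction for issue (1), using hypothetical names rather than the actual PIVX code: instead of marking a peer's budget sync request as fulfilled forever, remember when the last full request was served and only reject repeats inside a fixed window.

```cpp
#include <chrono>
#include <cstdint>
#include <map>

using NodeId = int64_t;

// Hypothetical guard: tracks the last time each peer asked for a full budget
// sync and rejects repeated requests only inside a one-hour window, instead
// of blocking them for the entire node life cycle.
class BudgetSyncGuard
{
public:
    static constexpr std::chrono::seconds BLOCK_WINDOW{60 * 60};

    // Returns true if the peer is allowed to request a full budget sync now.
    bool AllowRequest(NodeId peer, std::chrono::steady_clock::time_point now)
    {
        const auto it = m_lastRequest.find(peer);
        if (it != m_lastRequest.end() && now - it->second < BLOCK_WINDOW) {
            return false; // still inside the blocked period, do not serve again
        }
        m_lastRequest[peer] = now; // record the served request time
        return true;
    }

private:
    std::map<NodeId, std::chrono::steady_clock::time_point> m_lastRequest;
};
```

With something along these lines, a peer that resets its sync state can obtain a fresh budget sync response once the window expires, instead of being ignored forever.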

Aside from the corrections and improvements for the topics above, this PR also adds:

  1. A budget data re-synchronization test case to exercise the proper sync behavior after the modifications (plus a hidden RPC command to manually clean the budget data).

  2. An increased maximum mnlist sync threshold and budget "good" timeout.

  3. Documentation for the previously completely undocumented masternode-sync process.

And finally, as this work has to be released prior to the next superblock to prevent another bad forking scenario, the peer protocol version has been bumped to help with the release deployment and the network monitoring work during the superblock.

@furszy furszy self-assigned this Oct 22, 2021
@furszy furszy changed the title [TierTwo] Solve tier two network synchronization issues [WIP][TierTwo] Solve tier two network synchronization issues Oct 22, 2021
@furszy furszy force-pushed the 2021_mnsync_corrections branch from a122b79 to d3d45c6 Compare October 23, 2021 15:08
@furszy furszy changed the title [WIP][TierTwo] Solve tier two network synchronization issues [TierTwo] Solve tier two network synchronization issues Oct 23, 2021
furszy added 10 commits October 23, 2021 15:31
…sn't done.

Plus, decrease the possible mn list sync request time to 1 hour instead of 3.
…m different peers in case of needing it.

In the previous flow, if one of the requests failed due to an already-asked sync request (blocked period) or any other issue, the sync requests to the peers that came after it in the ordered list were never made.
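A minimal sketch of the randomized request order idea (hypothetical helper, not the actual code): shuffling the candidate peers means a blocked or failed request against one peer no longer starves the peers that used to come after it in the fixed order.

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

using NodeId = int64_t;

// Hypothetical helper: return the peers to ask in a random order so that
// repeated sync rounds spread the requests across different peers.
std::vector<NodeId> GetSyncRequestOrder(std::vector<NodeId> peers)
{
    static std::mt19937 rng{std::random_device{}()};
    std::shuffle(peers.begin(), peers.end(), rng);
    return peers;
}
```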
Before, once the first budget sync message was received, the complete budget sync requests were blocked for the entire peer's life cycle.
This modifies the behaviour to block it only for an hour after the budget sync request reception.
…st sync" DoS validation.

And decrease the DoS penalization to 20 for now (this will be removed entirely in v6.0).
@furszy furszy force-pushed the 2021_mnsync_corrections branch from d3d45c6 to 1fe0812 Compare October 23, 2021 18:51

furszy commented Oct 23, 2021

An update:
The GA job inconsistency on the new re-synchronization test case was great for discovering another issue in the tier two net inventory workflow (God bless testing.. many chained issues in this area..).

Briefly explained:
None of the tier two inv requests are removed from the waiting-for-response inv set and map when the item arrives. This is of course just bad: the set grows indefinitely for the entire node life cycle, and any subsequent inv request for a specific item is locally rejected (which is not uncommon in the current tier two sync model, as an item can be discarded due to a missing dependency; for example, a budget finalization that contains a proposal that hasn't arrived at the node yet. In this case, the node will reject the finalization, request the missing proposal, and then retry to obtain the finalization once the missing information needed to validate it properly has been received).
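A rough sketch of the cleanup described above, with hypothetical names: the point is simply that the per-item bookkeeping has to be erased once the requested data arrives.

```cpp
#include <cstdint>
#include <map>

using ItemHash = uint64_t; // stand-in for the real inventory hash type

// Hypothetical tracker for tier two items we have asked peers for.
class AskedForTracker
{
public:
    // Remember that we requested this item and when.
    void NoteRequested(const ItemHash& hash, int64_t nowSeconds) { m_askedFor[hash] = nowSeconds; }

    // Called from the message handler once the item has been received:
    // forget the pending request so the container does not grow forever
    // and later requests for other items are not locally rejected.
    void NoteReceived(const ItemHash& hash) { m_askedFor.erase(hash); }

    bool IsPending(const ItemHash& hash) const { return m_askedFor.count(hash) > 0; }

private:
    std::map<ItemHash, int64_t> m_askedFor; // hash -> time of last request
};
```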

Now GA passes consistently, issue solved.
Added this newly discovered problem to the PR description as well.

and.. will continue testing a bit more.. let's see what else I can find..

@furszy furszy changed the title [TierTwo] Solve tier two network synchronization issues [WIP][TierTwo] Solve tier two network synchronization issues Oct 23, 2021
@furszy furszy changed the title [WIP][TierTwo] Solve tier two network synchronization issues [TierTwo] Solve tier two network synchronization issues Oct 24, 2021
…hen the item is received.

The inv item is stored forever in the set, which grows indefinitely with all the asked-for INVs and blocks any possible subsequent request.
@furszy furszy force-pushed the 2021_mnsync_corrections branch 3 times, most recently from b9252cc to 55107dd Compare October 25, 2021 02:01

furszy commented Oct 25, 2021

Expanded the governance synchronization functional test coverage:

  1. Multi-proposal generation for higher network data movement.
  2. Validate proper node resync after being in a state with no available budget data.
  3. Validate proper node resync from scratch (chain + tier two data).

Ready to go now.

@furszy furszy added this to the 5.3.2.2 milestone Oct 26, 2021

@random-zebra random-zebra left a comment


Great set of changes here ☕ Tested ACK a7aef4e4d003c2347a9644da2fc5676495bafe45

furszy added 11 commits October 27, 2021 11:02
Performed after a chain + tier two data removal.
Added 15 more proposals to have higher network movement in the test.
…ode addr that causes a signature verification failure.
The mnb can easily be invalid, as it's arriving from the network.
Added the following changes:
1) Do not accept txs nor tier two INV messages during IBD (with only one exception).
   None of the received data can be validated without being synced with the chain.
2) Only accept sporks during IBD, they are always welcome.
…phase.

It simply makes no sense to relay mnb, votes, proposals, etc. to the peers from which we are synchronizing our initial state while we are doing it.
As they are not fully connected to us, if we send any INV msg they will raise our DoS score.
We already have it; no need to keep requesting it from anyone else.
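As an illustration of the two rules above (hypothetical names, not the actual handlers): only sporks are processed while the chain is still syncing, and tier two data is only relayed once the node's own initial synchronization is done.

```cpp
#include <string>

// Hypothetical summary of the node's synchronization state.
struct SyncState {
    bool inIBD{true};          // still doing the initial block download
    bool tierTwoSynced{false}; // masternode/budget initial sync finished
};

// Whether an incoming non-block inventory item should be processed at all.
bool ShouldProcessInv(const SyncState& state, const std::string& invType)
{
    if (state.inIBD) {
        return invType == "spork"; // sporks are always welcome during IBD
    }
    return true;
}

// Whether tier two data (mnb, votes, proposals, ...) should be relayed.
bool ShouldRelayTierTwo(const SyncState& state)
{
    return !state.inIBD && state.tierTwoSynced;
}
```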
@furszy furszy force-pushed the 2021_mnsync_corrections branch from a7aef4e to 95aa45d Compare October 27, 2021 14:03

furszy commented Oct 27, 2021

Feedback tackled.

@furszy furszy force-pushed the 2021_mnsync_corrections branch from 95aa45d to bf34585 Compare October 27, 2021 14:51

@random-zebra random-zebra left a comment


re-ACK bf34585


@Fuzzbawls Fuzzbawls left a comment


ACK bf34585

@furszy furszy merged commit 729d2f9 into PIVX-Project:master Oct 29, 2021
furszy referenced this pull request in furszy/bitcoin-core Oct 30, 2021
…chronization state.

If it happens before or while the node's tier two initial synchronization is being executed, the node will move to single-item requests instead of requesting the full list.

Plus, decrease the DoS score for a bad mnb signature during the initial synchronization process, so we don't end up banning pre-v5.3.3 nodes too quickly for an invalidly serialized mnb (this issue was solved in bitcoin#2611 by removing the mnb BIP155 flag; this is only an extra guard to give non-updated peers a bit more time to upgrade while the network is upgrading).
furszy referenced this pull request in furszy/bitcoin-core Oct 30, 2021
…chronization state.

If it happens before or while the node's tier two initial synchronization is being executed, the node will move to single-item requests instead of requesting the full list.

Plus, decrease the DoS score for a bad mnb signature during the initial synchronization process, so we don't end up banning pre-v5.3.3 nodes too quickly for an invalidly serialized mnb (this issue was solved in bitcoin#2611 by removing the mnb BIP155 flag; this is only an extra guard to give non-updated peers a bit more time to upgrade while the network is upgrading).
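A minimal sketch of the guard described in these referenced commits (hypothetical names): while the initial tier two sync is still running, reactive single-entry requests are skipped so they cannot race with the proactive full MN list request.

```cpp
// Hypothetical view of the tier two sync progress.
struct MnSyncStatus {
    bool initialSyncFinished{false};
};

// Called when a message references an unknown masternode (e.g. an mnwinner
// with an unknown voter). Returns true if a single-entry request may be sent;
// during the initial sync we rely on the full MN list request instead.
bool CanRequestSingleMnEntry(const MnSyncStatus& status)
{
    return status.initialSyncFinished;
}
```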
furszy added a commit that referenced this pull request Nov 1, 2021
…in initial sync state

860fa5a TierTwo: do not ask for single MN items if the node is in initial synchronization state. (furszy)

Pull request description:

  Solving a last edge-case race condition in the tier two initial sync process:
   if a `fbvote` or `fbs` message is received before or while we are synchronizing the MN list, there can be a race where the node moves to single-item requests instead of requesting the full list (which is not what we want in the initial sync process, as single-item request sync is a reactive approach while the full-list request is a proactive one).

  Plus, decreased the DoS score for a bad mnb signature during the initial synchronization process, so we don't end up banning pre-v5.3.3 nodes too quickly for an invalidly serialized mnb (this issue was solved in #2611 by removing the mnb BIP155 flag; this is only an extra guard to give non-updated peers a bit more time to upgrade while the network is upgrading).

ACKs for top commit:
  Fuzzbawls:
    ACK 860fa5a
  random-zebra:
    utACK 860fa5a

Tree-SHA512: b490adddd4cf229f0a20a59b4d8402dd87b9dcbba8cfa32994d2e9dfdecb3c120f6decca22d03d91ab6892c733b954c35e0a114d27784e15863b6c740353339b
furszy added a commit that referenced this pull request Nov 2, 2021
e43a239 TierTwo: do not ask for single MN items if the node is in initial synchronization state. (furszy)
1a3f7e0 [GUI] Force argument insertion when switching languages (Fuzzbawls)
9ea042c refactor: rename DsegUpdate to RequestMnList (furszy)
60767c0 TierTwo: when the INV is received, remove it from every peer AskFor set. (furszy)
6ece735 net: do not relay INV messages to not successfully connected peers. (furszy)
ff181f1 TierTwo: Only relay data if we aren't on the initial synchronization phase. (furszy)
bf5f78e net_processing: improve INV processing for tier two elements. (furszy)
a5ff7f8 Masternode: clean possible invalidly stored seen MN broadcasts at startup. (furszy)
e024421 Masternode: remove own mnb validation skip. (furszy)
3dec0cd Masternode-sync: Count local MN, in case of have it, for fallback sync issue. (furszy)
a729fc9 Masternode: pre-v6.0 guard against an invalidly stored BIP155 masternode addr that causes a signature verification failure. (furszy)
66568c6 Masternode: remove error-prone 'isBIP155Addr' flag, use directly the addr isCompat function. (furszy)
df1b3cc test, governance sync: increase tier two network gossip movement (furszy)
15f5a70 test: add tier two resync from scratch test case. (furszy)
ee7cc08 TierTwo sync: raise mn list sync time to 40 seconds. (furszy)
2240602 net bugfix: clear waiting for response INVs from the "askedFor" set when the item is received. (furszy)
5dd052c tier two: do not request single missing mn entries until the initial sync process is completed. (furszy)
156b6a9 MasternodeMan: remove mainnet only check for the "already asked mn list sync" DoS validation. (furszy)
be5bf14 Raise max mnlist sync threshold, increase budget sync finish timeout and document masternode-sync process. (furszy)
c1c1044 Bump protocol version to 70924. (furszy)
09c72c5 budget sync: if budget finalization fails, not only request missing proposals, request proposals' votes as well. (furszy)
0726c16 budget sync: do not block budget full sync requests forever. (furszy)
95bbbd5 mnsync: randomize the sync requests order so the node try to sync from different peers in case of needing it. (furszy)
bbb28fe mnsync: do not mark get_mn_list peer sync fulfilled if the request wasn't done. (furszy)
6d285b2 Test: add test case for budget data re-synchronization. (furszy)
2bde6fc [RPC] Add cleanbudget command. (furszy)

Pull request description:

  Backports #2611 for v5.3.3

  So we can do a release focused on solving the tier two network synchronization and re-synchronization issues that mainly affect the superblock payments window and cause network forks.

  Update:
  Added #2617 and #2622 as well.

ACKs for top commit:
  random-zebra:
    utACK e43a239 then
  Fuzzbawls:
    ACK e43a239

Tree-SHA512: ea7e1730caaefc48a49edd7610bd9874fdcbabd6365a818313f3b418eb77fd95b0bdeb32bacdfdeefee8dddb3edc422ebaa2e85a1c93df51ee66fc64f40a68bb
furszy added a commit that referenced this pull request Nov 19, 2021
…allet_nullifiers

0f79bdc [QA] Fix intermittent mempool sync failures in sapling_wallet_nullifiers (random-zebra)

Pull request description:

  Still (somewhat frequent) mempool sync failures in this test, even after #2633.

  This is because `node1` might still be in IBD when it receives the first tx INV message, and, with #2611, nodes no longer send (non-block) getdata requests during initial sync.

ACKs for top commit:
  furszy:
    great catch, ACK 0f79bdc.
  Fuzzbawls:
    ACK 0f79bdc

Tree-SHA512: a3f3b81848893803ba04e93d605d7d6accb2b2a28c62a8d7dbd09ae5513111c917c46f1e30eccacdf9b6ac9f1caf06ff4383fd0071d55133f49e8798e68933d4
Fuzzbawls added a commit that referenced this pull request Nov 23, 2021
…set.

9b5fd1c test: remove duplicated and non-existent CInv types. (furszy)
96c64ec test: add case for full askfor set. (furszy)
d495d27 Clean seen MN pings map right after the MN is removed. (furszy)

Pull request description:

  Follow-up work to #2611.
  Adding a regression test case for one of the issues solved in #2611:

  None of the mnp inv requests were being removed from the waiting-for-response inv set and map when the item arrived.
  The set was growing non-stop for the entire node life cycle, up to the point where no subsequent inv msgs were accepted anymore.

  After v6.0, as the mnp message will be removed, the test can be adapted to be based on quorum inv msgs.

ACKs for top commit:
  random-zebra:
    ACK 9b5fd1c
  Fuzzbawls:
    ACK 9b5fd1c

Tree-SHA512: 8b8a54edc30bba9b81cd375b478356c929e8a6868d09ce7af9dccfe7a1289436786a970dd55a05ea75ff7d022cbfee4a70706adf0cb10a9d72421a66202998ef
@furszy furszy deleted the 2021_mnsync_corrections branch November 29, 2022 15:43