Etcm 290 invalid branches built by fetcher by kapke · Pull Request #787 · input-output-hk/mantis

kapke · 2020-11-09T12:20:08Z

Description

One of reasons for forks on cluster deployments is the fact that BlockFetcher builds block sequences which are invalid chanins, making them effectively impossible to import and thus - only local miner is able to provide blocks which are valid chain extension.

Proposed Solution

Validate header chains as they are appended to queue, to ensure they are possible to be imported.

Important Changes Introduced

Thing that was out of scope, but seemed to be affecting whole syncing story a lot - when receiving new blocks remove check for fetching.

Testing

Already tested regular sync of whole chain on Mordor - no issues at all.
Spun up a local cluster with 5 miners couple of times:

got one miner fork around block 400, later OS killed 2 containers due to lack of resources (including the forked one), rest 3 continued to work without any problems or forks for over 4000 blocks
another time - local cluster with 5 miners - no issues at all for ~200 blocks

Update

After resolving issue with fetcher being stuck on headers from rejected fork:

local cluster of 5 miners - no issues at all for ~2500 blocks
sync to Mordor - caught top and stayed on it over a night without visible issues

kapke · 2020-11-12T14:26:00Z

@ntallar As you can see in the code, there is blacklisting disabled when receiving headers response. I think I would:

create a JIRA ticket to revise whole response validation story and its relation to blacklisting
make blacklisting tests ignored and marked with that ticket
remove blacklisting line

WDYT?

ntallar · 2020-11-12T15:20:48Z

@ntallar As you can see in the code, there is blacklisting disabled when receiving headers response. I think I would:

I agree with that plan! Maybe something like #771 but for the headers

kapke · 2020-11-12T16:12:40Z

It seems that current state of this branch introduces yet another reason for forks - a scenario where:

fetcher receives a branch
some of these blocks get rejected by rest of the network
fetcher gets stuck because it's unable to fetch a branch which follows chain in queue

I see 2 solutions to that problem, the question is which seems to be better:

just make sure that the chain is contiguous in terms of block numbers - ledger will import whatever possible and then force to fetcher to clear state up in case of no longer continued branch (at cost of blacklisting honest peer)
count number of rejected responses and if it passes some threshold - reset fetcher state (or even restart the actor) so that invalid state is being invalidated and the accepted fork should be automatically picked

I'm leaning to the latter as it possibly could be used also in other scenarios where fetcher gets stuck for some reason

ntallar · 2020-11-12T18:09:32Z

Wouldn't that scenario evolve into:

BlockImporter imports the blocks from fetcher so he removes it from there
Fetcher would then be able to fetch the blocks from the fork (though they would be unconcatenable)
When the blocks fail to be imported the InvalidateBlocksFrom message would kick in and result in the fetcher going back... eventually being able to solve the fork

The only problem I see is that blocks being mined in the middle are quite disruptive, as they invalidate the invalidation of the InvalidateBlocksFrom message. Maybe the fetcher should keep both

a last block: with the last block number imported
a block number to fetch

wdyt?

I wouldn't go for your alternatives, this code is already too complicated to add more workarounds, they seem too hacky imho:

I'm not sure which of the components from the ledger rely on the blocks passed being a branch, something could break in the middle if you break that assumption.
This is the safest of the two, though it only hides the problem as we are only restarting periodically

kapke · 2020-11-13T08:48:28Z

@ntallar Not really. In my case fetcher got stuck because of having a header from non-existing fork and being unable to make any progress.

As for approach #2 (cleaning up state/restarting) - indeed it doesn't solve the problem, but we still should be able to monitor such cases (log level error + monitoring) without leading to forking nodes. Please note that such approach to handling irrecoverable errors is at core of actors (which we use here) as well as is quite effective at keeping system operational.

ntallar · 2020-11-13T12:45:43Z

Please note that such approach to handling irrecoverable errors is at core of actors (which we use here) as well as is quite effective at keeping system operational.

I'm sorry but I'm still not convinced. It's definitely a good last resource in case we don't manage to fix an issue, but this hasn't yet become a blocker

@ntallar Not really. In my case fetcher got stuck because of having a header from non-existing fork and being unable to make any progress.

I see now the issue you mentioned! Shouldn't we do a better handling of our BlockBodies response to address it? A simple alternative is dropping headers from the waitingHeaders when a body of a header is requested but not received. Maybe it can be a next step to #771

mirkoAlic · 2020-11-17T18:11:49Z

+        PeersClient.Response(fakePeer, BlockBodies(alternativeSecondBlocksBatch.drop(6).map(_.body)))
+      )
+
+      importer.send(blockFetcher, PickBlocks(100))


PickBlocks(100) sounds like magic number

mirkoAlic · 2020-11-17T18:21:08Z

+        case PeersClient.Request(msg, _, _) if msg == firstGetBlockBodiesRequest => peersClient.lastSender
+      }
+
+      // Block 5 is mined (we could have reached this stage due to invalidation messages sent to the fetcher)


i think minedBlock.number == 16

…ding requests check when checking for top on fetch side

…ess response, improve peer message logging

…ected fork

… and current state of fetcher

mirkoAlic · 2020-11-18T17:44:34Z

-        waitingHeaders = waiting
-      )
+  /**
+    * Matches bodies with headers in queue and adding matched bodies to the blocks.


mirkoAlic

LGTM!

ntallar

LGTM! Love the comments ❤️

 .\\            //.
. \ \          / /.
.\  ,\     /` /,.-
 -.   \  /'/ /  .
 ` -   `-'  \  -
   '.       /.\`
      -    .-
      :`//.'
      .`.'
      .' BP

kapke requested review from mirkoAlic and ntallar November 9, 2020 12:40

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch from 30df020 to 6515c7f Compare November 9, 2020 12:41

ntallar suggested changes Nov 9, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala Outdated

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch from 83da7e5 to 0b54ea8 Compare November 10, 2020 12:09

ntallar suggested changes Nov 10, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcher.scala Outdated

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala

ntallar reviewed Nov 12, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala

ntallar reviewed Nov 12, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/network/PeerActor.scala Outdated

mirkoAlic reviewed Nov 12, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcher.scala

mirkoAlic reviewed Nov 12, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcher.scala Outdated

mirkoAlic reviewed Nov 12, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcher.scala Outdated

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch 3 times, most recently from 1cd357b to a75e7cd Compare November 16, 2020 10:39

kapke marked this pull request as ready for review November 16, 2020 10:40

kapke requested review from mirkoAlic and ntallar November 16, 2020 10:45

jmendiola222 added bug Something isn't working high priority This PRs should be reviewed and merged ASAP labels Nov 17, 2020

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch from a75e7cd to a2839bf Compare November 17, 2020 11:55

mirkoAlic reviewed Nov 17, 2020

View reviewed changes

Comment thread src/main/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherState.scala

ntallar suggested changes Nov 17, 2020

View reviewed changes

mirkoAlic reviewed Nov 17, 2020

View reviewed changes

Comment thread src/test/scala/io/iohk/ethereum/blockchain/sync/regular/BlockFetcherSpec.scala

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch 4 times, most recently from 2014d4c to 1d585e6 Compare November 18, 2020 13:53

Nicolas Tallar and others added 8 commits November 18, 2020 15:16

[ETCM-290] First failing test

08f3cae

[ETCM-290][WIP] Fix for invalid chains passed to importer

f0a4d41

[ETCM-290] Add chain validation on headers added to queue, remove pen…

ee624a0

…ding requests check when checking for top on fetch side

[ETCM-290] Fix test case exposing the issue

497ae12

[ETCM-290] Logs cleanup

f58f099

[ETCM-290] Temporarily remove peers blacklisting after receiving usel…

aac3181

…ess response, improve peer message logging

[ETCM-290] Fix problem with fetcher being stuck with headers from rej…

9efb85f

…ected fork

[ETCM-290] CR fixes: comments and explanations regarding made changes…

02a4b1d

… and current state of fetcher

kapke force-pushed the etcm-290-invalid-branches-built-by-fetcher branch from 1d585e6 to 02a4b1d Compare November 18, 2020 14:16

mirkoAlic mentioned this pull request Nov 18, 2020

[ETCM-283] Add block bodies validation in block fetcher #771

Merged

mirkoAlic reviewed Nov 18, 2020

View reviewed changes

mirkoAlic approved these changes Nov 18, 2020

View reviewed changes

Merge branch 'develop' into etcm-290-invalid-branches-built-by-fetcher

7ca3fce

ntallar approved these changes Nov 18, 2020

View reviewed changes

mirkoAlic merged commit 7a1323e into develop Nov 18, 2020

mirkoAlic deleted the etcm-290-invalid-branches-built-by-fetcher branch November 18, 2020 19:03

Conversation

kapke commented Nov 9, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Proposed Solution

Important Changes Introduced

Testing

Update

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kapke commented Nov 12, 2020

Uh oh!

Uh oh!

Uh oh!

ntallar commented Nov 12, 2020

Uh oh!

kapke commented Nov 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ntallar commented Nov 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kapke commented Nov 13, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ntallar commented Nov 13, 2020

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mirkoAlic Nov 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mirkoAlic Nov 17, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mirkoAlic Nov 18, 2020

Choose a reason for hiding this comment

Uh oh!

mirkoAlic left a comment

Choose a reason for hiding this comment

Uh oh!

ntallar left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

kapke commented Nov 9, 2020 •

edited

Loading

kapke commented Nov 12, 2020 •

edited

Loading

ntallar commented Nov 12, 2020 •

edited

Loading

kapke commented Nov 13, 2020 •

edited

Loading

mirkoAlic Nov 17, 2020 •

edited

Loading