[ETCM-103] Restartable state sync#730
Conversation
| # Current size of the ETC state trie is around 150M nodes, so 200M is set to have some reserve
| # If the number of elements inserted into the bloom filter were significantly higher than expected, then the number
| # of false positives would rise, which would degrade the performance of state sync
| state-sync-bloomFilter-size = 200000000
state-sync-bloom-filter-size? to be consistent
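The sizing trade-off described in the config comment can be illustrated with a hedged sketch. I'm assuming Guava's `BloomFilter` here, since its `put`/`mightContain` API matches the calls quoted later in this review; the concrete numbers are illustrative only:

```scala
import com.google.common.hash.{BloomFilter, Funnels}

// Create a filter sized for 1,000 expected insertions at a 3% target
// false-positive probability, then overfill it 5x. expectedFpp() then
// reports a rate well above the configured target, which is exactly the
// degradation the config comment warns about.
val filter = BloomFilter.create[Integer](Funnels.integerFunnel(), 1000, 0.03)
(1 to 5000).foreach(i => filter.put(i))
println(f"expected FPP after overfilling: ${filter.expectedFpp()}%.3f")
```

This is why the config leaves headroom: 200M capacity against ~150M actual trie nodes keeps the actual insertion count below the expected one.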
| # Max number of mpt nodes held in memory in state sync before saving them into the database
| # 100k is around 60MB (each key-value pair is around 600 bytes)
| state-sync-persistBatch-size = 100000
state-sync-persist-batch-size
| # If the new pivot block received from the network is lower than fast sync's current pivot block, the retry to choose a new
| # pivot will be scheduled after this time. Average block time in etc/eth is around 15s, so after this time most
| # network peers should have a new best block
| pivot-block-reSchedule-interval = 15.seconds
pivot-block-reschedule-interval
| def waitingForPivotBlockUpdate(updateReason: PivotBlockUpdateReason): Receive = handleCommonMessages orElse {
|   case PivotBlockSelector.Result(pivotBlockHeader) =>
|     log.info(s"New pivot block with number ${pivotBlockHeader.number} received")
|     if (pivotBlockHeader.number >= syncState.pivotBlock.number) {
It will be more readable if you use pattern matching instead of nested ifs
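For illustration, a minimal sketch of that suggestion: the nested if over the pivot number becomes a single pattern match with a guard. The types and string results below are hypothetical stand-ins for the real FastSync members, not the actual implementation:

```scala
// Hypothetical, simplified stand-ins for the real FastSync types
final case class BlockHeader(number: BigInt)
final case class SyncState(pivotBlock: BlockHeader)

// One pattern match with a guard instead of nested ifs, mirroring the
// branch structure of the quoted diff
def handlePivotResult(pivotBlockHeader: BlockHeader, syncState: SyncState): String =
  pivotBlockHeader match {
    case header if header.number >= syncState.pivotBlock.number =>
      "reschedule" // stands in for reScheduleAskForNewPivot(updateReason)
    case header =>
      s"update to ${header.number}" // stands in for updatePivotSyncState(...)
  }
```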
|       reScheduleAskForNewPivot(updateReason)
|     } else {
|       updatePivotSyncState(updateReason, pivotBlockHeader)
|       syncState = syncState.copy(updatingPivotBlock = false)
I think syncState = syncState.copy(updatingPivotBlock = false) should be done in updatePivotSyncState method
| reqType match {
|   case _: CodeRequest =>
|     blockchain.storeEvmCode(hash, data).commit()
|     bloomFilter.put(hash)
Very minor: You could call bloomFilter.put(hash) before the pattern matching
| // restart. This can be done by exposing a RocksDb iterator to traverse the whole mpt node storage.
| // Another possibility is that there is some lightweight alternative in rocksdb to check key existence
| state.memBatch.contains(req.nodeHash) || isInDatabase(req)
| if (state.memBatch.contains(req.nodeHash)) {
Simpler: state.memBatch.contains(req.nodeHash) || (bloomFilter.mightContain(req.nodeHash) && isInDatabase(req))
| }
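The suggested one-liner relies on short-circuit evaluation: the cheap bloom-filter probe rules out most absent keys, so the expensive database lookup only runs when the filter reports a possible match. A self-contained sketch, where all names are stand-ins for the real FastSync members, passed here as plain functions:

```scala
// Hypothetical stand-in for the real request type
final case class NodeRequest(nodeHash: String)

// Check the in-memory batch first, then the bloom filter, and only hit
// storage when the filter says "maybe present". Both || and && short-circuit,
// so isInDatabase is never called for batch hits or filter misses.
def isRequestedNodeKnown(
    memBatch: Set[String],
    bloomMightContain: String => Boolean,
    isInDatabase: String => Boolean,
    req: NodeRequest
): Boolean =
  memBatch.contains(req.nodeHash) ||
    (bloomMightContain(req.nodeHash) && isInDatabase(req.nodeHash))
```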
| private def updatePivotBlock(state: FinalBlockProcessingResult): Unit = {
| private def updatePivotBlock(state: PivotBlockUpdateReason): Unit = {
minor: state or reason then?
reason - forgot to change
| syncState =
|   syncState.updatePivotBlock(pivotBlockHeader, syncConfig.fastSyncBlockValidationX, updateFailures = true)
| case NodeRestart =>
Shouldn't it be named SyncRestart? If the fast-sync actor gets restarted due to some failure caught by the supervisor, it's going to be restarted with a clean state the same way as if the whole node had been restarted.
That also makes me think - shouldn't SyncController watch for FastSync restarts and start it again once that happens?
So I agree that SyncRestart is a more compelling name.
The question about supervision is more nuanced. In general I am not sure we handle it well across the whole codebase, but for this particular case we are fine: FastSync is a child of SyncController, and the default strategy for an uncaught exception in a child is to just restart it. It will probably mean that some of the requests in flight will later be ignored and some peers get unnecessarily blacklisted. Those missed requests may trigger some weird error condition. In my view this whole class was not designed with restarts in mind, but rather with handling all exceptions by itself.
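For illustration, a sketch of the supervision behaviour being discussed, using Akka classic actors. The classes and names here are hypothetical, not the real Mantis ones:

```scala
import akka.actor.{Actor, OneForOneStrategy, Props, SupervisorStrategy}

// A child actor standing in for FastSync: an uncaught exception escapes
// to the parent's supervisor strategy.
class FastSyncLike extends Actor {
  def receive: Receive = {
    case "fail" => throw new RuntimeException("uncaught failure")
    case _      => // replies to pre-restart requests arrive here and are dropped
  }
}

// A parent standing in for SyncController. This strategy is equivalent to
// Akka's default: restart the child on any Exception, giving it a clean
// state while in-flight messages addressed to the old incarnation are lost.
class SyncControllerLike extends Actor {
  override val supervisorStrategy: SupervisorStrategy =
    OneForOneStrategy()(SupervisorStrategy.defaultDecider)

  private val fastSync = context.actorOf(Props[FastSyncLike](), "fast-sync")

  def receive: Receive = { case msg => fastSync.forward(msg) }
}
```

The point raised above follows directly: since a restart silently resets the child's state, outstanding peer requests made before the crash will be answered to a fresh actor that never asked for them.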
| (info.maxBlockNumber - syncConfig.pivotBlockOffset) - state.pivotBlock.number >= syncConfig.maxPivotBlockAge
| }
| private def getPeerWithTooFreshNewBlock( |
is it really "too fresh", or rather "fresh enough to update to"?
| sealed abstract class FinalBlockProcessingResult
| sealed abstract class PivotBlockUpdateReason {
|   def nodeRestart: Boolean = this match {
kapke left a comment
The code looks good!
I'd like to see how it works though. How much time should I expect to wait on mainnet before state sync starts?
So currently, sync time on average looks like:
And state sync starts only after the blockchain is downloaded. There is also still one issue with the blockchain download which can make it get stuck, and a restart with some config tweak is needed to resume it; if it happens to you, let me know (we already have a ticket to track it, and I suspect what the issue is). State sync has higher variability in sync times, as it is more parallel and depends on the number of peers, which for now is fairly random due to the random-walk nature of our current discovery. On my machine I finished state sync in 6h when I got 9-10 peers, and 10h when I got 3-4 peers.
Description
Makes it possible to restart state sync if the pivot block goes stale during it.
Proposed Solution
The way it works is:
Possible improvements
The best possible improvement would be to have concurrent blockchain and state download; then this whole dance with updating the pivot would be unnecessary. The only thing state sync would need to do would be to track the best synced block and restart when it is larger by some margin than the current state sync pivot. This would require a small rehaul of the way we handle available peers in the upper layers of Mantis, to avoid simultaneous concurrent requests for state and blockchain data.
Bonus
The SyncControllerSpec tests have been refactored to use an auto-pilot.
Testing
I was able to sync to mainnet 4 times now with this setup (for now without node restarts during state sync, as proper restarting requires one more ticket, https://jira.iohk.io/browse/ETCM-213, i.e. refilling the bloom filter after a node restart. Without it, restarting should theoretically be possible, but it can be painfully slow due to the large number of false positives from a bloom filter that does not correspond to the database content).
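The refill step tracked by ETCM-213 could look roughly like this. This is a sketch under assumptions: Guava's `BloomFilter` (whose `put`/`mightContain` API matches the code quoted above) and a plain iterator of persisted keys standing in for the RocksDb iterator mentioned in the code comments:

```scala
import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

// After a restart the in-memory bloom filter is empty even though node
// storage is not, so every persisted node hash has to be replayed into a
// fresh filter before state sync can resume with accurate membership checks.
def refillBloomFilter(
    persistedKeys: Iterator[String], // stand-in for a RocksDb key iterator
    expectedInsertions: Long
): BloomFilter[CharSequence] = {
  val filter = BloomFilter.create[CharSequence](
    Funnels.stringFunnel(StandardCharsets.UTF_8),
    expectedInsertions
  )
  persistedKeys.foreach(filter.put) // replay every stored key
  filter
}
```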