[ETCM-105] State sync improvements #715
Conversation
Force-pushed 6fc329e to 542a06a
Force-pushed 1c56ac7 to 8090650
Force-pushed 8090650 to e487563
val etcPeerManager: ActorRef,
val syncConfig: SyncConfig,
implicit val scheduler: Scheduler)
val fastSyncStateStorage: FastSyncStateStorage,
context become syncingHandler.receive
if (syncState.isBlockchainWorkFinished && !syncState.stateSyncFinished) {
  // chain has already been downloaded we can start state sync
  syncingHandler.startStateSync(syncState.targetBlock)
Maybe some log here will be helpful?
if (blockchainDataToDownload)
  processDownloads()
else if (syncState.isBlockchainWorkFinished && !syncState.stateSyncFinished) {
  // TODO we are waiting for state sync to finish
When will this TODO be addressed?
I will add a ticket number there. My plan was to address this in ETCM-103, as there I will add monitoring for a stale target block, and will probably know more about how exactly this syncing loop should look in FastSync.
nextBlockToFullyValidate: BigInt = 1,
targetBlockUpdateFailures: Int = 0,
updatingTargetBlock: Boolean = false) {
targetBlock: BlockHeader,
 * If it would be valuable, it is possible to implement a processor which would gather statistics about duplicated or not-requested data.
 */
def processResponses(state: SchedulerState, responses: List[SyncResponse]): Either[CriticalError, SchedulerState] = {
  def go(currentState: SchedulerState, remaining: Seq[SyncResponse]): Either[CriticalError, SchedulerState] = {
requestType match {
  case SyncStateScheduler.StateNode =>
    import io.iohk.ethereum.network.p2p.messages.PV63.AccountImplicits._
    scala.util.Try(n.value.toArray[Byte].toAccount).toEither.left.map(_ => NotAccountLeafNode).map { account =>
We could add a decoder from RLPEncodable to Account and use n.parsedRlp instead of using Try
I will add a custom apply method for Account to do that, so as not to expose all these details here.
I had in mind that we could use the already parsed LeafNode to avoid unnecessary decoding from bytes
}

private def isRequestAlreadyKnown(state: SchedulerState, req: StateNodeRequest): Boolean = {
  if (state.memBatch.contains(req.nodeHash)) {
Simpler: private def isRequestAlreadyKnown(state: SchedulerState, req: StateNodeRequest): Boolean = state.memBatch.contains(req.nodeHash) || isInDatabase(req)
private val stateNodeRequestComparator = new Comparator[StateNodeRequest] {
  override def compare(o1: StateNodeRequest, o2: StateNodeRequest): Int = {
    if (o1.nodeDepth > o2.nodeDepth) {
Simpler: o2.nodeDepth compare o1.nodeDepth
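For illustration, a minimal sketch of the suggested simplification (StateNodeRequest is reduced to just a depth field here; only the comparator body mirrors the diff):

```scala
import java.util.Comparator

// Sketch only: the real StateNodeRequest has more fields than nodeDepth.
final case class StateNodeRequest(nodeDepth: Int)

// Delegating to Int's own `compare` with the operands swapped yields the
// same descending-by-depth ordering as the hand-written if/else branches.
val stateNodeRequestComparator: Comparator[StateNodeRequest] =
  new Comparator[StateNodeRequest] {
    override def compare(o1: StateNodeRequest, o2: StateNodeRequest): Int =
      o2.nodeDepth.compare(o1.nodeDepth)
  }
```

Deeper nodes sort first with this ordering, so requests for leaves come ahead of their parents.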
Add tailrec annotation
Simplify known node check
Fix formatting in all files
val (newRequests, newState) =
  currentState.assignTasksToPeers(
    NonEmptyList.fromListUnsafe(freePeers.toList),
    Some(newNodesToGet),
    syncConfig.nodesPerRequest
  )
log.info(
  "Creating {} new state node requests. Current request queue size is {}",
  newRequests.size,
  newState.nodesToGet.size
)
newRequests.foreach { request =>
  requestNodes(request)
}
context.become(downloading(scheduler, newState))
It could be extracted to a separate method. It is the same in both cases.
Ahh, you are right, we probably do not need the check for peers message at all.
if (nextRequested == receivedHash) {
  go(requestedRemaining.tail, receivedRemaining.tail, SyncResponse(receivedHash, nextReceived) :: processed)
} else {
  // hash of next element does not match; return what we have processed, and remaining hashes to get
} else {
  val (notReceived, received) = process(requestedHashes.toList, receivedMessage.values.toList)
  if (received.isEmpty) {
    val rescheduleRequestedHashes = notReceived.foldLeft(nodesToGet) { case (map, hash) =>
Simpler (?): nodesToGet ++ notReceived.map(_ -> None)
That is at least two traversals of the notReceived collection though: one to build the pairs, one for the addition to the map. I would leave it as it is.
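To make the trade-off concrete, here is a toy sketch (with a simplified map type; the real nodesToGet maps hashes to request state):

```scala
// Toy stand-ins for the real hash and request-state types.
val nodesToGet: Map[String, Option[Int]] = Map("a" -> Some(1))
val notReceived: List[String] = List("b", "c")

// Single pass over notReceived, inserting into the map as we go.
val viaFold: Map[String, Option[Int]] =
  notReceived.foldLeft(nodesToGet) { case (map, hash) => map + (hash -> None) }

// Shorter, but materialises the pair list first and then traverses it
// again while merging into the map.
val viaConcat: Map[String, Option[Int]] =
  nodesToGet ++ notReceived.map(_ -> None)
```

Both produce the same map; the difference is only in intermediate traversals and allocations.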
// so we can ignore those errors.
sync.processResponses(currentState, nodes) match {
  case Left(value) =>
    log.info(s"Critical error while state syncing ${value}, stopping state sync")
if (parentsToCheck.isEmpty) {
  (currentRequests, currentBatch)
} else {
  val parent = parentsToCheck.head
WDYT about adding some meaningful exception here? e.g. val parent = parentsToCheck.headOption.getOrElse(throw new IllegalStateException("Critical exception. Cannot find parent"))
Sounds like a good idea; it will be instantly known that some invariant has been broken.
ntallar
left a comment
Very minor comments, will continue reviewing tomorrow
# During fast-sync when most up to date block is determined from peers, the actual target block number
# will be decreased by this value
target-block-offset = 128
target-block-offset = 32
I assume this value was taken from geth, right?
It is part of my experiments with fast sync. It seems geth tries to keep an offset equal to 64 blocks, but their sync is much faster and they process a lot more nodes before updating to a new target.
128 is definitely too much, as a large part of the peers keep only 128 blocks of history, so it can happen that they won't have the target block root and our sync will not even start.
}

def getMissingHashes(max: Int): (List[ByteString], SchedulerState) = {
  def go(
/**
 * Default responses processor which ignores duplicated or not-requested hashes, but informs the caller about critical
 * errors.
 * If it would be valuable, it is possible to implement a processor which would gather statistics about duplicated or not-requested data.
Add statistics logging
Remove unnecessary CheckPeers messages from downloader
override def run(sender: ActorRef, msg: Any): AutoPilot = {
  msg match {
    case SendMessage(msg, peer) if msg.underlyingMsg.isInstanceOf[GetNodeData] =>
      val msgToGet = msg.underlyingMsg.asInstanceOf[GetNodeData]
val (scheduler, schedulerBlockchain, schedulerDb) = buildScheduler()
val header = Fixtures.Blocks.ValidBlock.header.copy(stateRoot = worldHash, number = 1)
schedulerBlockchain.storeBlockHeader(header).commit()
var state = scheduler.initState(worldHash).get
Maybe I am missing something, but this is not necessarily a fold, as we do not have any collection to summarise into one value, but rather need to keep processing until some condition holds.
Ouch, I missed the fact that the condition is checked against a different state on each iteration, which makes it much more elaborate to express purely than a while loop.
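The shape being discussed can be sketched as a tail-recursive helper; the types here are illustrative stand-ins, not the actual scheduler state:

```scala
import scala.annotation.tailrec

// Illustrative stand-in for the evolving scheduler state.
final case class LoopState(remaining: Int, processed: Int)

@tailrec
def run(state: LoopState): LoopState =
  if (state.remaining == 0) state // condition re-checked on each new state
  else run(LoopState(state.remaining - 1, state.processed + 1))
```

This is the pure equivalent of a `var state` plus `while` loop: each iteration threads the updated state into the next condition check.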
}
}

private def isRequestAlreadyKnown(state: SchedulerState, req: StateNodeRequest): Boolean = {
Isn't this more of an isResponseAlreadyKnown? To differentiate it from isRequestAlreadyKnownOrResolved.
case n: BranchNode =>
  Right(n.children.collect { case HashNode(childHash) =>
    StateNodeRequest(
      ByteString.fromArrayUnsafe(childHash),
I assume the ByteString.fromArrayUnsafe is here for performance reasons, right?
Yup, it can be used when we know that the provided array won't be mutated (as in this case).
Refactor FastSyncIt tests
Properly close actor system in StateSyncSpec
mmrozek
left a comment
Minor comments only. If it syncs with the mainnet it is ready to merge
}

private def isRequestedHashAlreadyCommitted(state: SchedulerState, req: StateNodeRequest): Boolean = {
  // TODO add bloom filter step before data base to speed things up. Bloomfilter will need to be reloaded after node
Minor: please add JIRA ticket to the comment
case object AlreadyProcessedItem extends NotCriticalError

final case class ProcessingStatistics(duplicatedHashes: Long, notRequestedHashes: Long, saved: Long) {
Do we want to expose ProcessingStatistics as a metric?
It is possible; we could even persist it to keep the stats between shutdowns. I will probably do this when sync is ready for its prime time (i.e. it works with mainnet).
with BeforeAndAfterAll
with ScalaCheckPropertyChecks {

override def afterAll(): Unit = {
Minor: You could use WithActorSystemShutDown trait
with Matchers
with BeforeAndAfterAll {

override def afterAll(): Unit = {
Task.raiseError(new TimeoutException("Task time out after all retries"))
}
}
it should "should update target block and sync this new target block state" in customTestCaseResourceM(
Minor: pivot instead of target
kapke
left a comment
Minor stuff only. LGTM if it syncs with mainnet (I can test Mordor over the weekend if you want to)
newNodes: Option[Seq[ByteString]],
nodesPerPeerCapacity: Int
): (Seq[PeerRequest], DownloaderState) = {
  def go(
import org.scalatest.matchers.must.Matchers
import org.scalatestplus.scalacheck.ScalaCheckPropertyChecks

class SyncSchedulerSpec extends AnyFlatSpec with Matchers with EitherValues with ScalaCheckPropertyChecks {
Minor - names are out of sync (SyncSchedulerSpec, SyncSchedulerState, SyncStateScheduler)
val goodResponse = peerRequest.nodes.toList.take(perPeerCapacity / 2).map(h => hashNodeMap(h))
val badResponse = (200 until 210).map(ByteString(_)).toList
val (result, newState2) = newState1.handleRequestSuccess(requests(0).peer, NodeData(goodResponse ++ badResponse))
assert(result.isInstanceOf[UsefulData])
@kapke It would be great if you could try a Mordor sync. As for mainnet, syncing with it will be possible after the next ticket in line.
Description
PR which does all the grunt work necessary for implementing fast sync restarting at an arbitrary new target block. Notable changes:
- State sync has been taken out of the FastSync actor and split into two separate components: StateSyncScheduler, which traverses the MPT trie in DFS fashion and creates requests for the currently missing nodes, and StateSyncDownloader, which retrieves those nodes from remote peers and provides them to the Scheduler.
- State sync progress is no longer persisted via syncStateStorageActor. That process was highly nondeterministic and could probably lose data (I did not have any test case for it, but it is easy to imagine a situation where the node is killed during node processing with incorrectly updated queues). With this PR, after a restart state sync starts from scratch, i.e. from the already known target block, but it does not request nodes which are already saved in the database.

The order of traversal here is important, as the fact that we have a node at level n implies that we have all subtries at deeper levels, so ultimately we need to traverse only the unknown paths from the root.

The whole solution is heavily influenced by the way Geth handles state sync in fast sync.

Future Tasks
- Enable StateSyncScheduler to request a restart, waiting for the downloader to finish or cancel its download tasks.
- Have the scheduler buffer MissingNodes messages and do the processing in the background. Then, when the first message with downloaderCapacity > 0 arrives, the scheduler would send missing nodes in between processing of MissingNodes messages.

TODO
StateSyncSchedulerto request restart, waiting for downloader to finish or its download tasks.MissingNodesmessages and do the processing in background. Then when the first messge withdownloaderCapacity> 0 arrives, scheduler would send missing nodes in between processing ofMissingNodesmessage.TODO