execution: store changesets for last MaxReorgDepth blocks after initial cycle #17165

taratorio · 2025-09-19T14:06:55Z

On fusaka-devnet-5 we had roughly 10 out of our 20 nodes fail permanently with:

[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"

In short this happened in the following scenario (reproduced and captured in the accompanying test in this PR):

we were keeping up with chain tip 1 block at a time (check fcu0 logs below)
then the CL went into a period of inactivity (no requests were sent from it for few nearly 2 hours)
then the CL sent us a FCU - we exec-ed 357 blocks without re-orgs connecting to fcu0 (check fcu1 logs below)
then we got a subsequent FCU - we exec-ed 64 blocks without re-orgs connecting to fcu1 (check fcu2 logs below)
then we got a NewPayload on a side chain which required us to do 1 block unwind from fcu2 (check newPayload logs below)
we failed with a domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found error and got stuck forever

This should not happen given the MaxReorgDepth we support is 512 blocks. The reason for this bug is because we had a changesetSafeRange=32 which meant that we would not be persisting change sets for fcu1 and fcu2 at all when we should have. This PR fixes this by changing the logic to always persist change sets for the last 512 blocks only after we are out of our initial cycle mode. Note, currently we determine if we are in initialCycle sync mode if we are exec-ing batches of 5000 blocks (as per --sync.loop.block.limit).

Also note, that the most robust solution is to always persist change sets for the last 512 blocks of a batch of blocks, however that will make the initial sync to tip slower. So for now, we're only doing this for the batches after initial sync and rely on the fact that the first "chain tip" batch after the last "initial sync" batch will have a considerable buffer of blocks (in practice this holds true) and we will persist change sets for all the last 512 blocks of it and then going forward. If this proves out to be not stable enough we can change this in the future to always persist change sets for last 512 block of all batches and take the penalty during initial sync. Or we can think of another heuristic. For now I think this is good enough and improves the stability.

This should also close #17123 and #16427. Same issue seen in the series of issues related to #17025

Logs from fusaka-devnet-5

fcu0
blockNum=45454
blockHash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26
executedBlocks=1
unwindedBlocks=0

[DBUG] [09-17|13:42:59.828] UnwindTo                                 block=45453 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[DBUG] [09-17|13:43:00.252] [4/6 Execution] Starting Stage run 
...
[DBUG] [09-17|13:43:01.677] RPC Daemon notified of new headers       from=45453 to=45455 amount=1
[INFO] [09-17|13:43:01.678] head updated                             hash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26 number=45454 txnum=34718859 age=5m1s execution=337.801989ms mga
s/s=163.64 average mgas/s=117.23 commit=222.615055ms alloc=419.6MB sys=857.0MB

period of inactivity from the CL:

[WARN] [09-17|15:23:27.011] flag --externalcl was provided, but no CL requests to engine-api in 19m0s

fcu1
blockNum=45811
blockHash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484
executedBlocks=357
unwindedBlocks=0

[DBUG] [09-17|15:25:05.619] UnwindTo                                 block=45454 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:25:11.314] [4/6 Execution] starting                 from=45454 to=45811 fromTxNum=34718859 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:26:51.307] RPC Daemon notified of new headers       from=45454 to=45812 amount=357
[INFO] [09-17|15:26:51.309] head updated                             hash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484 number=45811 txnum=35100175 age=16m51s commit=9.210621ms alloc=
689.2MB sys=1.8GB

fcu 2
blockNum=45875 blockHash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8
executedBlocks=64
unwindedBlocks=0

[DBUG] [09-17|15:27:19.636] UnwindTo                                 block=45811 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:27:20.667] [4/6 Execution] starting                 from=45811 to=45875 fromTxNum=35100175 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:27:35.338] RPC Daemon notified of new headers       from=45811 to=45876 amount=64
[INFO] [09-17|15:27:35.340] head updated                             hash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8 number=45875 txnum=35172190 age=23s commit=435.64468ms alloc=97
5.5MB sys=1.8GB

newPayload
blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b

[DBUG] [09-17|16:17:02.256] [txpool.fetch] Handling incoming message reqID=POOLED_TRANSACTIONS_66 err="rlp parse transaction: expected envelope in the payload, got 01"
[DBUG] [09-17|16:17:02.333] [NewPayload] processing new request      blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b parentHash=0xa3665f8f3917321f91e0022bf4c2fb5e5639e91cedc44e7896717b3722a9be66
[DBUG] [09-17|16:17:02.337] [NewPayload] sending block               height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[INFO] [09-17|16:17:02.337] [NewPayload] Handling new payload        height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[DBUG] [09-17|16:17:02.338] [NewPayload] New payload begin verification 
[DBUG] [09-17|16:17:02.341] UnwindTo                                 block=45874 err=nil stack="[sync.go:174 ethereum_execution.go:222 ethereum_execution.go:286 execution_client.go:64 chain_reader.go:343 engine_server.go:870 engine_server.go:363 engine_api_methods.go:147 value.go:584 value.go:368 service.go:227 handler.go:529 handler.go:479 handler.go:420 handler.go:240 handler.go:333 asm_amd64.s:1700]"
[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"
[WARN] [09-17|16:17:02.382] [rpc] served                             conn=[2400:6180:100:d0::b683:1008]:43710 method=engine_newPayloadV4 reqid=1286 err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"

…al cycle

…in-get-diffset-issues

taratorio · 2025-09-19T14:11:04Z

execution/stagedsync/exec3.go

 	for ; blockNum <= maxBlockNum; blockNum++ {
-		// set shouldGenerateChangesets=true if we are at last n blocks from maxBlockNum. this is as a safety net in chains
-		// where during initial sync we can expect bogus blocks to be imported.
-		if !shouldGenerateChangesets && shouldGenerateChangesetsForLastBlocks && blockNum > cfg.blockReader.FrozenBlocks() && blockNum+changesetSafeRange >= maxBlockNum {


This Bor specific logic is no longer needed since Polygon now has single slot finality (in practice roughly 4 blocks finality) so it is safe for it to share the same flow for change sets as Ethereum. Also note this workaround was added pre-Astrid when we were still using the Bor heimdall stage. Astrid does "sync to tip" using finalised checkpoints and milestone so it guarantees to exec to the latest finalised block ~4 block behind the tip without any reorgs.

taratorio · 2025-09-19T14:28:17Z

All tests with -race: https://github.com/erigontech/erigon/actions/runs/17860720962
Hive EEST: https://github.com/erigontech/erigon/actions/runs/17860734470
Hive: https://github.com/erigontech/erigon/actions/runs/17862280667

…ain-get-diffset-issues

taratorio · 2025-09-19T17:00:56Z

got 3 failures in hive tests - need to check those first before merging

…able

…ain-get-diffset-issues

SyedMuhamadYasir · 2025-09-22T01:20:10Z

hoping this version will fix the issue!

taratorio · 2025-09-22T07:41:29Z

got 3 failures in hive tests - need to check those first before merging

It looks like this fix uncovered another off by 1 bug in CanUnwindToBlockNum which was caught by the 3 failing tests. Luckily it was an easy fix (fixed in commit 0e8b589).

@yperbasis tests are all passing now, so ready for review:
All tests with -race: https://github.com/erigontech/erigon/actions/runs/17895328104
Hive EEST: https://github.com/erigontech/erigon/actions/runs/17895332364
Hive: https: https://github.com/erigontech/erigon/actions/runs/17895336017

…al cycle (#17165) On fusaka-devnet-5 we had roughly 10 out of our 20 nodes fail permanently with: ``` [INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution from=45875 to=45874 [DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found" ``` In short this happened in the following scenario (reproduced and captured in the accompanying test in this PR): - we were keeping up with chain tip 1 block at a time (check `fcu0` logs below) - then the CL went into a period of inactivity (no requests were sent from it for few nearly 2 hours) - then the CL sent us a FCU - we exec-ed 357 blocks without re-orgs connecting to `fcu0` (check `fcu1` logs below) - then we got a subsequent FCU - we exec-ed 64 blocks without re-orgs connecting to `fcu1` (check `fcu2` logs below) - then we got a NewPayload on a side chain which required us to do 1 block unwind from `fcu2` (check `newPayload` logs below) - we failed with a `domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found` error and got stuck forever This should not happen given the `MaxReorgDepth` we support is `512` blocks. The reason for this bug is because we had a `changesetSafeRange=32` which meant that we would not be persisting change sets for `fcu1` and `fcu2` at all when we should have. This PR fixes this by changing the logic to always persist change sets for the last `512` blocks *only after we are out of our initial cycle mode*. Note, currently we determine if we are in `initialCycle` sync mode if we are exec-ing batches of 5000 blocks (as per `--sync.loop.block.limit`). Also note, that the most robust solution is to always persist change sets for the last 512 blocks of a batch of blocks, however that will make the initial sync to tip slower. So for now, we're only doing this for the batches after initial sync and rely on the fact that the first "chain tip" batch after the last "initial sync" batch will have a considerable buffer of blocks (in practice this holds true) and we will persist change sets for all the last 512 blocks of it and then going forward. If this proves out to be not stable enough we can change this in the future to always persist change sets for last 512 block of all batches and take the penalty during initial sync. Or we can think of another heuristic. For now I think this is good enough and improves the stability. This should also close #17123 and #16427. Same issue seen in the series of issues related to #17025 --- `Logs from fusaka-devnet-5` `fcu0` blockNum=45454 blockHash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26 executedBlocks=1 unwindedBlocks=0 ``` [DBUG] [09-17|13:42:59.828] UnwindTo block=45453 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]" ... [DBUG] [09-17|13:43:00.252] [4/6 Execution] Starting Stage run ... [DBUG] [09-17|13:43:01.677] RPC Daemon notified of new headers from=45453 to=45455 amount=1 [INFO] [09-17|13:43:01.678] head updated hash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26 number=45454 txnum=34718859 age=5m1s execution=337.801989ms mga s/s=163.64 average mgas/s=117.23 commit=222.615055ms alloc=419.6MB sys=857.0MB ``` period of inactivity from the CL: ``` [WARN] [09-17|15:23:27.011] flag --externalcl was provided, but no CL requests to engine-api in 19m0s ``` `fcu1` blockNum=45811 blockHash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484 executedBlocks=357 unwindedBlocks=0 ``` [DBUG] [09-17|15:25:05.619] UnwindTo block=45454 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]" ... [INFO] [09-17|15:25:11.314] [4/6 Execution] starting from=45454 to=45811 fromTxNum=34718859 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false ... [DBUG] [09-17|15:26:51.307] RPC Daemon notified of new headers from=45454 to=45812 amount=357 [INFO] [09-17|15:26:51.309] head updated hash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484 number=45811 txnum=35100175 age=16m51s commit=9.210621ms alloc= 689.2MB sys=1.8GB ``` `fcu 2` blockNum=45875 blockHash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8 executedBlocks=64 unwindedBlocks=0 ``` [DBUG] [09-17|15:27:19.636] UnwindTo block=45811 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]" ... [INFO] [09-17|15:27:20.667] [4/6 Execution] starting from=45811 to=45875 fromTxNum=35100175 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false ... [DBUG] [09-17|15:27:35.338] RPC Daemon notified of new headers from=45811 to=45876 amount=64 [INFO] [09-17|15:27:35.340] head updated hash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8 number=45875 txnum=35172190 age=23s commit=435.64468ms alloc=97 5.5MB sys=1.8GB ``` `newPayload` blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b ``` [DBUG] [09-17|16:17:02.256] [txpool.fetch] Handling incoming message reqID=POOLED_TRANSACTIONS_66 err="rlp parse transaction: expected envelope in the payload, got 01" [DBUG] [09-17|16:17:02.333] [NewPayload] processing new request blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b parentHash=0xa3665f8f3917321f91e0022bf4c2fb5e5639e91cedc44e7896717b3722a9be66 [DBUG] [09-17|16:17:02.337] [NewPayload] sending block height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b [INFO] [09-17|16:17:02.337] [NewPayload] Handling new payload height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b [DBUG] [09-17|16:17:02.338] [NewPayload] New payload begin verification [DBUG] [09-17|16:17:02.341] UnwindTo block=45874 err=nil stack="[sync.go:174 ethereum_execution.go:222 ethereum_execution.go:286 execution_client.go:64 chain_reader.go:343 engine_server.go:870 engine_server.go:363 engine_api_methods.go:147 value.go:584 value.go:368 service.go:227 handler.go:529 handler.go:479 handler.go:420 handler.go:240 handler.go:333 asm_amd64.s:1700]" [INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution from=45875 to=45874 [DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found" [WARN] [09-17|16:17:02.382] [rpc] served conn=[2400:6180:100:d0::b683:1008]:43710 method=engine_newPayloadV4 reqid=1286 err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found" ```

taratorio added 2 commits September 19, 2025 14:39

execution: store changesets for last MaxReorgDepth blocks after initi…

918a531

…al cycle

Merge branch 'main' of github.com:erigontech/erigon into improve-doma…

74c0356

…in-get-diffset-issues

taratorio commented Sep 19, 2025

View reviewed changes

taratorio requested review from mh0lt, shohamc1, somnergy and yperbasis September 19, 2025 14:29

Merge branch 'main' of github.com:ledgerwatch/erigon into improve-dom…

1486841

…ain-get-diffset-issues

taratorio added 2 commits September 21, 2025 17:52

execution: fix 3 failing hive tests due to off by 1 bug in min unwind…

0e8b589

…able

Merge branch 'main' of github.com:ledgerwatch/erigon into improve-dom…

3422861

…ain-get-diffset-issues

yperbasis requested review from AskAlexSharov and Giulio2002 September 22, 2025 07:54

AskAlexSharov approved these changes Sep 22, 2025

View reviewed changes

yperbasis added this to the 3.2.0 milestone Sep 22, 2025

yperbasis approved these changes Sep 22, 2025

View reviewed changes

AskAlexSharov merged commit 585edfe into main Sep 22, 2025
22 checks passed

AskAlexSharov deleted the improve-domain-get-diffset-issues branch September 22, 2025 08:16

yperbasis mentioned this pull request Sep 22, 2025

Erigon loops on “diffset not found” and flood logs #16427

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

execution: store changesets for last MaxReorgDepth blocks after initial cycle #17165

execution: store changesets for last MaxReorgDepth blocks after initial cycle #17165

Uh oh!

taratorio commented Sep 19, 2025 •

edited

Loading

Uh oh!

taratorio Sep 19, 2025 •

edited

Loading

Uh oh!

taratorio commented Sep 19, 2025 •

edited

Loading

Uh oh!

taratorio commented Sep 19, 2025

Uh oh!

SyedMuhamadYasir commented Sep 22, 2025

Uh oh!

taratorio commented Sep 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

execution: store changesets for last MaxReorgDepth blocks after initial cycle #17165

execution: store changesets for last MaxReorgDepth blocks after initial cycle #17165

Uh oh!

Conversation

taratorio commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taratorio Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

taratorio commented Sep 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

taratorio commented Sep 19, 2025

Uh oh!

SyedMuhamadYasir commented Sep 22, 2025

Uh oh!

taratorio commented Sep 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

taratorio commented Sep 19, 2025 •

edited

Loading

taratorio Sep 19, 2025 •

edited

Loading

taratorio commented Sep 19, 2025 •

edited

Loading