Skip to content

Conversation

@taratorio
Copy link
Member

@taratorio taratorio commented Sep 19, 2025

On fusaka-devnet-5 we had roughly 10 out of our 20 nodes fail permanently with:

[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"

In short this happened in the following scenario (reproduced and captured in the accompanying test in this PR):

  • we were keeping up with chain tip 1 block at a time (check fcu0 logs below)
  • then the CL went into a period of inactivity (no requests were sent from it for few nearly 2 hours)
  • then the CL sent us a FCU - we exec-ed 357 blocks without re-orgs connecting to fcu0 (check fcu1 logs below)
  • then we got a subsequent FCU - we exec-ed 64 blocks without re-orgs connecting to fcu1 (check fcu2 logs below)
  • then we got a NewPayload on a side chain which required us to do 1 block unwind from fcu2 (check newPayload logs below)
  • we failed with a domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found error and got stuck forever

This should not happen given the MaxReorgDepth we support is 512 blocks. The reason for this bug is because we had a changesetSafeRange=32 which meant that we would not be persisting change sets for fcu1 and fcu2 at all when we should have. This PR fixes this by changing the logic to always persist change sets for the last 512 blocks only after we are out of our initial cycle mode. Note, currently we determine if we are in initialCycle sync mode if we are exec-ing batches of 5000 blocks (as per --sync.loop.block.limit).

Also note, that the most robust solution is to always persist change sets for the last 512 blocks of a batch of blocks, however that will make the initial sync to tip slower. So for now, we're only doing this for the batches after initial sync and rely on the fact that the first "chain tip" batch after the last "initial sync" batch will have a considerable buffer of blocks (in practice this holds true) and we will persist change sets for all the last 512 blocks of it and then going forward. If this proves out to be not stable enough we can change this in the future to always persist change sets for last 512 block of all batches and take the penalty during initial sync. Or we can think of another heuristic. For now I think this is good enough and improves the stability.

This should also close #17123 and #16427. Same issue seen in the series of issues related to #17025


Logs from fusaka-devnet-5

fcu0
blockNum=45454
blockHash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26
executedBlocks=1
unwindedBlocks=0

[DBUG] [09-17|13:42:59.828] UnwindTo                                 block=45453 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[DBUG] [09-17|13:43:00.252] [4/6 Execution] Starting Stage run 
...
[DBUG] [09-17|13:43:01.677] RPC Daemon notified of new headers       from=45453 to=45455 amount=1
[INFO] [09-17|13:43:01.678] head updated                             hash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26 number=45454 txnum=34718859 age=5m1s execution=337.801989ms mga
s/s=163.64 average mgas/s=117.23 commit=222.615055ms alloc=419.6MB sys=857.0MB

period of inactivity from the CL:

[WARN] [09-17|15:23:27.011] flag --externalcl was provided, but no CL requests to engine-api in 19m0s 

fcu1
blockNum=45811
blockHash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484
executedBlocks=357
unwindedBlocks=0

[DBUG] [09-17|15:25:05.619] UnwindTo                                 block=45454 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:25:11.314] [4/6 Execution] starting                 from=45454 to=45811 fromTxNum=34718859 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:26:51.307] RPC Daemon notified of new headers       from=45454 to=45812 amount=357
[INFO] [09-17|15:26:51.309] head updated                             hash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484 number=45811 txnum=35100175 age=16m51s commit=9.210621ms alloc=
689.2MB sys=1.8GB

fcu 2
blockNum=45875 blockHash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8
executedBlocks=64
unwindedBlocks=0

[DBUG] [09-17|15:27:19.636] UnwindTo                                 block=45811 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:27:20.667] [4/6 Execution] starting                 from=45811 to=45875 fromTxNum=35100175 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:27:35.338] RPC Daemon notified of new headers       from=45811 to=45876 amount=64
[INFO] [09-17|15:27:35.340] head updated                             hash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8 number=45875 txnum=35172190 age=23s commit=435.64468ms alloc=97
5.5MB sys=1.8GB

newPayload
blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b

[DBUG] [09-17|16:17:02.256] [txpool.fetch] Handling incoming message reqID=POOLED_TRANSACTIONS_66 err="rlp parse transaction: expected envelope in the payload, got 01"
[DBUG] [09-17|16:17:02.333] [NewPayload] processing new request      blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b parentHash=0xa3665f8f3917321f91e0022bf4c2fb5e5639e91cedc44e7896717b3722a9be66
[DBUG] [09-17|16:17:02.337] [NewPayload] sending block               height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[INFO] [09-17|16:17:02.337] [NewPayload] Handling new payload        height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[DBUG] [09-17|16:17:02.338] [NewPayload] New payload begin verification 
[DBUG] [09-17|16:17:02.341] UnwindTo                                 block=45874 err=nil stack="[sync.go:174 ethereum_execution.go:222 ethereum_execution.go:286 execution_client.go:64 chain_reader.go:343 engine_server.go:870 engine_server.go:363 engine_api_methods.go:147 value.go:584 value.go:368 service.go:227 handler.go:529 handler.go:479 handler.go:420 handler.go:240 handler.go:333 asm_amd64.s:1700]"
[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"
[WARN] [09-17|16:17:02.382] [rpc] served                             conn=[2400:6180:100:d0::b683:1008]:43710 method=engine_newPayloadV4 reqid=1286 err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"

for ; blockNum <= maxBlockNum; blockNum++ {
// set shouldGenerateChangesets=true if we are at last n blocks from maxBlockNum. this is as a safety net in chains
// where during initial sync we can expect bogus blocks to be imported.
if !shouldGenerateChangesets && shouldGenerateChangesetsForLastBlocks && blockNum > cfg.blockReader.FrozenBlocks() && blockNum+changesetSafeRange >= maxBlockNum {
Copy link
Member Author

@taratorio taratorio Sep 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Bor specific logic is no longer needed since Polygon now has single slot finality (in practice roughly 4 blocks finality) so it is safe for it to share the same flow for change sets as Ethereum. Also note this workaround was added pre-Astrid when we were still using the Bor heimdall stage. Astrid does "sync to tip" using finalised checkpoints and milestone so it guarantees to exec to the latest finalised block ~4 block behind the tip without any reorgs.

@taratorio
Copy link
Member Author

got 3 failures in hive tests - need to check those first before merging

@SyedMuhamadYasir
Copy link

hoping this version will fix the issue!

@taratorio
Copy link
Member Author

got 3 failures in hive tests - need to check those first before merging

It looks like this fix uncovered another off by 1 bug in CanUnwindToBlockNum which was caught by the 3 failing tests. Luckily it was an easy fix (fixed in commit 0e8b589).

@yperbasis tests are all passing now, so ready for review:
All tests with -race: https://github.com/erigontech/erigon/actions/runs/17895328104
Hive EEST: https://github.com/erigontech/erigon/actions/runs/17895332364
Hive: https: https://github.com/erigontech/erigon/actions/runs/17895336017

@yperbasis yperbasis added this to the 3.2.0 milestone Sep 22, 2025
@AskAlexSharov AskAlexSharov merged commit 585edfe into main Sep 22, 2025
22 checks passed
@AskAlexSharov AskAlexSharov deleted the improve-domain-get-diffset-issues branch September 22, 2025 08:16
NazariiDenha pushed a commit that referenced this pull request Oct 24, 2025
…al cycle (#17165)

On fusaka-devnet-5 we had roughly 10 out of our 20 nodes fail
permanently with:
```
[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"
```

In short this happened in the following scenario (reproduced and
captured in the accompanying test in this PR):
- we were keeping up with chain tip 1 block at a time (check `fcu0` logs
below)
- then the CL went into a period of inactivity (no requests were sent
from it for few nearly 2 hours)
- then the CL sent us a FCU - we exec-ed 357 blocks without re-orgs
connecting to `fcu0` (check `fcu1` logs below)
- then we got a subsequent FCU - we exec-ed 64 blocks without re-orgs
connecting to `fcu1` (check `fcu2` logs below)
- then we got a NewPayload on a side chain which required us to do 1
block unwind from `fcu2` (check `newPayload` logs below)
- we failed with a `domains.GetDiffset(45875,
0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not
found` error and got stuck forever

This should not happen given the `MaxReorgDepth` we support is `512`
blocks. The reason for this bug is because we had a
`changesetSafeRange=32` which meant that we would not be persisting
change sets for `fcu1` and `fcu2` at all when we should have. This PR
fixes this by changing the logic to always persist change sets for the
last `512` blocks *only after we are out of our initial cycle mode*.
Note, currently we determine if we are in `initialCycle` sync mode if we
are exec-ing batches of 5000 blocks (as per `--sync.loop.block.limit`).

Also note, that the most robust solution is to always persist change
sets for the last 512 blocks of a batch of blocks, however that will
make the initial sync to tip slower. So for now, we're only doing this
for the batches after initial sync and rely on the fact that the first
"chain tip" batch after the last "initial sync" batch will have a
considerable buffer of blocks (in practice this holds true) and we will
persist change sets for all the last 512 blocks of it and then going
forward. If this proves out to be not stable enough we can change this
in the future to always persist change sets for last 512 block of all
batches and take the penalty during initial sync. Or we can think of
another heuristic. For now I think this is good enough and improves the
stability.

This should also close #17123
and #16427. Same issue seen
in the series of issues related to
#17025

---

`Logs from fusaka-devnet-5`

`fcu0`
blockNum=45454

blockHash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26
executedBlocks=1
unwindedBlocks=0
```
[DBUG] [09-17|13:42:59.828] UnwindTo                                 block=45453 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[DBUG] [09-17|13:43:00.252] [4/6 Execution] Starting Stage run 
...
[DBUG] [09-17|13:43:01.677] RPC Daemon notified of new headers       from=45453 to=45455 amount=1
[INFO] [09-17|13:43:01.678] head updated                             hash=0x0c1b5e6036faef0278de666cccd5223c2d9008ee4c5b1b91151aae0260b05e26 number=45454 txnum=34718859 age=5m1s execution=337.801989ms mga
s/s=163.64 average mgas/s=117.23 commit=222.615055ms alloc=419.6MB sys=857.0MB
```

period of inactivity from the CL:
```
[WARN] [09-17|15:23:27.011] flag --externalcl was provided, but no CL requests to engine-api in 19m0s 
```

`fcu1`
blockNum=45811

blockHash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484
executedBlocks=357
unwindedBlocks=0
```
[DBUG] [09-17|15:25:05.619] UnwindTo                                 block=45454 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:25:11.314] [4/6 Execution] starting                 from=45454 to=45811 fromTxNum=34718859 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:26:51.307] RPC Daemon notified of new headers       from=45454 to=45812 amount=357
[INFO] [09-17|15:26:51.309] head updated                             hash=0x58f9d6530d296f65915d9f9a950c72a4935edfe4bb6bafe4b89af79d05f04484 number=45811 txnum=35100175 age=16m51s commit=9.210621ms alloc=
689.2MB sys=1.8GB
```

`fcu 2`
blockNum=45875
blockHash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8
executedBlocks=64
unwindedBlocks=0
```
[DBUG] [09-17|15:27:19.636] UnwindTo                                 block=45811 err=nil stack="[sync.go:174 forkchoice.go:341 asm_amd64.s:1700]"
...
[INFO] [09-17|15:27:20.667] [4/6 Execution] starting                 from=45811 to=45875 fromTxNum=35100175 offsetFromBlockBeginning=0 initialCycle=false useExternalTx=true inMem=false
...
[DBUG] [09-17|15:27:35.338] RPC Daemon notified of new headers       from=45811 to=45876 amount=64
[INFO] [09-17|15:27:35.340] head updated                             hash=0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8 number=45875 txnum=35172190 age=23s commit=435.64468ms alloc=97
5.5MB sys=1.8GB
```

`newPayload`
blockNum=45877
blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
```
[DBUG] [09-17|16:17:02.256] [txpool.fetch] Handling incoming message reqID=POOLED_TRANSACTIONS_66 err="rlp parse transaction: expected envelope in the payload, got 01"
[DBUG] [09-17|16:17:02.333] [NewPayload] processing new request      blockNum=45877 blockHash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b parentHash=0xa3665f8f3917321f91e0022bf4c2fb5e5639e91cedc44e7896717b3722a9be66
[DBUG] [09-17|16:17:02.337] [NewPayload] sending block               height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[INFO] [09-17|16:17:02.337] [NewPayload] Handling new payload        height=45877 hash=0x23b1ad9020d0b669fad939f40623b8516a8d24ff75e3c9fa29d730cbdb6e0d0b
[DBUG] [09-17|16:17:02.338] [NewPayload] New payload begin verification 
[DBUG] [09-17|16:17:02.341] UnwindTo                                 block=45874 err=nil stack="[sync.go:174 ethereum_execution.go:222 ethereum_execution.go:286 execution_client.go:64 chain_reader.go:343 engine_server.go:870 engine_server.go:363 engine_api_methods.go:147 value.go:584 value.go:368 service.go:227 handler.go:529 handler.go:479 handler.go:420 handler.go:240 handler.go:333 asm_amd64.s:1700]"
[INFO] [09-17|16:17:02.379] [4/6 Execution] Unwind Execution         from=45875 to=45874
[DBUG] [09-17|16:17:02.381] [NewPayload] New payload verification ended status=Success err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"
[WARN] [09-17|16:17:02.382] [rpc] served                             conn=[2400:6180:100:d0::b683:1008]:43710 method=engine_newPayloadV4 reqid=1286 err="[4/6 Execution] domains.GetDiffset(45875, 0xc9689e36d4f143b0602c76c7e2692c085cdfb323586c67a2db7028dc9457d6c8): not found"
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Erigon stops block production after 65/66 blocks

5 participants