Skip to content

Conversation

@taratorio
Copy link
Member

Test "Blob Transaction Ordering, Multiple Clients (Cancun) (erigon)" is flaky (example run - https://github.com/erigontech/erigon/actions/runs/16647657985/job/47112094838)

the test setup is:

  • 2 EL clients
  • some blob txns are sent to client A others to client B
  • after they are all sent, payloads are built by client A
  • expectation is that each block will have 6 blob txns included

we flake because:

  • we stop block building when the txpool dries out
  • in the test the time between FCU and GetPayload is 2 seconds
  • we do not wait the full 2 seconds
  • sometimes we get the blob txns in time, other times we dont (depends on txpool propagation delay which can vary depending on hardware/resource consumption on the machine, note we run with --sim.parallelism 8 which can also affect this)
  • if we keep on poking the txpool within the 2 seconds (driven by the CL) we will give more time for the txpool propagations to reach the builder

coincidentally, this may be one reason why we sometimes see Erigon produce emptier blocks on devnets as compared to other clients

@somnergy
Copy link
Member

somnergy commented Aug 1, 2025

But then it's gonna keep polling till the end of the slot if txpool is honestly empty. The execution should build a block ideally within the first 2 seconds or so.

Edit: After reading your desc. again, i think you don't mean to pass config.secondsPerSlot as the wait time?

@taratorio
Copy link
Member Author

taratorio commented Aug 4, 2025

But then it's gonna keep polling till the end of the slot if txpool is honestly empty. The execution should build a block ideally within the first 2 seconds or so.

Edit: After reading your desc. again, i think you don't mean to pass config.secondsPerSlot as the wait time?

The maxBuildTimeSecs is a max upper bound to guarantee goroutine termination in case GetPayload is never called by the CL.

For the 2 sec example - the CL will call GetPayload at 2secs and that will interrupt block building.

@somnergy
Copy link
Member

somnergy commented Aug 4, 2025

But then it's gonna keep polling till the end of the slot if txpool is honestly empty. The execution should build a block ideally within the first 2 seconds or so.
Edit: After reading your desc. again, i think you don't mean to pass config.secondsPerSlot as the wait time?

The maxBuildTimeSecs is a max upper bound to guarantee goroutine termination in case GetPayload is never called by the CL.

For the 2 sec example - the CL will call GetPayload at 2secs and that will interrupt block building.

  • It may be worth sparing 1 extra second (or half of it) towards the end, as it's unlikely that a block building request came in before the current slot, and block built in the last second is useless.
  • The timeout could coincide with the FCU timeout.
  • The timeout could also be counted for the seconds remaining in the slot: i.e. strictly not to exceed param.Timestamp + secondsPerSlot in NewBlockBuilder

@taratorio
Copy link
Member Author

But then it's gonna keep polling till the end of the slot if txpool is honestly empty. The execution should build a block ideally within the first 2 seconds or so.
Edit: After reading your desc. again, i think you don't mean to pass config.secondsPerSlot as the wait time?

The maxBuildTimeSecs is a max upper bound to guarantee goroutine termination in case GetPayload is never called by the CL.
For the 2 sec example - the CL will call GetPayload at 2secs and that will interrupt block building.

  • It may be worth sparing 1 extra second (or half of it) towards the end, as it's unlikely that a block building request came in before the current slot, and block built in the last second is useless.
  • The timeout could coincide with the FCU timeout.
  • The timeout could also be counted for the seconds remaining in the slot: i.e. strictly not to exceed param.Timestamp + secondsPerSlot in NewBlockBuilder

The only reason I've added an upper bound is to preserve compute resources and guarantee that the builder goroutines terminate in the case the txpool never dries out and GetPayload is never called. Note that GetPayload not getting called is a rare scenario, so treat this as protection for an extreme case.

Regarding your points:

  • that's unnecessary complication which brings very little value - 1 sec more or less doesn't matter much - GetPayload will be called at the right time in 99.999% of the time
  • block building is an async process that terminates when you call GetPayload - don't see how that can be connected with FCU timeout?
  • this is similar to the first point - yes, we can make the timeout a bit stricter/narrower but I see very little value in doing this as this is just a mechanism to put an upper bound for an extreme and highly unlikely case (since GetPayload will be called in 99.999% of times)

@somnergy
Copy link
Member

somnergy commented Aug 4, 2025

But then it's gonna keep polling till the end of the slot if txpool is honestly empty. The execution should build a block ideally within the first 2 seconds or so.
Edit: After reading your desc. again, i think you don't mean to pass config.secondsPerSlot as the wait time?

The maxBuildTimeSecs is a max upper bound to guarantee goroutine termination in case GetPayload is never called by the CL.
For the 2 sec example - the CL will call GetPayload at 2secs and that will interrupt block building.

  • It may be worth sparing 1 extra second (or half of it) towards the end, as it's unlikely that a block building request came in before the current slot, and block built in the last second is useless.
  • The timeout could coincide with the FCU timeout.
  • The timeout could also be counted for the seconds remaining in the slot: i.e. strictly not to exceed param.Timestamp + secondsPerSlot in NewBlockBuilder

The only reason I've added an upper bound is to preserve compute resources and guarantee that the builder goroutines terminate in the case the txpool never dries out and GetPayload is never called. Note that GetPayload not getting called is a rare scenario, so treat this as protection for an extreme case.

Regarding your points:

* that's unnecessary complication which brings very little value - 1 sec more or less doesn't matter much - GetPayload will be called at the right time in 99.999% of the time

* block building is an async process that terminates when you call GetPayload - don't see how that can be connected with FCU timeout?

* this is similar to the first point - yes, we can make the timeout a bit stricter/narrower but I see very little value in doing this as this is just a mechanism to put an upper bound for an extreme and highly unlikely case (since GetPayload will be called in 99.999% of times)

Well, first of all, 99.999% - that's not a number based on some data i can cross-check. So, I presume that's a crude measure of conviction for the argument. But I agree that in most cases getPayload should be called.
No, I think it's not going to bring a complication here to add a more precise timeout, since you are doing 50ms round trips, or 20 rounds per sec, and potentially 240 more rounds that previously were not there!
But, okay.

@taratorio taratorio merged commit 4e4f700 into main Aug 4, 2025
15 checks passed
@taratorio taratorio deleted the block-building-use-full-interrupt-time branch August 4, 2025 11:33
@taratorio
Copy link
Member Author

and potentially 240 more rounds that previously were not there!

they still won't be there - because GetPayload will be called

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants