Skip to content

Conversation

@MauroToscano
Copy link
Contributor

@MauroToscano MauroToscano commented Jul 11, 2025

Rework batcher concurrency

This PR overhauls the concurrency of the batcher, to optimize proofs processing and avoid race conditions

Description

Two locks are added, one for the batcher and one for the user states. The key things that are done are:

  • If both locks are taken at the same time, we (almost*) always take the user locks first and the batcher locks afterward as to avoid deadlocks

The biggest change is in the batch creation:

  • Batch creation for ethereum is split in 3 phases:
    • Phase 1: Extract the proofs from the queue that make a potential batch (Holds the batch lock)
    • Phase 2: Release the lock, and post in ethereum this extracted batch
    • If phase 2 succeds:
      • Update the user states (only proofs in batch and fees used are relevant here)
    • If phase 2 fails:
      • Re add the proofs to the main queue, following standard eviction rules

Notice we tolerate that the user state to be temporarily inconsistent with the queue until we confirm the block. This means that we are a bit stricter than we could with proofs the user send in parallel until we confirm the posting. (we assume the proofs is still in the queue and is not paid, so the user may need a bit more of spare balance and cannot use arbitrary fees, since we ask for the proofs in the batch to have the same or lower fee if have a bigger nonce)

For proof submission, the general idea is:

  • Take the user lock
  • Analyze if the message is valid
  • Take the batch lock and add it to the queue
  • Update the user state and release the locks

The added complexity is handling the case of a full queue. Since we cannot be sure that we can take a user lock after the batch lock, what we do is:

  • Take the batch lock after the user lock and check if the queue is full
  • Going from the proofs with less fees to more fees, take the first proof whose user state is not locked (it's a for with a try_lock)
  • Compare fees against it

This mechanism avoids deadlocks, since it may happen that the "candidate" for eviction has it's state locked.

As a downside, it may be imprecise, since the user may need to bid more than N proofs, but it's not critical.

Another approach would be to briefly take the batch lock, peek to see if the queue is full, drop it and try to get the lock of the user with the least amount of fees. But this may lead to some edge cases that are harder to handle.

*The only exception to this rule is when we got a failure on sending a batch, in this scenario, to recover, we need to lock all the users states since the queue is finite and we may need to evict them and update their nonces. We don't actually take all the locks, and we may only need a couple. But to avoid a deadlock, we added a flag to avoid processing more users when a recovery is in progress which works in a similar manner. In the rare event that the lock we need is taken, the user task will timeout in 15s and release it for the restoration task

Type of change

Please delete options that are not relevant.

  • Bug fix
  • Refactor

@MauroToscano MauroToscano marked this pull request as ready for review July 16, 2025 20:37
@MauroToscano MauroToscano changed the base branch from testnet to staging July 22, 2025 17:14
ciborium = "=0.2.2"
priority-queue = "2.1.0"
reqwest = { version = "0.12", features = ["json"] }
dashmap = "6.0.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove dependency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@MarcosNicolau MarcosNicolau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, the general code is much easier to follow through.

@MauroToscano MauroToscano enabled auto-merge August 22, 2025 20:27
@MauroToscano MauroToscano added this pull request to the merge queue Aug 22, 2025
Merged via the queue into staging with commit 15c8d13 Aug 22, 2025
3 checks passed
@MauroToscano MauroToscano deleted the rework_batcher_concurrency branch August 22, 2025 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants