Skip to content

Conversation

@jimmygchen
Copy link
Member

@jimmygchen jimmygchen commented Aug 25, 2025

Issue Addressed

Closes #6439, but also covers additional race conditions described below

Scenario:

  1. Block becomes available via gossip, beacon chain starts block import
  2. RPC columns batch arrive and hits process_rpc_custody_columns
    • Fork choice check passes as the block is not yet in fork choice
    • RPC columns get verified
  3. Block import completes - Block imported to fork choice AND removed from DA checker here
  4. RPC data columns completes verification and gets imported into DA checker

This is not great because we may end up having some data columns in DA checker (as RPC columns may be from multiple requests & peers) and may not be able to fully serve peers' requests.

Proposed Changes

We had also considered Lion's proposal in this comment, it is clean but doesn't handle one specific case - when the PendingComponent gets removed after block import, if any RPC columns have already made it pass the fork choice check, this race condition may still be possible in a tiny window.

This PR proposes to use a LRU cache of a small size to track recently imported block roots, and performs check before processing block / blobs / columns.

NOTE: this hasn't been tested at all and we should add some tests for the AlreadyAvailable code path.

Add available blocks cache to prevent blocks/columns from persisting after blocks become available.
@jimmygchen jimmygchen requested a review from jxs as a code owner August 25, 2025 09:20
@jimmygchen jimmygchen requested a review from pawanjay176 August 25, 2025 09:21
@jimmygchen jimmygchen added work-in-progress PR is a work-in-progress v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky and removed syncing labels Aug 25, 2025
result = "custody columns not imported",
block_hash = %hash,
"Block components already available"
);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about the case where block is available but is awaiting import / recompute head. This could be a while and we may not want to send BlockComponentProcessed event until then

@jimmygchen
Copy link
Member Author

I'm going to close this as this isn't a great solution for the reason mentioned in the above comment. Will create a new PR.

@jimmygchen
Copy link
Member Author

jimmygchen commented Aug 27, 2025

@pawanjay176

I've done another implementation keeping the blocks in the DA checker until it gets pruned, but i don't think it solves the problem mentioned above - we still need to handle the AlreadyAvailable case on caller site. It would also require rework of the PendingComponents span , as the pending components now stays in DA checker for ~64 blocks.

AvailabilityProcessingStatus::AlreadyAvailable is not the same as AvailabilityProcessingStatus::Imported or BlockError::DuplicateFullyImported, because there's no guarantee that the block has completed import and the caller may not be ready to perform the next operation e.g.

  • Complete the current lookup and process its child
  • Recompute head

Without considering a redesign or significant refactor at this point, I see two potential options:

  1. Allow block to be re-imported if the race occurs. Block re-import is idempotent and it's already happening anyway - but i think this PR can still fix two things:
    1. Prevent already imported components getting re-inserted into DA checker, potentially causing LH to serve partial data (the main bug we're addressing here)
    2. Prevent making a block available twice in the DA checker, this will always result in state cache miss and state reconstruction here. This actually caused the availability cache write lock to beheld for longer, and make it even more likely to trigger this race.
  2. Alternatively, we could store a promise-like object in the available_blocks LRU cache. This is potentially a cleaner fix but with more complexity and more effort required to make it robust.

Given we're late in the testing phase, I'd prefer avoiding a redesign / major refactor now if we can, and deferring it after the fork would allow us to address this race condition cleanly without adding more complexity now.

@jimmygchen
Copy link
Member Author

I made another attempt, i think it's now way simpler. (0 line of code added!) so I'm closing this one.

#7961

@jimmygchen jimmygchen closed this Aug 28, 2025
mergify bot pushed a commit that referenced this pull request Sep 2, 2025
…umns to be served over RPC (#7961)

Partially resolves #6439, an simpler alternative to #7931.

Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker:
1. Block becomes available via gossip
2. RPC columns arrive and pass fork choice check (block hasn't been imported)
3. Block import completes (removing block from DA checker)
4. RPC data columns finish verification and get imported into DA checker

This causes two issues:
1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood

### Proposed Changes

1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race condition
3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition

**Trade-offs:**

This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing).

The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).
jtraglia pushed a commit to jtraglia/lighthouse that referenced this pull request Sep 16, 2025
…umns to be served over RPC (sigp#7961)

Partially resolves sigp#6439, an simpler alternative to sigp#7931.

Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker:
1. Block becomes available via gossip
2. RPC columns arrive and pass fork choice check (block hasn't been imported)
3. Block import completes (removing block from DA checker)
4. RPC data columns finish verification and get imported into DA checker

This causes two issues:
1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood

### Proposed Changes

1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race condition
3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition

**Trade-offs:**

This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing).

The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky work-in-progress PR is a work-in-progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant