Conversation

@jimmygchen jimmygchen commented Aug 28, 2025

Issue Addressed

Partially resolves #6439; a simpler alternative to #7931.

The race condition occurs when RPC data columns arrive after a block has been imported and removed from the DA checker:

  1. Block becomes available via gossip
  2. RPC columns arrive and pass fork choice check (block hasn't been imported)
  3. Block import completes (removing block from DA checker)
  4. RPC data columns finish verification and get imported into DA checker

This causes two issues:

  1. Partial data serving: already-imported components get re-inserted, potentially causing LH to serve incomplete data
  2. State cache misses: these trigger state reconstruction, which holds the availability cache write lock for longer and increases the likelihood of the race (a minimal sketch of this interleaving follows below)
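
Below is a minimal, self-contained sketch of the check-then-act window described above. It is std-only and the type and method names (`DaChecker`, `block_is_pending`, etc.) are illustrative stand-ins, not Lighthouse's actual `DataAvailabilityChecker` API: because the fork choice check and the column insertion are separate steps, a block import can complete in between, and the late insertion re-creates a partial entry for an already-imported block.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative stand-in for the availability cache: block_root -> column indices.
struct DaChecker {
    pending: RwLock<HashMap<u64, Vec<u64>>>,
}

impl DaChecker {
    // Step 2: RPC columns pass the "block not yet imported" check.
    fn block_is_pending(&self, block_root: u64) -> bool {
        self.pending.read().unwrap().contains_key(&block_root)
    }

    // Step 3: block import removes the pending entry.
    fn remove_on_import(&self, block_root: u64) {
        self.pending.write().unwrap().remove(&block_root);
    }

    // Step 4: the verified columns are inserted afterwards, re-creating a
    // partial entry for a block that has already been imported.
    fn insert_columns(&self, block_root: u64, mut columns: Vec<u64>) {
        self.pending
            .write()
            .unwrap()
            .entry(block_root)
            .or_default()
            .append(&mut columns);
    }
}

fn main() {
    let da = DaChecker {
        pending: RwLock::new(HashMap::new()),
    };
    da.insert_columns(0xb10c, vec![0, 1]); // block and some columns seen via gossip
    assert!(da.block_is_pending(0xb10c)); // step 2: the check passes
    da.remove_on_import(0xb10c); // step 3: block import wins the race
    da.insert_columns(0xb10c, vec![2]); // step 4: a partial entry is re-inserted
    // The cache now holds only column 2 for an already-imported block,
    // which is the "partial data serving" issue described above.
    assert_eq!(da.pending.read().unwrap().get(&0xb10c), Some(&vec![2]));
}
```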

Proposed Changes

  1. Never manually remove pending components from the DA checker. Components are only removed via LRU eviction as finality advances. This ensures we don't run into the issue described above.
  2. Use `get` instead of `pop` when recovering the executed block, so the recovery path no longer causes cache misses during the race. This should reduce the likelihood of the race condition.
  3. Refactor the DA checker to drop the write lock as soon as components are added. This should also reduce the likelihood of the race condition (changes 2 and 3 are sketched below).
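
A hedged sketch of changes 2 and 3, again using std-only stand-ins (the struct and method names are illustrative, not the actual Lighthouse API): the write guard is scoped so it drops as soon as the components are inserted, and recovery uses a non-evicting `get` so a concurrent reader never sees a miss for a still-pending block.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Illustrative stand-in for the availability cache; not the Lighthouse API.
struct AvailabilityCache {
    components: RwLock<HashMap<u64, Vec<u64>>>,
}

impl AvailabilityCache {
    fn add_columns(&self, block_root: u64, mut columns: Vec<u64>) {
        {
            // Change 3: hold the write lock only for the insertion itself.
            let mut guard = self.components.write().unwrap();
            guard.entry(block_root).or_default().append(&mut columns);
        } // write lock dropped here
        // Any slower follow-up work (metrics, logging, notifying waiters, ...)
        // happens outside the critical section.
    }

    // Change 2: recover the entry with a read (`get`) rather than removing it
    // (`pop`), so the entry stays available to concurrent readers.
    fn get_columns(&self, block_root: u64) -> Option<Vec<u64>> {
        self.components.read().unwrap().get(&block_root).cloned()
    }
}

fn main() {
    let cache = AvailabilityCache {
        components: RwLock::new(HashMap::new()),
    };
    cache.add_columns(1, vec![0, 1]);
    cache.add_columns(1, vec![2]);
    // Reading the entry does not evict it; only LRU eviction as finality
    // advances (change 1, not shown here) removes it.
    assert_eq!(cache.get_columns(1), Some(vec![0, 1, 2]));
    assert_eq!(cache.get_columns(1), Some(vec![0, 1, 2]));
}
```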

Trade-offs:

This solution eliminates a few nasty race conditions while keeping the implementation simple, at the cost of allowing block re-imports (which could already happen).

The increase in memory usage in the DA checker can be partially offset by a reduction in block cache size if this really becomes an issue (as we now serve recent blocks from the DA checker).

@jimmygchen jimmygchen added ready-for-review The code is ready for review v8.0.0-rc.0 Q3 2025 release for Fusaka on Holesky labels Aug 28, 2025
@jimmygchen jimmygchen force-pushed the fix-da-checker-race-take-3 branch from 83de4dc to b8a32de Compare August 28, 2025 07:18
@jimmygchen jimmygchen force-pushed the fix-da-checker-race-take-3 branch from b8a32de to b801675 Compare August 28, 2025 07:44

mergify bot commented Aug 28, 2025

Some required checks have failed. Could you please take a look @jimmygchen? 🙏

@mergify mergify bot added waiting-on-author The reviewer has suggested changes and awaits their implementation. ready-for-review The code is ready for review and removed ready-for-review The code is ready for review waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Aug 28, 2025
@jimmygchen jimmygchen changed the title Another attempt to fix race condition with a simplified approach Another attempt to fix race condition with a simpler approach Aug 29, 2025
@shane-moore

cool that this will simplify concurrency management for the DA checker

as a follow-up, it might be good to monitor memory usage in grafana after deploying to a devnet, so we can keep an eye on it

@jimmygchen jimmygchen changed the title Another attempt to fix race condition with a simpler approach Fix data availability checker race condition causing partial data columns to be served over RPC Sep 1, 2025
@jimmygchen jimmygchen added the bug Something isn't working label Sep 1, 2025
        self.availability_cache
            .remove_pending_components(block_root)
    }

Member Author

This reconstruction error makes the assumption that if the block root doesn't exist, then the block is already imported, which is no longer correct. Update error message.

// Check indices from cache again to make sure we don't publish components we've already received.
let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else {
return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported(
"block already imported",
));
};

Member Author

Fixed in 4d8525a. Now returns an error as it's an unexpected scenario.

Member

I think we should also update this comment in this file:

/// Under normal conditions, the cache should only store the current pending block, but could
///  occasionally spike to 2-4 for various reasons e.g. components arriving late, but would very
/// rarely go above this, unless there are many concurrent forks.

Now we expect this cache to basically be full all the time

Member Author

Updated in 4d8525a, and also reduced the cache size from 64 to 32.

@michaelsproul michaelsproul left a comment

This looks safe to me, and I like the more "idempotent" design style of keeping components rather than removing them.

I still feel like our caches could do with an overhaul and simplification, but that can wait.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Sep 2, 2025
@mergify mergify bot merged commit eef02af into sigp:unstable Sep 2, 2025
37 checks passed
@jimmygchen jimmygchen deleted the fix-da-checker-race-take-3 branch September 2, 2025 21:28
jtraglia pushed a commit to jtraglia/lighthouse that referenced this pull request Sep 16, 2025
…umns to be served over RPC (sigp#7961)

mergify bot pushed a commit that referenced this pull request Sep 19, 2025
…r` (#8045)

This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with the `data_availability_checker` earlier, and leads to an unbounded memory leak, which we have observed over the last 6 months on some nodes.
- the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer (since #7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionality:
- Serving pre-executed blocks over RPC; they're now served from the `data_availability_checker` instead.
- Using the cache for de-duplicating lookup requests.

Co-Authored-By: Jimmy Chen <[email protected]>