Fix race condition causing data columns getting inserted into DA checker after block is imported #7931
Conversation
Add an available blocks cache to prevent blocks/columns from persisting after blocks become available.
| result = "custody columns not imported", | ||
| block_hash = %hash, | ||
| "Block components already available" | ||
| ); |
I'm worried about the case where the block is available but is awaiting import / recompute head. This could take a while, and we may not want to send the `BlockComponentProcessed` event until then.
I'm going to close this as this isn't a great solution, for the reason mentioned in the above comment. Will create a new PR.
I've done another implementation keeping the blocks in the DA checker until they get pruned, but I don't think it solves the problem mentioned above - we still need to handle the
Without considering a redesign or significant refactor at this point, I see two potential options:
Given we're late in the testing phase, I'd prefer avoiding a redesign / major refactor now if we can, and deferring it until after the fork would allow us to address this race condition cleanly without adding more complexity now.
I made another attempt, and I think it's now way simpler (0 lines of code added!), so I'm closing this one.
…umns to be served over RPC (#7961)

Partially resolves #6439; a simpler alternative to #7931.

The race condition occurs when RPC data columns arrive after a block has been imported and removed from the DA checker:

1. Block becomes available via gossip.
2. RPC columns arrive and pass the fork choice check (the block hasn't been imported yet).
3. Block import completes, removing the block from the DA checker.
4. RPC data columns finish verification and get inserted into the DA checker.

This causes two issues:

1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data.
2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing the likelihood of the race.

### Proposed Changes

1. Never manually remove pending components from the DA checker. Components are only removed via LRU eviction as finality advances. This ensures we don't run into the issue described above.
2. Use `get` instead of `pop` when recovering the executed block, which prevents cache misses in the race condition. This should reduce the likelihood of the race condition.
3. Refactor the DA checker to drop the write lock as soon as components are added. This should also reduce the likelihood of the race condition.

**Trade-offs:** This solution eliminates a few nasty race conditions while staying simple, at the cost of allowing block re-import (which already exists). The increase in DA checker memory can be partially offset by a reduction in block cache size if this really becomes an issue (as we now serve recent blocks from the DA checker).
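To make point 2 (and the locking concern in point 3) concrete, here is a minimal sketch of the `get`-vs-`pop` distinction, assuming a simplified cache keyed by block root; `AvailabilityCache`, the `recover_executed_block_*` methods, and the types are hypothetical stand-ins, not the actual Lighthouse DA checker API:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-ins for the block root and cached block payload types.
type BlockRoot = [u8; 32];
type CachedBlock = Vec<u8>;

/// Hypothetical stand-in for the DA checker's availability cache.
struct AvailabilityCache {
    blocks: RwLock<HashMap<BlockRoot, CachedBlock>>,
}

impl AvailabilityCache {
    /// "Before": `pop`-style recovery removes the entry under the write lock,
    /// so a later lookup for the same root misses the cache and falls back to
    /// (expensive) state reconstruction.
    fn recover_executed_block_pop(&self, root: &BlockRoot) -> Option<CachedBlock> {
        self.blocks.write().unwrap().remove(root)
    }

    /// "After": `get`-style recovery clones the entry and leaves it in place
    /// under a read lock, so late-arriving RPC columns for the same root still
    /// hit the cache; eviction is deferred to LRU pruning as finality advances.
    fn recover_executed_block_get(&self, root: &BlockRoot) -> Option<CachedBlock> {
        self.blocks.read().unwrap().get(root).cloned()
    }
}
```

The `pop` variant both holds the write lock and removes the entry; the `get` variant only needs a read lock and keeps the entry around for any columns that arrive late.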
### Issue Addressed
Closes #6439, but also covers additional race conditions described below.
Scenario:
The block is imported and removed from the DA checker, then late RPC data columns are inserted back into the DA checker via `process_rpc_custody_columns`. This is not great because we may end up with some data columns in the DA checker (as RPC columns may come from multiple requests & peers) and may not be able to fully serve peers' requests.
### Proposed Changes
We had also considered Lion's proposal in this comment; it is clean but doesn't handle one specific case: when the `PendingComponent` gets removed after block import, if any RPC columns have already made it past the fork choice check, this race condition may still be possible in a tiny window.

This PR proposes using a small LRU cache to track recently imported block roots, and performing a check before processing blocks / blobs / columns (a minimal sketch follows the note below).
NOTE: this hasn't been tested at all and we should add some tests for the `AlreadyAvailable` code path.
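For illustration, here is a minimal sketch of the proposed recently-imported-roots check, assuming the `lru` crate; the struct, capacity, and method names below are hypothetical, not the actual implementation:

```rust
use lru::LruCache;
use std::num::NonZeroUsize;

/// Stand-in for Lighthouse's block root type.
type BlockRoot = [u8; 32];

/// Illustrative cache of recently imported block roots.
struct ImportedBlockRoots {
    roots: LruCache<BlockRoot, ()>,
}

impl ImportedBlockRoots {
    fn new() -> Self {
        // A small bound is enough: it only needs to cover the window between
        // block import and the processing of any late RPC columns.
        Self {
            roots: LruCache::new(NonZeroUsize::new(64).unwrap()),
        }
    }

    /// Record a root once its block has been imported into fork choice.
    fn on_block_imported(&mut self, root: BlockRoot) {
        self.roots.put(root, ());
    }

    /// Check performed before processing a block / blobs / data columns.
    fn is_already_available(&self, root: &BlockRoot) -> bool {
        self.roots.contains(root)
    }
}
```

A caller would record the root via `on_block_imported` once import completes, and check `is_already_available` before inserting components into the DA checker, short-circuiting with an `AlreadyAvailable`-style result on a hit instead of re-inserting components for an already imported block.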