-
Notifications
You must be signed in to change notification settings - Fork 957
Fix data availability checker race condition causing partial data columns to be served over RPC #7961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
83de4dc to
b8a32de
Compare
b8a32de to
b801675
Compare
|
Some required checks have failed. Could you please take a look @jimmygchen? 🙏 |
|
cool that this will simplify concurrency management for the DA checker as a follow-up, it might be good to monitor memory usage in grafana after deploying to a devnet as to keep an eye on it |
beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
Outdated
Show resolved
Hide resolved
…ghthouse into fix-da-checker-race-take-3
| self.availability_cache | ||
| .remove_pending_components(block_root) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reconstruction error makes the assumption that if the block root doesnt exist, then block is already imported, which is no longer correct. Update error message.
lighthouse/beacon_node/beacon_chain/src/data_availability_checker.rs
Lines 590 to 595 in a134d43
| // Check indices from cache again to make sure we don't publish components we've already received. | |
| let Some(existing_column_indices) = self.cached_data_column_indexes(block_root) else { | |
| return Ok(DataColumnReconstructionResult::RecoveredColumnsNotImported( | |
| "block already imported", | |
| )); | |
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 4d8525a. Now returns an error as it's an unexpected scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also update this comment in this file:
/// Under normal conditions, the cache should only store the current pending block, but could
/// occasionally spike to 2-4 for various reasons e.g. components arriving late, but would very
/// rarely go above this, unless there are many concurrent forks.
Now we expect this cache to basically be full all the time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 4d8525a, and also reduced the cache size from 64 to 32.
michaelsproul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks safe to me, and I like the more "idempotent" design style of keeping components rather than removing them.
I still feel like our caches could do with an overhaul and simplification, but that can wait.
…umns to be served over RPC (sigp#7961) Partially resolves sigp#6439, an simpler alternative to sigp#7931. Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker: 1. Block becomes available via gossip 2. RPC columns arrive and pass fork choice check (block hasn't been imported) 3. Block import completes (removing block from DA checker) 4. RPC data columns finish verification and get imported into DA checker This causes two issues: 1. **Partial data serving**: Already imported components get re-inserted, potentially causing LH to serve incomplete data 2. **State cache misses**: Leads to state reconstruction, holding the availability cache write lock longer and increasing race likelihood ### Proposed Changes 1. Never manually remove pending components from DA checker. Components are only removed via LRU eviction as finality advances. This makes sure we don't run into the issue described above. 2. Use `get` instead of `pop` when recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race condition 3. Refactor DA checker to drop write lock as soon as components are added. This should also reduce the likelihood of the race condition **Trade-offs:** This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing). The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).
…r` (#8045) This PR consolidates the `reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons: - the `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to unbounded memory leak, which we have observed over the last 6 months on some nodes. - the `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer since (#7961), and recent blocks can be served from the DA checker. This PR also maintains the following functionalities - Serving pre-executed blocks over RPC, and they're now served from the `data_availability_checker` instead. - Using the cache for de-duplicating lookup requests. Co-Authored-By: Jimmy Chen <[email protected]> Co-Authored-By: Jimmy Chen <[email protected]>
Issue Addressed
Partially resolves #6439, an simpler alternative to #7931.
Race condition occurs when RPC data columns arrives after a block has been imported and removed from the DA checker:
This causes two issues:
Proposed Changes
getinstead ofpopwhen recovering the executed block, this prevents cache misses in race condition. This should reduce the likelihood of the race conditionTrade-offs:
This solution eliminates a few nasty race conditions while allowing simplicity, with the cost of allowing block re-import (already existing).
The increase in memory in DA checker can be partially offset by a reduction in block cache size if this really comes an issue (as we now serve recent blocks from DA checker).