Consolidate reqresp_pre_import_cache into data_availability_checker
#8045
Conversation
…e block around for longer and serving non-executed blocks isn't really necessary.
/// Remove any block components from the *processing cache* if we no longer require them. If the
/// block was imported full or erred, we no longer require them.
fn remove_notified(
I still need to handle the case where block import errored, and remove the block from the cache.
The damage of leaving the invalid block in the cache is that we might serve the block over RPC after knowing it's invalid. Consider the case where we remove the block from the da_checker after noticing it's invalid, and then someone triggers lookup sync and we fetch it again: we will serve the block over RPC while verification is happening for the second time.
I don't see that big of an issue with leaving the invalid block in the da_checker until LRU evicts it. Otherwise, we can leave it there but mark it as "INVALID"
Thinking out loud:
If the block is invalid, we will downscore the peer.
If we leave the invalid block in the cache:
- we might serve the invalid block over RPC (we already do this before execution); however, we are unlikely to get downscored because we will never attest to or build on top of this block.
- we avoid downloading the invalid block again, even if a peer sends us a child block / attestation.
If we remove the invalid block from the cache:
- we still serve the invalid block over RPC before execution, but NOT after it failed validation.
- a peer might get us to fetch it again and serve us the invalid block; we will downscore them after execution.
- it doesn't really gain us much, because we could still serve the invalid block over RPC before execution, and even if we remove it, it could come back again.
I don't see that big of an issue with leaving the invalid block in the da_checker until LRU evicts it.
Yes agree. I actually think leaving it in the cache to prevent the lookup is better.
Otherwise, we can leave it there but mark it as "INVALID"
A block that does not transition into an executed block will never become available, so I believe we don't really need to mark them.
Thanks for raising these good points. I'm now inclined to leave it as it is, since removing the block doesn't seem to help us much and comes at the cost of extra code and complexity.
Adding more thoughts after seeing the failed block_in_processing_cache_becomes_invalid test.
If the block initially fails execution for whatever reason (e.g. engine offline) and we keep it in the cache, then there's a chance the chain might get stuck, because we will never perform a lookup until the block is evicted from the cache. I think this is not great; I'll add code to remove it so we re-trigger a new lookup in this case.
If it is indeed an invalid block, the peer will get penalised for getting us to do the lookup.
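The eviction policy discussed above can be sketched as follows. This is a minimal illustration, not Lighthouse's actual API: `ExecutionOutcome` and `should_evict_on_execution_result` are hypothetical names standing in for the real error types.

```rust
// Hypothetical sketch of the policy above: any execution failure, whether the
// block is actually invalid OR the engine was simply unavailable, evicts the
// block from the cache so that lookup sync can re-download and re-process it.
#[derive(Debug, PartialEq)]
enum ExecutionOutcome {
    Valid,
    InvalidBlock,      // block failed EL validation
    EngineUnavailable, // e.g. EL offline or timed out
}

/// Evict on any non-valid outcome so a new lookup can be triggered later.
/// If the block really is invalid, the peer serving it during the re-lookup
/// gets penalised once execution fails again.
fn should_evict_on_execution_result(outcome: &ExecutionOutcome) -> bool {
    !matches!(outcome, ExecutionOutcome::Valid)
}

fn main() {
    assert!(!should_evict_on_execution_result(&ExecutionOutcome::Valid));
    assert!(should_evict_on_execution_result(&ExecutionOutcome::InvalidBlock));
    assert!(should_evict_on_execution_result(&ExecutionOutcome::EngineUnavailable));
    println!("eviction policy ok");
}
```

The key design point is that eviction does not try to distinguish "invalid" from "engine offline": both cases re-enable lookup, and peer scoring sorts out the difference on the second attempt.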
Removing failed execution block in 5f2e85c
we will never perform lookup until the block is evicted from the cache
Just to be clear, is this because we cannot re-trigger the execution of a block sitting in the da checker? Not fully clear on this.
Yes exactly - because lookup will not re-download the block if it's in the DA checker:
lighthouse/beacon_node/network/src/sync/network_context.rs
Lines 835 to 853 in ee734d1
match self.chain.get_block_process_status(&block_root) {
    // Unknown block, continue request to download
    BlockProcessStatus::Unknown => {}
    // Block is known are currently processing, expect a future event with the result of
    // processing.
    BlockProcessStatus::NotValidated { .. } => {
        // Lookup sync event safety: If the block is currently in the processing cache, we
        // are guaranteed to receive a `SyncMessage::GossipBlockProcessResult` that will
        // make progress on this lookup
        return Ok(LookupRequestResult::Pending("block in processing cache"));
    }
    // Block is fully validated. If it's not yet imported it's waiting for missing block
    // components. Consider this request completed and do nothing.
    BlockProcessStatus::ExecutionValidated { .. } => {
        return Ok(LookupRequestResult::NoRequestNeeded(
            "block execution validated",
        ));
    }
}
So if it fails the first time because of an unexpected error, e.g. an EL timeout, then it doesn't get re-downloaded until we fall back to range sync.
Some required checks have failed. Could you please take a look @jimmygchen? 🙏
dapplion
left a comment
Nice simplification! I can't find any issues with it after multiple passes.
self.data_availability_checker
    .put_pre_execution_block(block_root, unverified_block.block_cloned())?;
Noting that the block is unverified here, so it can be up to 10 MB. The current max size of the cache is 32 blocks, so there's no risk of OOM.
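A back-of-envelope check of the bound claimed above; the numbers (32 blocks, ~10 MB per unverified block) come from the comment, and the helper name is illustrative, not part of the codebase:

```rust
// Worst-case memory held by the pre-import cache: capacity times the maximum
// size of a single unverified block. Large, but bounded, so no OOM risk from
// this cache alone.
fn worst_case_bytes(max_cached_blocks: usize, max_block_bytes: usize) -> usize {
    max_cached_blocks * max_block_bytes
}

fn main() {
    // 32 blocks * 10 MiB = 320 MiB = 335,544,320 bytes
    let bound = worst_case_bytes(32, 10 * 1024 * 1024);
    assert_eq!(bound, 335_544_320);
    println!("worst-case cache size: {bound} bytes");
}
```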
/// Inserts a pre-execution block into the cache.
/// This does NOT override an existing executed block.
Noting that this will override a PreExecution block, but that's fine since the block root is checked, so the data is exactly the same. Do we care about controlling the state transitions of this enum more tightly?
I don't understand this - it only inserts the block if the slot is None, so if it's already set it will not override. Am I missing something?
I considered a more controlled state transition, but I think it's enough to ensure one direction in the following order:
None -> Pre execution -> Executed
We can never override an executed block.
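The one-way transition described above can be sketched like this. It is a minimal model, not Lighthouse's actual types: `CachedBlock`, `put_pre_execution`, and `mark_executed` are hypothetical names, and a `String` stands in for the real block.

```rust
// One-way cache slot transitions: None -> PreExecution -> Executed.
// A pre-execution insert only fills an empty slot; an Executed entry is
// never overwritten or downgraded.
#[derive(Clone, Debug, PartialEq)]
enum CachedBlock {
    PreExecution(String), // placeholder for the unverified block
    Executed(String),     // placeholder for the execution-validated block
}

fn put_pre_execution(slot: &mut Option<CachedBlock>, block: String) {
    // Only insert if nothing is cached yet. Re-inserting for the same block
    // root (or attempting to downgrade an Executed entry) is a no-op.
    if slot.is_none() {
        *slot = Some(CachedBlock::PreExecution(block));
    }
}

fn mark_executed(slot: &mut Option<CachedBlock>, block: String) {
    // Executed always wins: it upgrades PreExecution or fills an empty slot.
    *slot = Some(CachedBlock::Executed(block));
}

fn main() {
    let mut slot = None;
    put_pre_execution(&mut slot, "block".into());
    assert_eq!(slot, Some(CachedBlock::PreExecution("block".into())));

    mark_executed(&mut slot, "block".into());
    // A late pre-execution insert must NOT downgrade the executed entry.
    put_pre_execution(&mut slot, "block".into());
    assert_eq!(slot, Some(CachedBlock::Executed("block".into())));
    println!("transitions ok");
}
```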
…ow re-processing.
pawanjay176
left a comment
This is looking good to me. Just a question.
Also, it would be good to test this on the devnet by stopping the EL node and checking that it recovers as expected.
// chain to get stuck temporarily if the block is canonical. Therefore we remove
// it from the cache if execution fails.
self.data_availability_checker
    .remove_block_on_execution_error(&block_root);
I agree it's safer to just chuck it out of the cache and let lookup fetch the block, regardless of the type of execution error.
Yep, I was able to see a block lookup sent immediately after the engine was restarted.
This pull request has been removed from the queue for the following reason: the merge conditions cannot be satisfied due to failing checks. You can check the last failing draft PR here: #8089. You may have to fix your CI before adding the pull request to the queue again.
…8112) - PR #8045 introduced a regression in how lookup sync interacts with the da_checker. On unstable, block import from the HTTP API now also inserts the block into the da_checker while the block is being execution verified. If lookup sync finds the block in the da_checker in `NotValidated` state, it expects a `GossipBlockProcessResult` message sometime later. That message is only sent after block import via gossip. I confirmed in our node's logs that 4/4 cases of stuck lookups were caused by this sequence of events:
- Receive block through the API; insert into the da_checker in fn process_block via put_pre_execution_block
- Create a lookup and leave it in the AwaitingDownload("block in processing cache") state
- Block from the HTTP API finishes importing
- Lookup is left stuck
Closes #8104 - #8110 was my initial solution attempt, but we can't send the `GossipBlockProcessResult` event from the `http_api` crate without adding new channels, which seems messy. For a given node, it's rare that a lookup is created at the same time that a block is being published. This PR solves #8104 by allowing lookup sync to import the block twice in that case. Co-Authored-By: dapplion <[email protected]>
Issue Addressed
This PR consolidates the
`reqresp_pre_import_cache` into the `data_availability_checker` for the following reasons:
- `reqresp_pre_import_cache` suffers from the same TOCTOU bug we had with `data_availability_checker` earlier, and leads to an unbounded memory leak, which we have observed over the last 6 months on some nodes.
- `reqresp_pre_import_cache` is no longer necessary, because we now hold blocks in the `data_availability_checker` for longer (Fix data availability checker race condition causing partial data columns to be served over RPC #7961), and recent blocks can be served from the DA checker.

This PR also maintains the following functionalities
`data_availability_checker` instead.