Skip to content

Commit d457cee

Browse files
authored
Don't create child lookup if parent is faulty (#7118)
Issue discovered on PeerDAS devnet (node `lighthouse-geth-2.peerdas-devnet-5.ethpandaops.io`). Summary: - A lookup is created for block root `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` - That block or a parent is faulty and `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` is added to the failed chains cache - We later receive a block that is a child of a child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` - We create a lookup, which attempts to process the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12` and hit a processor error `UnknownParent`, hitting this line https://github.com/sigp/lighthouse/blob/bf955c7543dac8911a6f6c334b5b3ca4ef728d9c/beacon_node/network/src/sync/block_lookups/mod.rs#L686-L688 `search_parent_of_child` does not create a parent lookup because the parent root is in the failed chain cache. However, we have **already** marked the child as awaiting the parent. This results in an inconsistent state of lookup sync, as there's a lookup awaiting a parent that doesn't exist. Now we have a lookup (the child of `0x28299de15843970c8ea4f95f11f07f75e76a690f9a8af31d354c38505eebbe12`) that is awaiting a parent lookup that doesn't exist: hence stuck. ### Impact This bug can affect Mainnet as well as PeerDAS devnets. This bug may stall lookup sync for a few minutes (up to `LOOKUP_MAX_DURATION_STUCK_SECS = 15 min`) until the stuck prune routine deletes it. By that time the root will be cleared from the failed chain cache and sync should succeed. During that time the user will see a lot of `WARN` logs when attempting to add each peer to the inconsistent lookup. We may also sync the block through range sync if we fall behind by more than 2 epochs. We may also create the parent lookup successfully after the failed cache clears and complete the child lookup. This bug is triggered if: - We have a lookup that fails and its root is added to the failed chain cache (much more likely to happen in PeerDAS networks) - We receive a block that builds on a child of the block added to the failed chain cache Ensure that we never create (or leave existing) a lookup that references a non-existing parent. I added `must_use` lints to the functions that create lookups. To fix the specific bug we must recursively drop the child lookup if the parent is not created. So if `search_parent_of_child` returns `false` now return `LookupRequestError::Failed` instead of `LookupResult::Pending`. As a bonus I have a added more logging and reason strings to the errors
1 parent 9a49720 commit d457cee

File tree

4 files changed

+119
-25
lines changed

4 files changed

+119
-25
lines changed

beacon_node/network/src/sync/block_lookups/mod.rs

Lines changed: 42 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -106,7 +106,7 @@ pub type SingleLookupId = u32;
106106
enum Action {
107107
Retry,
108108
ParentUnknown { parent_root: Hash256 },
109-
Drop,
109+
Drop(/* reason: */ String),
110110
Continue,
111111
}
112112

@@ -194,19 +194,22 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
194194

195195
/// Creates a parent lookup for the block with the given `block_root` and immediately triggers it.
196196
/// If a parent lookup exists or is triggered, a current lookup will be created.
197+
///
198+
/// Returns true if the lookup is created or already exists
197199
#[instrument(parent = None,
198200
level = "info",
199201
fields(service = "lookup_sync"),
200202
name = "lookup_sync",
201203
skip_all
202204
)]
205+
#[must_use = "only reference the new lookup if returns true"]
203206
pub fn search_child_and_parent(
204207
&mut self,
205208
block_root: Hash256,
206209
block_component: BlockComponent<T::EthSpec>,
207210
peer_id: PeerId,
208211
cx: &mut SyncNetworkContext<T>,
209-
) {
212+
) -> bool {
210213
let parent_root = block_component.parent_root();
211214

212215
let parent_lookup_exists =
@@ -223,25 +226,29 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
223226
// the lookup with zero peers to house the block components.
224227
&[],
225228
cx,
226-
);
229+
)
230+
} else {
231+
false
227232
}
228233
}
229234

230235
/// Seach a block whose parent root is unknown.
236+
///
231237
/// Returns true if the lookup is created or already exists
232238
#[instrument(parent = None,
233239
level = "info",
234240
fields(service = "lookup_sync"),
235241
name = "lookup_sync",
236242
skip_all
237243
)]
244+
#[must_use = "only reference the new lookup if returns true"]
238245
pub fn search_unknown_block(
239246
&mut self,
240247
block_root: Hash256,
241248
peer_source: &[PeerId],
242249
cx: &mut SyncNetworkContext<T>,
243-
) {
244-
self.new_current_lookup(block_root, None, None, peer_source, cx);
250+
) -> bool {
251+
self.new_current_lookup(block_root, None, None, peer_source, cx)
245252
}
246253

247254
/// A block or blob triggers the search of a parent.
@@ -256,6 +263,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
256263
name = "lookup_sync",
257264
skip_all
258265
)]
266+
#[must_use = "only reference the new lookup if returns true"]
259267
pub fn search_parent_of_child(
260268
&mut self,
261269
block_root_to_search: Hash256,
@@ -363,6 +371,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
363371
name = "lookup_sync",
364372
skip_all
365373
)]
374+
#[must_use = "only reference the new lookup if returns true"]
366375
fn new_current_lookup(
367376
&mut self,
368377
block_root: Hash256,
@@ -656,7 +665,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
656665
// This is unreachable because RPC blocks do not undergo gossip verification, and
657666
// this error can *only* come from gossip verification.
658667
error!(?block_root, "Single block lookup hit unreachable condition");
659-
Action::Drop
668+
Action::Drop("DuplicateImportStatusUnknown".to_owned())
660669
}
661670
BlockProcessingResult::Ignored => {
662671
// Beacon processor signalled to ignore the block processing result.
@@ -665,14 +674,14 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
665674
component = ?R::response_type(),
666675
"Lookup component processing ignored, cpu might be overloaded"
667676
);
668-
Action::Drop
677+
Action::Drop("Block processing ignored".to_owned())
669678
}
670679
BlockProcessingResult::Err(e) => {
671680
match e {
672681
BlockError::BeaconChainError(e) => {
673682
// Internal error
674683
error!(%block_root, error = ?e, "Beacon chain error processing lookup component");
675-
Action::Drop
684+
Action::Drop(format!("{e:?}"))
676685
}
677686
BlockError::ParentUnknown { parent_root, .. } => {
678687
// Reverts the status of this request to `AwaitingProcessing` holding the
@@ -691,7 +700,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
691700
error = ?e,
692701
"Single block lookup failed. Execution layer is offline / unsynced / misconfigured"
693702
);
694-
Action::Drop
703+
Action::Drop(format!("{e:?}"))
695704
}
696705
BlockError::AvailabilityCheck(e)
697706
if e.category() == AvailabilityCheckErrorCategory::Internal =>
@@ -703,7 +712,7 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
703712
// lookup state transition. This error invalidates both blob and block requests, and we don't know the
704713
// state of both requests. Blobs may have already successfullly processed for example.
705714
// We opt to drop the lookup instead.
706-
Action::Drop
715+
Action::Drop(format!("{e:?}"))
707716
}
708717
other => {
709718
debug!(
@@ -757,19 +766,32 @@ impl<T: BeaconChainTypes> BlockLookups<T> {
757766
}
758767
Action::ParentUnknown { parent_root } => {
759768
let peers = lookup.all_peers();
769+
// Mark lookup as awaiting **before** creating the parent lookup. At this point the
770+
// lookup maybe inconsistent.
760771
lookup.set_awaiting_parent(parent_root);
761-
debug!(
762-
id = lookup.id,
763-
?block_root,
764-
?parent_root,
765-
"Marking lookup as awaiting parent"
766-
);
767-
self.search_parent_of_child(parent_root, block_root, &peers, cx);
768-
Ok(LookupResult::Pending)
772+
let parent_lookup_exists =
773+
self.search_parent_of_child(parent_root, block_root, &peers, cx);
774+
if parent_lookup_exists {
775+
// The parent lookup exist or has been created. It's safe for `lookup` to
776+
// reference the parent as awaiting.
777+
debug!(
778+
id = lookup_id,
779+
?block_root,
780+
?parent_root,
781+
"Marking lookup as awaiting parent"
782+
);
783+
Ok(LookupResult::Pending)
784+
} else {
785+
// The parent lookup is faulty and was not created, we must drop the `lookup` as
786+
// it's in an inconsistent state. We must drop all of its children too.
787+
Err(LookupRequestError::Failed(format!(
788+
"Parent lookup is faulty {parent_root:?}"
789+
)))
790+
}
769791
}
770-
Action::Drop => {
792+
Action::Drop(reason) => {
771793
// Drop with noop
772-
Err(LookupRequestError::Failed)
794+
Err(LookupRequestError::Failed(reason))
773795
}
774796
Action::Continue => {
775797
// Drop this completed lookup only

beacon_node/network/src/sync/block_lookups/single_block_lookup.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -40,7 +40,7 @@ pub enum LookupRequestError {
4040
/// Inconsistent lookup request state
4141
BadState(String),
4242
/// Lookup failed for some other reason and should be dropped
43-
Failed,
43+
Failed(/* reason: */ String),
4444
/// Received MissingComponents when all components have been processed. This should never
4545
/// happen, and indicates some internal bug
4646
MissingComponentsAfterAllProcessed,

beacon_node/network/src/sync/manager.rs

Lines changed: 19 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -929,12 +929,20 @@ impl<T: BeaconChainTypes> SyncManager<T> {
929929
) {
930930
match self.should_search_for_block(Some(slot), &peer_id) {
931931
Ok(_) => {
932-
self.block_lookups.search_child_and_parent(
932+
if self.block_lookups.search_child_and_parent(
933933
block_root,
934934
block_component,
935935
peer_id,
936936
&mut self.network,
937-
);
937+
) {
938+
// Lookup created. No need to log here it's logged in `new_current_lookup`
939+
} else {
940+
debug!(
941+
?block_root,
942+
?parent_root,
943+
"No lookup created for child and parent"
944+
);
945+
}
938946
}
939947
Err(reason) => {
940948
debug!(%block_root, %parent_root, reason, "Ignoring unknown parent request");
@@ -945,8 +953,15 @@ impl<T: BeaconChainTypes> SyncManager<T> {
945953
fn handle_unknown_block_root(&mut self, peer_id: PeerId, block_root: Hash256) {
946954
match self.should_search_for_block(None, &peer_id) {
947955
Ok(_) => {
948-
self.block_lookups
949-
.search_unknown_block(block_root, &[peer_id], &mut self.network);
956+
if self.block_lookups.search_unknown_block(
957+
block_root,
958+
&[peer_id],
959+
&mut self.network,
960+
) {
961+
// Lookup created. No need to log here it's logged in `new_current_lookup`
962+
} else {
963+
debug!(?block_root, "No lookup created for unknown block");
964+
}
950965
}
951966
Err(reason) => {
952967
debug!(%block_root, reason, "Ignoring unknown block request");

beacon_node/network/src/sync/tests/lookups.rs

Lines changed: 57 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1725,6 +1725,63 @@ fn test_parent_lookup_too_deep_grow_ancestor() {
17251725
rig.assert_failed_chain(chain_hash);
17261726
}
17271727

1728+
// Regression test for https://github.com/sigp/lighthouse/pull/7118
1729+
#[test]
1730+
fn test_child_lookup_not_created_for_failed_chain_parent_after_processing() {
1731+
// GIVEN: A parent chain longer than PARENT_DEPTH_TOLERANCE.
1732+
let mut rig = TestRig::test_setup();
1733+
let mut blocks = rig.rand_blockchain(PARENT_DEPTH_TOLERANCE + 1);
1734+
let peer_id = rig.new_connected_peer();
1735+
1736+
// The child of the trigger block to be used to extend the chain.
1737+
let trigger_block_child = blocks.pop().unwrap();
1738+
// The trigger block that starts the lookup.
1739+
let trigger_block = blocks.pop().unwrap();
1740+
let tip_root = trigger_block.canonical_root();
1741+
1742+
// Trigger the initial unknown parent block for the tip.
1743+
rig.trigger_unknown_parent_block(peer_id, trigger_block.clone());
1744+
1745+
// Simulate the lookup chain building up via `ParentUnknown` errors.
1746+
for block in blocks.into_iter().rev() {
1747+
let id = rig.expect_block_parent_request(block.canonical_root());
1748+
rig.parent_lookup_block_response(id, peer_id, Some(block.clone()));
1749+
rig.parent_lookup_block_response(id, peer_id, None);
1750+
rig.expect_block_process(ResponseType::Block);
1751+
rig.parent_block_processed(
1752+
tip_root,
1753+
BlockProcessingResult::Err(BlockError::ParentUnknown {
1754+
parent_root: block.parent_root(),
1755+
}),
1756+
);
1757+
}
1758+
1759+
// At this point, the chain should have been deemed too deep and pruned.
1760+
// The tip root should have been inserted into failed chains.
1761+
rig.assert_failed_chain(tip_root);
1762+
rig.expect_no_penalty_for(peer_id);
1763+
1764+
// WHEN: Trigger the extending block that points to the tip.
1765+
let trigger_block_child_root = trigger_block_child.canonical_root();
1766+
rig.trigger_unknown_block_from_attestation(trigger_block_child_root, peer_id);
1767+
let id = rig.expect_block_lookup_request(trigger_block_child_root);
1768+
rig.single_lookup_block_response(id, peer_id, Some(trigger_block_child.clone()));
1769+
rig.single_lookup_block_response(id, peer_id, None);
1770+
rig.expect_block_process(ResponseType::Block);
1771+
rig.single_block_component_processed(
1772+
id.lookup_id,
1773+
BlockProcessingResult::Err(BlockError::ParentUnknown {
1774+
parent_root: tip_root,
1775+
}),
1776+
);
1777+
1778+
// THEN: The extending block should not create a lookup because the tip was inserted into failed chains.
1779+
rig.expect_no_active_lookups();
1780+
// AND: The peer should be penalized for extending a failed chain.
1781+
rig.expect_single_penalty(peer_id, "failed_chain");
1782+
rig.expect_empty_network();
1783+
}
1784+
17281785
#[test]
17291786
fn test_parent_lookup_too_deep_grow_tip() {
17301787
let mut rig = TestRig::test_setup();

0 commit comments

Comments
 (0)