Skip to content

node/manager: synthesize node deletion events#33278

Merged
aanm merged 2 commits intocilium:mainfrom
bimmlerd:pr/bimmlerd/startup-node-pruning
Jul 4, 2024
Merged

node/manager: synthesize node deletion events#33278
aanm merged 2 commits intocilium:mainfrom
bimmlerd:pr/bimmlerd/startup-node-pruning

Conversation

@bimmlerd
Copy link
Copy Markdown
Member

@bimmlerd bimmlerd commented Jun 20, 2024

When the cilium agent is down (due to a crash or an upgrade), it can miss node events. Upon startup, live nodes are upserted, but when deletions are missed, the agent fails to clean up node-related system state. Examples of such state includes bpf map entries, xfrm states or routes. In particular, the agent fails to clean up node IP to nodeID mappings in the nodeid bpf map. Since K8s will happily recycle such IPs, this can lead to breakage, as the agent associate the wrong nodeID with IPs.

To avoid leaking this state, the node manager now dumps its view of the current set of nodes to a file in the runtime state directory, which can be read on restart of an agent. This is similar to how we restore other state upon restart.

When reading this file, it's important to avoid resurrecting long-gone nodes (as we don't know for how long the agent was down) - instead, we merely take note of which nodes we knew of in the past, compare that to the nodes we consider live (once synced to k8s), and delete the ones which seem to have disappeared.

The motivation to build this reconciliation based on full state dumps to disk is that downstream code generally assumes to have access to a full node object in the deletion callbacks. This makes is infeasible to base the pruning on just the information available in bpf maps. In an alternative design, downstream subsystems are responsible for cleaning up their own state based on just a node identifier, but current code doesn't allow for this.

Fixes: #29822

The cilium agent now cleans up stale nodeID mappings and other node-related state on startup

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 20, 2024
@bimmlerd bimmlerd added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/agent Cilium agent related. labels Jun 20, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 20, 2024
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/startup-node-pruning branch 3 times, most recently from 3fb8e90 to 5d8c666 Compare June 20, 2024 13:35
@bimmlerd
Copy link
Copy Markdown
Member Author

/test

@bimmlerd bimmlerd marked this pull request as ready for review June 20, 2024 14:48
@bimmlerd bimmlerd requested review from a team as code owners June 20, 2024 14:48
@bimmlerd bimmlerd requested review from ldelossa and thorn3r June 20, 2024 14:48
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/startup-node-pruning branch from 5d8c666 to d19338c Compare June 21, 2024 11:11
@bimmlerd
Copy link
Copy Markdown
Member Author

/scale-100

@bimmlerd
Copy link
Copy Markdown
Member Author

bimmlerd commented Jun 21, 2024

/test (CI was green before the force push, but I wanted to have access to /scale-100)

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work! Looks excellent to me. Two very minor things

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/startup-node-pruning branch from d19338c to bdd860b Compare June 25, 2024 12:42
@bimmlerd
Copy link
Copy Markdown
Member Author

/test

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now!

Copy link
Copy Markdown
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work 👍

@bimmlerd bimmlerd force-pushed the pr/bimmlerd/startup-node-pruning branch from bdd860b to d0f2d92 Compare June 27, 2024 09:07
@bimmlerd bimmlerd requested a review from a team as a code owner June 27, 2024 09:07
@bimmlerd bimmlerd requested a review from nebril June 27, 2024 09:07
Copy link
Copy Markdown
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor comment left inline.

bimmlerd added 2 commits June 28, 2024 08:50
Clearing the environment in the middle of the test can cause failures
related to state being deleted, as the "environment" being cleared is
simply the StateDir of the agent.

Fixes: 940b186 ("test/controlplane: Fix tests after removal of global hives")

Signed-off-by: David Bimmler <[email protected]>
When the cilium agent is down (due to a crash or an upgrade), it can
miss node events. Upon startup, live nodes are upserted, but when
deletions are missed, the agent fails to clean up node-related system
state. Examples of such state includes bpf map entries, xfrm states or
routes. In particular, the agent fails to clean up node IP to nodeID
mappings in the nodeid bpf map. Since K8s will happily recycle such IPs,
this can lead to breakage, as the agent associate the wrong nodeID with
IPs.

To avoid leaking this state, the node manager now dumps its view of the
current set of nodes to a file in the runtime state directory, which can
be read on restart of an agent. This is similar to how we restore other
state upon restart.

When reading this file, it's important to avoid resurrecting long-gone
nodes (as we don't know for how long the agent was down) - instead, we
merely take note of which nodes we knew of in the past, compare that to
the nodes we consider live (once synced to k8s), and delete the ones
which seem to have disappeared.

The motivation to build this reconciliation based on full state dumps to
disk is that downstream code generally assumes to have access to a full
node object in the deletion callbacks. This makes is infeasible to base
the pruning on just the information available in bpf maps. In an
alternative design, downstream subsystems are responsible for cleaning
up their own state based on just a node identifier, but current code
doesn't allow for this.

Signed-off-by: David Bimmler <[email protected]>
@bimmlerd bimmlerd force-pushed the pr/bimmlerd/startup-node-pruning branch from d0f2d92 to e1871a7 Compare June 28, 2024 06:50
@bimmlerd
Copy link
Copy Markdown
Member Author

bimmlerd commented Jul 1, 2024

/test

@bimmlerd
Copy link
Copy Markdown
Member Author

bimmlerd commented Jul 1, 2024

@ldelossa friendly ping for review

@aanm aanm enabled auto-merge July 4, 2024 14:41
@aanm aanm added this pull request to the merge queue Jul 4, 2024
Merged via the queue into cilium:main with commit b855b25 Jul 4, 2024
@bimmlerd
Copy link
Copy Markdown
Member Author

bimmlerd commented Jul 5, 2024

marking for backport to 1.16 as it's a bugfix

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/agent Cilium agent related. backport-done/1.16 The backport for Cilium 1.16.x for this PR is done. kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cilium should reconcile missed node delete events

8 participants