node/manager: synthesize node deletion events #33278
Merged
aanm merged 2 commits into cilium:main on Jul 4, 2024
Conversation
Force-pushed 3fb8e90 to 5d8c666
Member (Author)
/test
Force-pushed 5d8c666 to d19338c
Member (Author)
/scale-100
Member (Author)
/test (CI was green before the force push, but I wanted to have access to …
gandro reviewed Jun 24, 2024
Member gandro left a comment:
Awesome work! Looks excellent to me. Two very minor things.
Force-pushed d19338c to bdd860b
Member (Author)
/test
Force-pushed bdd860b to d0f2d92
nebril approved these changes Jun 27, 2024
Member nebril left a comment:
LGTM, one minor comment left inline.
Clearing the environment in the middle of the test can cause failures related to state being deleted, as the "environment" being cleared is simply the StateDir of the agent.

Fixes: 940b186 ("test/controlplane: Fix tests after removal of global hives")

Signed-off-by: David Bimmler <[email protected]>
When the cilium agent is down (due to a crash or an upgrade), it can miss node events. Upon startup, live nodes are upserted, but when deletions are missed, the agent fails to clean up node-related system state. Examples of such state include bpf map entries, xfrm states or routes. In particular, the agent fails to clean up node IP to nodeID mappings in the nodeid bpf map. Since K8s will happily recycle such IPs, this can lead to breakage, as the agent associates the wrong nodeID with IPs.

To avoid leaking this state, the node manager now dumps its view of the current set of nodes to a file in the runtime state directory, which can be read on restart of an agent. This is similar to how we restore other state upon restart.

When reading this file, it's important to avoid resurrecting long-gone nodes (as we don't know for how long the agent was down); instead, we merely take note of which nodes we knew of in the past, compare that to the nodes we consider live (once synced to k8s), and delete the ones which seem to have disappeared.

The motivation to build this reconciliation based on full state dumps to disk is that downstream code generally assumes it has access to a full node object in the deletion callbacks. This makes it infeasible to base the pruning on just the information available in bpf maps. In an alternative design, downstream subsystems would be responsible for cleaning up their own state based on just a node identifier, but current code doesn't allow for this.

Signed-off-by: David Bimmler <[email protected]>
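For illustration, a minimal sketch of what the checkpoint half of this could look like (all names here, `Node`, `checkpoint`, `nodes.json`, are hypothetical stand-ins, not the actual Cilium code): the manager serializes its current node set to a JSON file in the state directory whenever its view changes.

```go
package nodestore

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// Node is a stand-in for the manager's node type; the real agent persists
// full node objects so that deletion handlers have all the fields they need.
type Node struct {
	Name string `json:"name"`
	IP   string `json:"ip"`
}

// checkpointFile is a hypothetical file name inside the agent's StateDir.
const checkpointFile = "nodes.json"

// checkpoint atomically writes the manager's current view of the cluster's
// nodes to the runtime state directory.
func checkpoint(stateDir string, nodes []Node) error {
	tmp, err := os.CreateTemp(stateDir, checkpointFile+".tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if the rename never happens

	if err := json.NewEncoder(tmp).Encode(nodes); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems, so a restarting agent never
	// observes a partially written checkpoint.
	return os.Rename(tmp.Name(), filepath.Join(stateDir, checkpointFile))
}
```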
Force-pushed d0f2d92 to e1871a7
Member (Author)
/test
Member (Author)
@ldelossa friendly ping for review
borkmann approved these changes Jul 4, 2024
Member (Author)
Marking for backport to 1.16 as it's a bugfix.
When the cilium agent is down (due to a crash or an upgrade), it can miss node events. Upon startup, live nodes are upserted, but when deletions are missed, the agent fails to clean up node-related system state. Examples of such state include bpf map entries, xfrm states or routes. In particular, the agent fails to clean up node IP to nodeID mappings in the nodeid bpf map. Since K8s will happily recycle such IPs, this can lead to breakage, as the agent associates the wrong nodeID with IPs.
To avoid leaking this state, the node manager now dumps its view of the current set of nodes to a file in the runtime state directory, which can be read on restart of an agent. This is similar to how we restore other state upon restart.
When reading this file, it's important to avoid resurrecting long-gone nodes (as we don't know for how long the agent was down) - instead, we merely take note of which nodes we knew of in the past, compare that to the nodes we consider live (once synced to k8s), and delete the ones which seem to have disappeared.
The motivation to build this reconciliation based on full state dumps to disk is that downstream code generally assumes it has access to a full node object in the deletion callbacks. This makes it infeasible to base the pruning on just the information available in bpf maps. In an alternative design, downstream subsystems would be responsible for cleaning up their own state based on just a node identifier, but current code doesn't allow for this.
Fixes: #29822
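To make the pruning step concrete, here is a rough sketch building on the hypothetical `Node` and `checkpointFile` from the checkpoint sketch above (again, not the actual implementation): the previous checkpoint is read back, compared against the live set once the agent has synced with k8s, and a synthetic deletion is emitted for every node that is gone.

```go
package nodestore

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// restore reads the checkpoint written by a previous agent instance.
// A missing file simply means there is nothing to prune.
func restore(stateDir string) ([]Node, error) {
	f, err := os.Open(filepath.Join(stateDir, checkpointFile))
	if os.IsNotExist(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var nodes []Node
	if err := json.NewDecoder(f).Decode(&nodes); err != nil {
		return nil, err
	}
	return nodes, nil
}

// pruneStale calls onDelete for every node that was present in the previous
// checkpoint but is absent from the live set observed after syncing with k8s.
// The full node object from the checkpoint is handed to the callback, since
// downstream deletion handlers expect more than a bare identifier.
func pruneStale(previous, live []Node, onDelete func(Node)) {
	liveNames := make(map[string]struct{}, len(live))
	for _, n := range live {
		liveNames[n.Name] = struct{}{}
	}
	for _, n := range previous {
		if _, ok := liveNames[n.Name]; !ok {
			onDelete(n) // synthesize the missed node deletion event
		}
	}
}
```

Note that in this sketch the checkpoint is only ever used to notice nodes that disappeared while the agent was down; it is never used to resurrect a node that is absent from the live set, matching the intent described above.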