Skip to content

endpoint: fix identity manager refCount issues causing leaks#42662

Merged
squeed merged 2 commits intocilium:mainfrom
odinuge:odinuge/fix-cip-leak-2
Nov 21, 2025
Merged

endpoint: fix identity manager refCount issues causing leaks#42662
squeed merged 2 commits intocilium:mainfrom
odinuge:odinuge/fix-cip-leak-2

Conversation

@odinuge
Copy link
Copy Markdown
Member

@odinuge odinuge commented Nov 7, 2025

This fixes a refCount issue in the identity manager in the case where an agent is restoring endpoints that are in the process of being stopped and cleaned up. In that case, the identity might never be added to the identity manger, but it will unconditionally be removed from it when its being cleaned up.

See the commit message for more information.

Keeping as a draft now to run tests. I'll add some better test cases here, and spend some time creating a repro so that its easier to reason about.

Fix agent local identity leak 

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 7, 2025
@odinuge odinuge mentioned this pull request Nov 7, 2025
8 tasks
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Nov 7, 2025

/test

@squeed
Copy link
Copy Markdown
Contributor

squeed commented Nov 10, 2025

The IdentityManager is superfluous now that we have the EndpointManager. It already knows the security identity of every endpoint without needing to resort to fragile refcounting.

Additionally, the only subscriber for the identity manager is, in fact, the policy cache. I understand we have a proposal to change its lifecycle as well.

So, perhaps we can completely get rid of this codepath?

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Nov 12, 2025

I don't fully understand what you mean by superfluous here. Its still in use, and it will be non-trivial to use the endpointmanager as-is to do the same job, especially for backports. We also plan to still use the identity manager in #42580.

I somewhat agree it would be interesting to rewrite some parts of this, and use the endpoint manager for some of these things - but I don't think its the correct call now. We have also seen some locking issues around endpoint manager vs. endpoints, so I'm a bit worried (although this has been older versions, so might not be true any more) that could cause some more problems down the line if not done very carefully.

This fixes a refCount issue in the identity manager in the case where an
agent is restoring endpoints that are in the process of being stopped
and cleaned up. In that case, the identity might never be added to the
identity manager, but it will unconditionally be removed from it when its
being cleaned up.

This is seemingly fine in the base case with a single
endpoint, but if there are two or more sharing the same identity, we are
approaching muddy water. In that case the identity will be
upserted to the identity manager by the endpoint that is running, and
then removed by the one that is not running. That results in the
selector policy being removed _and_ detached when the endpoint is
running. This has the potential to cause a policy deadlock on the
running endpoint, since it is running with a detached selector policy -
not getting new updates. Also, prior to
36c9bba ("policy/cache: simplify cache entry creation flow"), introduced
in v1.17, this also caused a permanently broken endpoint that is unable
to regenerate. In v1.17 and newer, the selector policy is instead leaked
when the endpoint then is deleted. This effectively caused that
mentioned patch to hide this bug, making it harder to notice.

We have seen this cause endpoints to be stuck for hours until they are
deleted, without being able to get new policy updates.

Signed-off-by: Odin Ugedal <[email protected]>
This ensures that every time the SetIdentity func is called, it will
properly release the old reference in case one is set. This is also the
case when an endpoint is torn down, since we then want to ensure that we
only release the identity if it was actually set.

Signed-off-by: Odin Ugedal <[email protected]>
Signed-off-by: Odin Ugedal <[email protected]>
@odinuge odinuge force-pushed the odinuge/fix-cip-leak-2 branch from ed9e5c8 to 432fa8b Compare November 18, 2025 12:47
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Nov 18, 2025

/test

@squeed
Copy link
Copy Markdown
Contributor

squeed commented Nov 18, 2025

I don't fully understand what you mean by superfluous here.

Now that some more PRs have surfaced, it is no longer superfluous. At the time of that comment, all it was used for was GCing SelectorPolicies.

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Nov 19, 2025

/test

@odinuge odinuge marked this pull request as ready for review November 19, 2025 12:35
@odinuge odinuge requested review from a team as code owners November 19, 2025 12:35
@odinuge odinuge requested a review from jrajahalme November 19, 2025 12:35
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Nov 21, 2025

/ci-kpr

@squeed squeed added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Nov 21, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Nov 21, 2025
@squeed squeed enabled auto-merge November 21, 2025 10:10
@squeed squeed added this pull request to the merge queue Nov 21, 2025
Merged via the queue into cilium:main with commit f476c33 Nov 21, 2025
77 of 79 checks passed
@christarazi christarazi added the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Dec 2, 2025
@tklauser tklauser mentioned this pull request Dec 3, 2025
7 tasks
@tklauser tklauser removed the needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch label Dec 3, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Dec 4, 2025
@christarazi christarazi added needs-backport/1.16 needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Jan 27, 2026
@mhofstetter mhofstetter mentioned this pull request Jan 29, 2026
2 tasks
@mhofstetter mhofstetter added the backport/author The backport will be carried out by the author of the PR. label Jan 29, 2026
@mhofstetter
Copy link
Copy Markdown
Member

mhofstetter commented Jan 29, 2026

@odinuge, @squeed, @christarazi: Labelled with backport/author due to conflicts that need some extra attention.

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Feb 2, 2026

Ack. I'll try to get a set of manual backports going, but can't promise an ETA since I have a lot of other work in flight.

brb added a commit that referenced this pull request Feb 3, 2026
The new code path which can lead to the error was added by
#42662.

Signed-off-by: Martynas Pumputis <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Feb 3, 2026
The new code path which can lead to the error was added by
#42662.

Signed-off-by: Martynas Pumputis <[email protected]>
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
@julianwiedmann julianwiedmann added backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. affects/v1.16 This issue affects v1.16 branch and removed needs-backport/1.16 backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Feb 5, 2026
christarazi added a commit to christarazi/cilium that referenced this pull request Feb 13, 2026
Due to cilium#42661 and
cilium#42662 not being backported yet to
v1.17, CI fails in the upgrade/downgrade test with this error.
Therefore, we must add it to the ignore list until the PRs are at least
backported to v1.17.

The error was removed from the ignore list in
cilium#42982.

Suggested-by: Marco Iorio <[email protected]>
Suggested-by: Casey Callendrello <[email protected]>
Signed-off-by: Chris Tarazi <[email protected]>
christarazi added a commit to christarazi/cilium that referenced this pull request Feb 13, 2026
Due to cilium#42661 and
cilium#42662 not being backported yet to
v1.17, CI fails in the upgrade/downgrade test with this error.
Therefore, we must add it to the ignore list until the PRs are at least
backported to v1.17.

The error was removed from the ignore list in
cilium#42982.

Suggested-by: Marco Iorio <[email protected]>
Suggested-by: Casey Callendrello <[email protected]>
Signed-off-by: Chris Tarazi <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Feb 13, 2026
Due to #42661 and
#42662 not being backported yet to
v1.17, CI fails in the upgrade/downgrade test with this error.
Therefore, we must add it to the ignore list until the PRs are at least
backported to v1.17.

The error was removed from the ignore list in
#42982.

Suggested-by: Marco Iorio <[email protected]>
Suggested-by: Casey Callendrello <[email protected]>
Signed-off-by: Chris Tarazi <[email protected]>