endpoint: fix identity manager refCount issues causing leaks#42662
endpoint: fix identity manager refCount issues causing leaks#42662squeed merged 2 commits intocilium:mainfrom
Conversation
|
/test |
|
The IdentityManager is superfluous now that we have the EndpointManager. It already knows the security identity of every endpoint without needing to resort to fragile refcounting. Additionally, the only subscriber for the identity manager is, in fact, the policy cache. I understand we have a proposal to change its lifecycle as well. So, perhaps we can completely get rid of this codepath? |
|
I don't fully understand what you mean by superfluous here. Its still in use, and it will be non-trivial to use the endpointmanager as-is to do the same job, especially for backports. We also plan to still use the identity manager in #42580. I somewhat agree it would be interesting to rewrite some parts of this, and use the endpoint manager for some of these things - but I don't think its the correct call now. We have also seen some locking issues around endpoint manager vs. endpoints, so I'm a bit worried (although this has been older versions, so might not be true any more) that could cause some more problems down the line if not done very carefully. |
This fixes a refCount issue in the identity manager in the case where an agent is restoring endpoints that are in the process of being stopped and cleaned up. In that case, the identity might never be added to the identity manager, but it will unconditionally be removed from it when its being cleaned up. This is seemingly fine in the base case with a single endpoint, but if there are two or more sharing the same identity, we are approaching muddy water. In that case the identity will be upserted to the identity manager by the endpoint that is running, and then removed by the one that is not running. That results in the selector policy being removed _and_ detached when the endpoint is running. This has the potential to cause a policy deadlock on the running endpoint, since it is running with a detached selector policy - not getting new updates. Also, prior to 36c9bba ("policy/cache: simplify cache entry creation flow"), introduced in v1.17, this also caused a permanently broken endpoint that is unable to regenerate. In v1.17 and newer, the selector policy is instead leaked when the endpoint then is deleted. This effectively caused that mentioned patch to hide this bug, making it harder to notice. We have seen this cause endpoints to be stuck for hours until they are deleted, without being able to get new policy updates. Signed-off-by: Odin Ugedal <[email protected]>
This ensures that every time the SetIdentity func is called, it will properly release the old reference in case one is set. This is also the case when an endpoint is torn down, since we then want to ensure that we only release the identity if it was actually set. Signed-off-by: Odin Ugedal <[email protected]> Signed-off-by: Odin Ugedal <[email protected]>
ed9e5c8 to
432fa8b
Compare
|
/test |
Now that some more PRs have surfaced, it is no longer superfluous. At the time of that comment, all it was used for was GCing SelectorPolicies. |
|
/test |
|
/ci-kpr |
|
@odinuge, @squeed, @christarazi: Labelled with |
|
Ack. I'll try to get a set of manual backports going, but can't promise an ETA since I have a lot of other work in flight. |
The new code path which can lead to the error was added by #42662. Signed-off-by: Martynas Pumputis <[email protected]>
The new code path which can lead to the error was added by #42662. Signed-off-by: Martynas Pumputis <[email protected]>
Due to cilium#42661 and cilium#42662 not being backported yet to v1.17, CI fails in the upgrade/downgrade test with this error. Therefore, we must add it to the ignore list until the PRs are at least backported to v1.17. The error was removed from the ignore list in cilium#42982. Suggested-by: Marco Iorio <[email protected]> Suggested-by: Casey Callendrello <[email protected]> Signed-off-by: Chris Tarazi <[email protected]>
Due to cilium#42661 and cilium#42662 not being backported yet to v1.17, CI fails in the upgrade/downgrade test with this error. Therefore, we must add it to the ignore list until the PRs are at least backported to v1.17. The error was removed from the ignore list in cilium#42982. Suggested-by: Marco Iorio <[email protected]> Suggested-by: Casey Callendrello <[email protected]> Signed-off-by: Chris Tarazi <[email protected]>
Due to #42661 and #42662 not being backported yet to v1.17, CI fails in the upgrade/downgrade test with this error. Therefore, we must add it to the ignore list until the PRs are at least backported to v1.17. The error was removed from the ignore list in #42982. Suggested-by: Marco Iorio <[email protected]> Suggested-by: Casey Callendrello <[email protected]> Signed-off-by: Chris Tarazi <[email protected]>
This fixes a refCount issue in the identity manager in the case where an agent is restoring endpoints that are in the process of being stopped and cleaned up. In that case, the identity might never be added to the identity manger, but it will unconditionally be removed from it when its being cleaned up.
See the commit message for more information.
Keeping as a draft now to run tests. I'll add some better test cases here, and spend some time creating a repro so that its easier to reason about.