Skip to content

endpoint: stop regenerating all endpoints on every identity allocation; switch to periodic regens instead.#35815

Merged
joestringer merged 6 commits intocilium:mainfrom
squeed:no-more-regen-on-alloc
Dec 6, 2024
Merged

endpoint: stop regenerating all endpoints on every identity allocation; switch to periodic regens instead.#35815
joestringer merged 6 commits intocilium:mainfrom
squeed:no-more-regen-on-alloc

Conversation

@squeed
Copy link
Copy Markdown
Contributor

@squeed squeed commented Nov 6, 2024

Please review by commit.

Prior to this change, there were two possible ways a newly-allocated
identity would be added to the SelectorCache. The first was via the
ipcache, which allocates with notifyOwner: false, taking the
responsibility of propagating updates directly.

The IPCache first updates the SelectorCache, then triggers incremental
policy map updates on all endpoints. This works well.

Alternatively, an identity could be allocated by the global allocator.
In this case, notifyOwner: true was passed, and the allocator itself
took responsibility for updating the SelectorCache. This would
regenerate all endpoints, rather than merely applying incremental
updates. This also works, but is wasteful.

Finally, if a remote node allocates an identity, this will be
synchronized via the Allocator backend, sent on a watch channel,
inserted in to the SelectorCache, and all endpoints will be regenerated.

Regenerating all endpoints is quite wasteful; we have a functional
incremental policy update mechanism that exactly handles this case.
There were previously issues wherein Envoy would lose updates, but those
have been fixed. So, change it so all identity updates only trigger
incremental updates.

For resilience purposes, we now periodically regenerate all endpoints (default: every 2 minutes). We add a controller to the EndpointManager that handles this. Finally, the redundant PolicyUpdater has lost its trigger, and just delegates to the EndpointManager.

@squeed squeed added the release-note/misc This PR makes changes that have no direct user impact. label Nov 6, 2024
@squeed squeed requested a review from a team as a code owner November 6, 2024 17:13
@squeed squeed requested a review from doniacld November 6, 2024 17:13
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Nov 6, 2024
@squeed squeed changed the title No more regen on alloc policy: stop regenerating all endpoints on every identity allocation Nov 6, 2024
@squeed squeed requested a review from joestringer November 7, 2024 09:15
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Nov 7, 2024

FYI @jrajahalme, I feel safe making this change now that you fixed the leak that caused Envoy to miss some incremental updates.

@squeed squeed force-pushed the no-more-regen-on-alloc branch from fa1ba51 to d08cca2 Compare November 7, 2024 14:20
@joestringer
Copy link
Copy Markdown
Member

Could we end up in future reintroducing a bug where Cilium never gets around to reconciling the proper policy state? The primary reason for doing this regeneration sometimes was to make sure that there's a full recalculation fallback in case some incremental logic was incorrect. Although I think that we shouldn't have to do a bunch of wasteful additional work every time there is a change in the policy engine overall, I think it's interesting to consider what our resiliency mechanisms should look like in order to mitigate such risks.

@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Nov 8, 2024

Could we end up in future reintroducing a bug where Cilium never gets around to reconciling the proper policy state?

Of course, that's absolutely possible. I'm less concerned about policy state, but rather everything else that may be depending on periodic endpoint regeneration. I've given the code a look-through, and it seems OK, which is worth approximately nothing in terms of reassurance :).

I agree that, in the interest of resilience, we should probably periodically regenerate endpoints. Ideally we could detect changes and warn on them, but given that the overlying state may have coincidentally changed, this would be a very noisy signal.

Using identity allocation as a trigger for occasional endpoint regeneration is not a good timing signal. Noisy clusters will regenerate constantly, where clusters with a static set of identities will not -- even when regeneration-relevant configuration may have changed.

So, I can add a periodic regeneration controller to the EndpointManager. That would be more useful for resiliency, and be a potentially significant reduction in CPU usage for certain workloads.

@squeed squeed force-pushed the no-more-regen-on-alloc branch from d08cca2 to 468f229 Compare November 11, 2024 12:29
@squeed squeed requested review from a team as code owners November 11, 2024 12:29
@maintainer-s-little-helper

This comment was marked as resolved.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 11, 2024
@squeed squeed force-pushed the no-more-regen-on-alloc branch from 468f229 to 1337cb2 Compare November 11, 2024 12:29
@maintainer-s-little-helper

This comment was marked as resolved.

@squeed squeed changed the title policy: stop regenerating all endpoints on every identity allocation endpoint: stop regenerating all endpoints on every identity allocation; switch to periodic regens instead. Nov 11, 2024
@squeed squeed force-pushed the no-more-regen-on-alloc branch from 1337cb2 to 375beff Compare November 11, 2024 13:10
@squeed squeed requested a review from a team as a code owner November 11, 2024 13:10
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Nov 11, 2024
@squeed squeed force-pushed the no-more-regen-on-alloc branch from 375beff to 39288d0 Compare November 11, 2024 13:13
@squeed squeed requested a review from a team as a code owner November 11, 2024 13:13
@squeed squeed force-pushed the no-more-regen-on-alloc branch from 39288d0 to 495980a Compare November 11, 2024 15:36
@squeed squeed force-pushed the no-more-regen-on-alloc branch from e651382 to dbb51c7 Compare December 3, 2024 16:54
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 3, 2024

/test

Copy link
Copy Markdown
Member

@gandro gandro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My codeowners look good to me now! Some nits on the docs, but nothing blocking

When an identity is allocated, it needs to be plumbed in to the policy
engine. The mechanism for this is the IdentityAllocatorOwner, which
pushes down identity adds / deletes. All node-local allocations are
synchronously inserted in to the SelectorCache as part of the control
flow.

Separately, there is a channel of events from the global allocator's
backend. As identities are allocated, they are pushed in to the
SelectorCache via this mechanism. This is asynchronous, as part of the
remote changestream.

However, there was an awkward duplication where the local-only
allocation path (i.e. an identity for a local CIDR) did *both* the
synchronous SelectorCache update *as well* as emitting an event on the
event channel.

The event channel is purely internal; it is consumed from within the
allocator cache and the only effect is to update the SelectorCache; same
as the synchronous case.

This commit removes this redundant update. This is safe because
node-local allocation *always* synchronously updates.

Signed-off-by: Casey Callendrello <[email protected]>
Prior to this change, there were two possible ways a newly-allocated
identity would be added to the SelectorCache. The first was via the
ipcache, which allocates with `notifyOwner: false`, taking the
responsibility of propagating updates directly.

The IPCache first updates the SelectorCache, then triggers incremental
policy map updates on all endpoints. This works well.

Alternatively, an identity could be allocated by the global allocator.
In this case, `notifyOwner: true` was passed, and the allocator itself
took responsibility for updating the SelectorCache. This would
regenerate all endpoints, rather than merely applying incremental
updates. This also works, but is wasteful.

Finally, if a remote node allocates an identity, this will be
synchronized via the Allocator backend, sent on a watch channel,
inserted in to the SelectorCache, and all endpoints will be regenerated.

Regenerating all endpoints is quite wasteful; we have a functional
incremental policy update mechanism that exactly handles this case.
There were previously issues wherein Envoy would lose updates, but those
have been fixed. So, change it so *all* identity updates only trigger
incremental updates. Additionally, batch changes so we don't flood
endpoint locking.

Finally, when an identity is allocated by this node, we will process
this new identity twice: once on allocation, and once when the remote
backend gets the watch event. Add some de-duplication so we don't
needlessly process events.

Signed-off-by: Casey Callendrello <[email protected]>
A previous commit stopped regenerating all endpoints on identity
allocation, which was wasteful. However, it is still useful to have
periodic state reconciliation, in case a configuration edge is somehow
missed.

So, this adds an explicit controller in the EndpointManager that
periodically regenerates all endpoints. By default, the interval is 2
minutes, but it is configurable.

Signed-off-by: Casey Callendrello <[email protected]>
These metrics are not interesting:
1. It incremented a metric labeled with "policy update", which is
   confusing.
2. It ignored the asynchronous nature of endpoint regeneration, so the
   reported duration is incorrect.

Signed-off-by: Casey Callendrello <[email protected]>
The PolicyUpdater is a small trigger that batches requests to regenerate
all endpoints. Now that there is a controller for this directly in the
EndpointManager, we can remove this additional trigger.

Signed-off-by: Casey Callendrello <[email protected]>
The metrics `cilium_triggers_policy_update_total` and
`cilium_triggers_policy_update_folds` have been removed. They were never
particularly accurate, as this trigger was only used when auxiliary
resources such as CiliumEnvoyConfigs were updated, rather than actual
policy updates. Furthermore, the measured duration was incorrect as
the actual updates were asynchronous.

Signed-off-by: Casey Callendrello <[email protected]>
@squeed squeed force-pushed the no-more-regen-on-alloc branch from dbb51c7 to e75284e Compare December 4, 2024 13:48
@squeed
Copy link
Copy Markdown
Contributor Author

squeed commented Dec 4, 2024

/test

Copy link
Copy Markdown
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for @cilium/cli

@squeed squeed requested a review from a user December 5, 2024 09:43
Copy link
Copy Markdown
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I opened some non-blocking threads below but they're more just thinking out loud about some minor tuning that can follow or enquiring for me to learn more 😅

@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Dec 6, 2024
@joestringer joestringer enabled auto-merge December 6, 2024 02:58
@joestringer joestringer added this pull request to the merge queue Dec 6, 2024
Merged via the queue into cilium:main with commit 9e00698 Dec 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants