Fix endpoints status out-of-sync when the pod state changes rapidly #125675
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The review comment below is on these lines of the endpoints controller:

```go
e.podsSynced = podInformer.Informer().HasSynced

endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	UpdateFunc: func(old, cur interface{}) {
```
I'm trying to find the issue where this option was discussed and discarded several releases ago, @thockin @robscott do you remember?

This fixes the issue because the controller processes its own updates; we already do this in other controllers, but it means we are going to do double the work now for each endpoints update.
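For concreteness, here is a minimal sketch of what that option looks like, assuming a work queue keyed by namespace/name; the helper name `addEndpointsUpdateHandler` is illustrative, not the upstream code:

```go
package endpoint

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// addEndpointsUpdateHandler re-queues the owning Service whenever its
// Endpoints object changes, so a sync that acted on a stale cache entry is
// eventually redone. The cost: every write the controller makes comes back
// as an update event and triggers a second sync of the same Service.
func addEndpointsUpdateHandler(factory informers.SharedInformerFactory, queue workqueue.RateLimitingInterface) {
	factory.Core().V1().Endpoints().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(old, cur interface{}) {
			oldEP, okOld := old.(*v1.Endpoints)
			curEP, okCur := cur.(*v1.Endpoints)
			if !okOld || !okCur || oldEP.ResourceVersion == curEP.ResourceVersion {
				return // ignore periodic resyncs that carry no change
			}
			// Endpoints share name/namespace with their Service, so the
			// Endpoints key doubles as the Service work-queue key.
			if key, err := cache.MetaNamespaceKeyFunc(cur); err == nil {
				queue.Add(key)
			}
		},
	})
}
```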
I found this change would break a conformance test: "should test the lifecycle of an Endpoint".

However, the conformance test itself seems to contradict some logic of the endpoints controller:
- The conformance test expects an Endpoints created without a Service to exist.
- The `checkLeftoverEndpoints` function of the endpoints controller triggers syncing of all Endpoints except those having the leader election annotation, and will cause Endpoints without Services to be deleted (kubernetes/pkg/controller/endpoint/endpoints_controller.go, lines 561 to 567 in 921fb0b):
```go
// checkLeftoverEndpoints lists all currently existing endpoints and adds their
// service to the queue. This will detect endpoints that exist with no
// corresponding service; these endpoints need to be deleted. We only need to
// do this once on startup, because in steady-state these are detected (but
// some stragglers could have been left behind if the endpoint controller
// reboots).
func (e *Controller) checkLeftoverEndpoints() {
```
The conformance test gives the impression that a user-created Endpoints (without a Service) can exist, but it will be deleted once kube-controller-manager restarts or fails over. Subscribing to endpoints update events extends the cleanup logic of checkLeftoverEndpoints in a sense.
It seems the conformance test would fail as long as the controller watches endpoints update events.

I'm thinking of another fix which doesn't affect the conformance test and avoids doubling the work: track the stale resource versions of Endpoints after each successful endpoints update, use them to determine whether the Endpoints in the cache is stale when syncing it the next time, and return an error and retry if it is stale. (The reason we don't track generation like the endpointslice controller does is that Endpoints doesn't have a generation.)
The 2nd commit implements the above fix.
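A rough sketch of the idea, assuming the controller keys Endpoints by namespace/name; the type and method names here are illustrative and may differ from the actual commit:

```go
package endpoint

import (
	"sync"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// staleEndpointsTracker remembers the resourceVersion of each Endpoints
// object the controller has just overwritten, so the next sync can tell
// whether the lister cache has caught up yet.
type staleEndpointsTracker struct {
	lock                 sync.RWMutex
	staleResourceVersion map[types.NamespacedName]string
}

func newStaleEndpointsTracker() *staleEndpointsTracker {
	return &staleEndpointsTracker{staleResourceVersion: map[types.NamespacedName]string{}}
}

// Stale records the pre-update resourceVersion: any cached copy still
// carrying it is known to be outdated.
func (t *staleEndpointsTracker) Stale(ep *v1.Endpoints) {
	t.lock.Lock()
	defer t.lock.Unlock()
	t.staleResourceVersion[types.NamespacedName{Namespace: ep.Namespace, Name: ep.Name}] = ep.ResourceVersion
}

// IsStale reports whether the given cached Endpoints is still the copy the
// controller already overwrote on the server.
func (t *staleEndpointsTracker) IsStale(ep *v1.Endpoints) bool {
	t.lock.RLock()
	defer t.lock.RUnlock()
	rv, ok := t.staleResourceVersion[types.NamespacedName{Namespace: ep.Namespace, Name: ep.Name}]
	return ok && rv == ep.ResourceVersion
}

// Delete forgets an Endpoints, e.g. once it has been removed from the
// cluster, so the map doesn't grow without bound.
func (t *staleEndpointsTracker) Delete(namespace, name string) {
	t.lock.Lock()
	defer t.lock.Unlock()
	delete(t.staleResourceVersion, types.NamespacedName{Namespace: namespace, Name: name})
}
```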
Force-pushed from 2a51e4f to ef15f35 (compare)
/retest

/retest
@tnqn sorry for being very picky, but every time we change something here there are always unexpected edge cases. I think that adding an integration test that reproduces the problem (we can do multiple transitions in parallel if necessary to increase the chances of getting a stale cache entry) will give us more confidence.
/test pull-kubernetes-integration

@aojea no problem, I will add one.
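For reference, the rough shape such a test could take, assuming a client connected to a test apiserver with the endpoints controller running; the setup helpers and the "test-pod"/"test-svc" names are placeholders, and the readiness flips could also be parallelized as suggested above:

```go
package endpointstest

import (
	"context"
	"testing"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// testEndpointsConvergeUnderRapidTransitions flips a pod's readiness many
// times in quick succession and asserts the Endpoints object converges to
// the pod's final (not-ready) state instead of getting stuck on a stale
// cache entry. It assumes "test-pod" and "test-svc" were created beforehand.
func testEndpointsConvergeUnderRapidTransitions(ctx context.Context, t *testing.T, client kubernetes.Interface, ns string) {
	for i := 0; i < 20; i++ {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, "test-pod", metav1.GetOptions{})
		if err != nil {
			t.Fatal(err)
		}
		status := v1.ConditionFalse
		if i%2 == 0 {
			status = v1.ConditionTrue
		}
		pod.Status.Conditions = []v1.PodCondition{{Type: v1.PodReady, Status: status}}
		if _, err := client.CoreV1().Pods(ns).UpdateStatus(ctx, pod, metav1.UpdateOptions{}); err != nil {
			t.Fatal(err)
		}
	}
	// The loop ends with the pod not ready, so the Endpoints must eventually
	// list no ready addresses.
	err := wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			ep, err := client.CoreV1().Endpoints(ns).Get(ctx, "test-svc", metav1.GetOptions{})
			if err != nil {
				return false, nil
			}
			for _, subset := range ep.Subsets {
				if len(subset.Addresses) > 0 {
					return false, nil
				}
			}
			return true, nil
		})
	if err != nil {
		t.Fatalf("endpoints never converged to the pod's final state: %v", err)
	}
}
```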
Force-pushed from ef15f35 to e90da46 (compare)

Force-pushed from ca449a4 to baffa09 (compare)
LGTM label has been added.

Git tree hash: 1d1d5e40825aad84a968121a5ff62190938b854d
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, tnqn

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
Thanks @tnqn. The proposed solution makes sense to me. Appreciate the explanation you provided in #125675 (comment).

/lgtm
/lgtm
Thanks @tnqn!

/hold cancel
Actually I guess @aojea said Friday, but I think we likely have enough reviewers that have signed off on this now; will defer to him.

/hold

/hold cancel

There are 3 additional reviews, more than enough 😄
…-upstream-release-1.29 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
…-upstream-release-1.28 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
…-upstream-release-1.30 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
For visibility: backports of this regressed 1.30.3+, 1.29.7+, and 1.28.12+ (see #127370). Make sure we stick to the criteria for backports. Backporting changes which are not regression fixes or critical bugfixes, as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md#what-kind-of-prs-are-good-for-cherry-picks, should be avoided. Every backport carries risk, and in this case we made patch releases worse with a backport.
What type of PR is this?
/kind bug
What this PR does / why we need it:
When Pod state changes rapidly, the endpoints controller may use an outdated informer cache to sync a Service. If the outdated Endpoints object appears to already match what the controller expects, the controller skips updating it.

The commit fixes this by checking whether the endpoints informer cache is outdated when processing a service. If the Endpoints object is stale, the controller returns an error and retries later.
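A simplified sketch of how that check could fit into the sync path, assuming fields like `e.client`, `e.endpointsLister`, and the `staleEndpointsTracker` sketched earlier in the thread (error handling, the creation path, and the subset computation are omitted):

```go
func (e *Controller) syncService(ctx context.Context, key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	currentEndpoints, err := e.endpointsLister.Endpoints(namespace).Get(name)
	if err != nil {
		return err // a NotFound would take the creation path, omitted here
	}
	// Bail out and retry if the lister still returns the copy we already
	// overwrote; comparing against it could wrongly conclude that no update
	// is needed.
	if e.staleEndpointsTracker.IsStale(currentEndpoints) {
		return fmt.Errorf("endpoints informer cache is out of date, resource version %s already processed for endpoints %s",
			currentEndpoints.ResourceVersion, key)
	}
	newEndpoints := currentEndpoints.DeepCopy()
	// ... recompute newEndpoints.Subsets from the service's pods ...
	if _, err := e.client.CoreV1().Endpoints(namespace).Update(ctx, newEndpoints, metav1.UpdateOptions{}); err != nil {
		return err
	}
	// Remember that the cached copy is now stale until the watch delivers
	// the object we just wrote.
	e.staleEndpointsTracker.Stale(currentEndpoints)
	return nil
}
```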
Which issue(s) this PR fixes:
Fixes #125638
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: