Skip to content

Conversation

@tnqn
Copy link
Member

@tnqn tnqn commented Jun 24, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

When Pod state changes rapidly, endpoints controller may use outdated informer cache to sync Service. If the outdated endpoints appear to be expected by the controller, it skips updating it.

The commit fixes it by checking if endpoints informer cache is outdated when processing a service. If the endpoints is stale, it returns an error and retries later.

Which issue(s) this PR fixes:

Fixes #125638

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix endpoints status out-of-sync when the pod state changes rapidly

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 24, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2024
e.podsSynced = podInformer.Informer().HasSynced

endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
UpdateFunc: func(old, cur interface{}) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to find the issue where this option was discussed and discarded several releases ago, @thockin @robscott do you remember?

This fixes the issue because it process its own update, we are already doing it on other controllers, but it means we are going to do double the work now for each endpoint update

Copy link
Member Author

@tnqn tnqn Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this change would break a conformance test: "should test the lifecycle of an Endpoint".
However, the conformance test itself seems contradictory with some logic of endpoints controller:

  • The conformance test expects an Endpoint created without a Service to exist
  • The checkLeftoverEndpoints function of endpoint controller triggers syncing all Endpoints except for those having leader election annotation and will cause Endpoints without Services to be deleted:
    // checkLeftoverEndpoints lists all currently existing endpoints and adds their
    // service to the queue. This will detect endpoints that exist with no
    // corresponding service; these endpoints need to be deleted. We only need to
    // do this once on startup, because in steady-state these are detected (but
    // some stragglers could have been left behind if the endpoint controller
    // reboots).
    func (e *Controller) checkLeftoverEndpoints() {

The conformance test gives an impression that an user created Endpoints (without a Service) can exist but it will be deleted once kube-controller-manager restarts or fails over. Subscribing endpoints update events extends the cleanup logic of checkLeftoverEndpoints in a sense.

Copy link
Member Author

@tnqn tnqn Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the conformance test would fail as long as the controller watches endpoints update events.

I'm thinking another fix which doesn't affect the conformance test and could avoid doubling the work: we track the stale resource versions of endpoints after each successful endpoints update, use it to determine whether the endpoints in cache is stale when syncing it next time, and return error if it's stale and retry. (The reason why we don't track generation like endpointslice controller is endpoints doesn't have a generation.)

The 2nd commit implements the above fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to find the issue where this option was discussed and discarded several releases ago, @thockin @robscott do you remember?

I definitely remember what you're talking about, but not the details...

@tnqn tnqn force-pushed the fix-rapid-endpoints-update branch 3 times, most recently from 2a51e4f to ef15f35 Compare June 25, 2024 10:24
@tnqn
Copy link
Member Author

tnqn commented Jun 25, 2024

/retest

1 similar comment
@tnqn
Copy link
Member Author

tnqn commented Jun 26, 2024

/retest

@aojea
Copy link
Member

aojea commented Jun 27, 2024

@tnqn sorry for being very picky but everytime we change something here there are always unexpected edge cases, I think that adding an integration test that reproduces the problem, we can do multiple transitions in parallel if necessary to increase the chances of getting an stale cache entry, will give us more confidence

func TestEndpointUpdates(t *testing.T) {

@chymy
Copy link
Contributor

chymy commented Jun 28, 2024

/test pull-kubernetes-integration

@tnqn
Copy link
Member Author

tnqn commented Jun 28, 2024

@tnqn sorry for being very picky but everytime we change something here there are always unexpected edge cases, I think that adding an integration test that reproduces the problem, we can do multiple transitions in parallel if necessary to increase the chances of getting an stale cache entry, will give us more confidence

@aojea no problem, I will add one.

@tnqn tnqn force-pushed the fix-rapid-endpoints-update branch from ef15f35 to e90da46 Compare June 29, 2024 10:41
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 29, 2024
@tnqn tnqn force-pushed the fix-rapid-endpoints-update branch 2 times, most recently from ca449a4 to baffa09 Compare June 29, 2024 11:27
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 2, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

DetailsGit tree hash: 1d1d5e40825aad84a968121a5ff62190938b854d

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, tnqn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2024
@gauravkghildiyal
Copy link
Member

Thanks @tnqn. The proposed solution makes sense to me. Appreciate your provided explanation in #125675 (comment)

/lgtm

@chymy
Copy link
Contributor

chymy commented Jul 3, 2024

/lgtm
Our environment has happened several times.

@robscott
Copy link
Member

robscott commented Jul 3, 2024

Thanks @tnqn!

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024
@robscott
Copy link
Member

robscott commented Jul 3, 2024

Actually I guess @aojea said Friday, but I think we likely have enough reviewers that have signed off on this now, will defer to him.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024
@aojea
Copy link
Member

aojea commented Jul 3, 2024

Actually I guess @aojea said Friday, but I think we likely have enough reviewers that have signed off on this now, will defer to him.

/hold

/hold cancel

There are 3 additional reviews, more than enough 😄

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024
@k8s-ci-robot k8s-ci-robot merged commit 11c689b into kubernetes:master Jul 3, 2024
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jul 3, 2024
@tnqn tnqn deleted the fix-rapid-endpoints-update branch July 3, 2024 06:59
k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024
…-upstream-release-1.29

Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024
…-upstream-release-1.28

Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024
…-upstream-release-1.30

Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes
@liggitt
Copy link
Member

liggitt commented Sep 18, 2024

For visibility, backports of this regressed 1.30.3+, 1.29.7+, and 1.28.12+ (see #127370).

Make sure we stick to the criteria for backports. Backporting changes which are not regression fixes or critical bugfixes as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md#what-kind-of-prs-are-good-for-cherry-picks should be avoided. Every backport carries risk, and in this case, we made patch releases worse with a backport.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

The endpoint status does not update when the pod state changes rapidly.

8 participants