Fix endpoints status out-of-sync when the pod state changes rapidly #125675

tnqn · 2024-06-24T18:07:13Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

When Pod state changes rapidly, endpoints controller may use outdated informer cache to sync Service. If the outdated endpoints appear to be expected by the controller, it skips updating it.

The commit fixes it by checking if endpoints informer cache is outdated when processing a service. If the endpoints is stale, it returns an error and retries later.

Which issue(s) this PR fixes:

Fixes #125638

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Fix endpoints status out-of-sync when the pod state changes rapidly

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2024-06-24T18:07:21Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

aojea · 2024-06-24T18:25:49Z

pkg/controller/endpoint/endpoints_controller.go

 	e.podsSynced = podInformer.Informer().HasSynced

 	endpointsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
+		UpdateFunc: func(old, cur interface{}) {


I'm trying to find the issue where this option was discussed and discarded several releases ago, @thockin @robscott do you remember?

This fixes the issue because it process its own update, we are already doing it on other controllers, but it means we are going to do double the work now for each endpoint update

I found this change would break a conformance test: "should test the lifecycle of an Endpoint".
However, the conformance test itself seems contradictory with some logic of endpoints controller:

The conformance test expects an Endpoint created without a Service to exist

The checkLeftoverEndpoints function of endpoint controller triggers syncing all Endpoints except for those having leader election annotation and will cause Endpoints without Services to be deleted:

kubernetes/pkg/controller/endpoint/endpoints_controller.go

Lines 561 to 567 in 921fb0b

// checkLeftoverEndpoints lists all currently existing endpoints and adds their

// service to the queue. This will detect endpoints that exist with no

// corresponding service; these endpoints need to be deleted. We only need to

// do this once on startup, because in steady-state these are detected (but

// some stragglers could have been left behind if the endpoint controller

// reboots).

func (e *Controller) checkLeftoverEndpoints() {

The conformance test gives an impression that an user created Endpoints (without a Service) can exist but it will be deleted once kube-controller-manager restarts or fails over. Subscribing endpoints update events extends the cleanup logic of checkLeftoverEndpoints in a sense.

It seems the conformance test would fail as long as the controller watches endpoints update events.

I'm thinking another fix which doesn't affect the conformance test and could avoid doubling the work: we track the stale resource versions of endpoints after each successful endpoints update, use it to determine whether the endpoints in cache is stale when syncing it next time, and return error if it's stale and retry. (The reason why we don't track generation like endpointslice controller is endpoints doesn't have a generation.)

The 2nd commit implements the above fix.

I'm trying to find the issue where this option was discussed and discarded several releases ago, @thockin @robscott do you remember?

I definitely remember what you're talking about, but not the details...

tnqn · 2024-06-25T12:34:35Z

/retest

tnqn · 2024-06-26T11:06:16Z

/retest

aojea · 2024-06-27T20:51:56Z

@tnqn sorry for being very picky but everytime we change something here there are always unexpected edge cases, I think that adding an integration test that reproduces the problem, we can do multiple transitions in parallel if necessary to increase the chances of getting an stale cache entry, will give us more confidence

kubernetes/test/integration/endpoints/endpoints_test.go

Line 38 in 742b2f7

func TestEndpointUpdates(t *testing.T) {

chymy · 2024-06-28T06:48:47Z

/test pull-kubernetes-integration

tnqn · 2024-06-28T08:49:01Z

@tnqn sorry for being very picky but everytime we change something here there are always unexpected edge cases, I think that adding an integration test that reproduces the problem, we can do multiple transitions in parallel if necessary to increase the chances of getting an stale cache entry, will give us more confidence

@aojea no problem, I will add one.

test/integration/endpoints/endpoints_test.go

k8s-ci-robot · 2024-07-02T12:13:18Z

LGTM label has been added.

Details

Git tree hash: 1d1d5e40825aad84a968121a5ff62190938b854d

k8s-ci-robot · 2024-07-02T12:13:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, tnqn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/controller/endpoint/OWNERS~~ [aojea]
~~test/OWNERS~~ [aojea]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gauravkghildiyal · 2024-07-02T23:37:46Z

Thanks @tnqn. The proposed solution makes sense to me. Appreciate your provided explanation in #125675 (comment)

/lgtm

chymy · 2024-07-03T00:48:52Z

/lgtm
Our environment has happened several times.

robscott · 2024-07-03T05:43:35Z

Thanks @tnqn!

/hold cancel

robscott · 2024-07-03T05:44:18Z

Actually I guess @aojea said Friday, but I think we likely have enough reviewers that have signed off on this now, will defer to him.

/hold

aojea · 2024-07-03T06:02:38Z

Actually I guess @aojea said Friday, but I think we likely have enough reviewers that have signed off on this now, will defer to him.

/hold

/hold cancel

There are 3 additional reviews, more than enough 😄

…-upstream-release-1.29 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

…-upstream-release-1.28 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

…-upstream-release-1.30 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

liggitt · 2024-09-18T16:39:01Z

For visibility, backports of this regressed 1.30.3+, 1.29.7+, and 1.28.12+ (see #127370).

Make sure we stick to the criteria for backports. Backporting changes which are not regression fixes or critical bugfixes as described in https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md#what-kind-of-prs-are-good-for-cherry-picks should be avoided. Every backport carries risk, and in this case, we made patch releases worse with a backport.

k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 24, 2024

tnqn mentioned this pull request Jun 24, 2024

The endpoint status does not update when the pod state changes rapidly. #125638

Closed

k8s-ci-robot requested review from andrewsykim and danwinship June 24, 2024 18:12

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 24, 2024

aojea reviewed Jun 24, 2024

View reviewed changes

tnqn force-pushed the fix-rapid-endpoints-update branch 3 times, most recently from 2a51e4f to ef15f35 Compare June 25, 2024 10:24

tnqn force-pushed the fix-rapid-endpoints-update branch from ef15f35 to e90da46 Compare June 29, 2024 10:41

k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 29, 2024

tnqn force-pushed the fix-rapid-endpoints-update branch 2 times, most recently from ca449a4 to baffa09 Compare June 29, 2024 11:27

aojea reviewed Jun 29, 2024

View reviewed changes

test/integration/endpoints/endpoints_test.go Show resolved Hide resolved

aojea reviewed Jun 29, 2024

View reviewed changes

test/integration/endpoints/endpoints_test.go Show resolved Hide resolved

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Jul 2, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 2, 2024

k8s-ci-robot assigned gauravkghildiyal Jul 2, 2024

k8s-ci-robot assigned chymy Jul 3, 2024

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2024

k8s-ci-robot merged commit 11c689b into kubernetes:master Jul 3, 2024

k8s-ci-robot added this to the v1.31 milestone Jul 3, 2024

tnqn deleted the fix-rapid-endpoints-update branch July 3, 2024 06:59

aojea mentioned this pull request Jul 3, 2024

Endpoint / Endpointslice objects become stuck in state where IP condition differs (not ready vs ready) #117193

Closed

k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024

Merge pull request #125858 from tnqn/automated-cherry-pick-of-#125675…

ff0b3bd

…-upstream-release-1.29 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024

Merge pull request #125859 from tnqn/automated-cherry-pick-of-#125675…

f529a46

…-upstream-release-1.28 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

k8s-ci-robot added a commit that referenced this pull request Jul 11, 2024

Merge pull request #125857 from tnqn/automated-cherry-pick-of-#125675…

20597ff

…-upstream-release-1.30 Automated cherry pick of #125675: Fix endpoints status out-of-sync when the pod state changes

This was referenced Aug 28, 2024

Still seeing the issue for endpoints staying out of sync #126578

Closed

[WIP]fix an edge case where endpoints might be incorrectly marked as stale #127103

Closed

mbrancato mentioned this pull request Sep 15, 2024

Endpoints do not reconcile with EndpointSlices for Services with selector #127370

Closed

aojea mentioned this pull request Sep 17, 2024

bugfix: endpoints controller track resource version conrrectly #127417

Merged

	// checkLeftoverEndpoints lists all currently existing endpoints and adds their
	// service to the queue. This will detect endpoints that exist with no
	// corresponding service; these endpoints need to be deleted. We only need to
	// do this once on startup, because in steady-state these are detected (but
	// some stragglers could have been left behind if the endpoint controller
	// reboots).
	func (e *Controller) checkLeftoverEndpoints() {

Fix endpoints status out-of-sync when the pod state changes rapidly #125675

Fix endpoints status out-of-sync when the pod state changes rapidly #125675

Conversation

tnqn commented Jun 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

k8s-ci-robot commented Jun 24, 2024

Uh oh!

aojea Jun 24, 2024

Choose a reason for hiding this comment

Uh oh!

tnqn Jun 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tnqn Jun 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danwinship Jun 26, 2024

Choose a reason for hiding this comment

Uh oh!

tnqn commented Jun 25, 2024

Uh oh!

tnqn commented Jun 26, 2024

Uh oh!

aojea commented Jun 27, 2024

Uh oh!

chymy commented Jun 28, 2024

Uh oh!

tnqn commented Jun 28, 2024

Uh oh!

Uh oh!

Uh oh!

k8s-ci-robot commented Jul 2, 2024

Uh oh!

k8s-ci-robot commented Jul 2, 2024

Uh oh!

gauravkghildiyal commented Jul 2, 2024

Uh oh!

chymy commented Jul 3, 2024

Uh oh!

robscott commented Jul 3, 2024

Uh oh!

robscott commented Jul 3, 2024

Uh oh!

aojea commented Jul 3, 2024

Uh oh!

liggitt commented Sep 18, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

tnqn commented Jun 24, 2024 •

edited

Loading

tnqn Jun 25, 2024 •

edited

Loading

tnqn Jun 25, 2024 •

edited

Loading