
Endpoints do not reconcile with EndpointSlices for Services with selector #127370

@mbrancato

What happened?

#125675 was merged into 1.31 and then backported, regressing the following releases:

  • 1.31.0+
  • 1.30.3+
  • 1.29.7+
  • 1.28.12+

After an upgrade to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This likely happens only with very large services, with 1000-2000 backing Pods. The Endpoints eventually appear to be "stuck" with LOTS of IPs as targets (these are large services), and old Pod IPs are not removed from the Endpoints object, even long after the Pods are gone.

These services are Knative Services, and are operating under frequent and wide scaling. It does seem like Knative is still reading from the Endpoints API.

Deleting the Endpoints object causes it to be recreated and instantly resolves the problems for downstream consumers that rely on the Endpoints API.

I can show the behavior, but it is difficult to reproduce without mimicking the scale and churn (scale out and scale in) of a real service.

Here, I have a service my-app - it is in the failed state where the Endpoints are not being updated. Note that the source Service we are concerned about here is my-app-00112-private.

% kubectl -n app get endpointslices | grep my-app-00               
my-app-00112-4vk84                   IPv4          8012,8112,8012               10.32.2.22,10.32.5.21,10.32.31.20 + 997 more...         40m
my-app-00112-private-5662t           IPv4          8012,8022,8112 + 3 more...   10.32.86.67,10.32.86.68                                 34m
my-app-00112-private-9mdgr           IPv4          8012,8022,8112 + 3 more...   10.32.210.18,10.32.210.12,10.32.210.32 + 3 more...      36m
my-app-00112-private-kzkxb           IPv4          8012,8022,8112 + 3 more...   10.32.222.22,10.32.222.21,10.32.222.29                  36m
my-app-00112-private-mrpt4           IPv4          8012,8022,8112 + 3 more...   10.32.217.26,10.32.86.63,10.32.86.64 + 12 more...       36m
my-app-00112-private-qnd6m           IPv4          8012,8022,8112 + 3 more...   10.32.139.22,10.32.139.16,10.32.139.20                  37m
my-app-00112-private-swrhv           IPv4          8012,8022,8112 + 3 more...   10.32.85.54,10.32.85.57,10.32.85.55                     34m
my-app-00112-private-xm2w7           IPv4          8012,8022,8112 + 3 more...   10.32.10.180,10.32.96.167,10.32.200.143 + 4 more...     40m
my-app-00112-private-zlp44           IPv4          8012,8022,8112 + 3 more...   10.32.217.21,10.32.217.29,10.32.217.16 + 6 more...      36m

And if we look at the scale of that Deployment:

% kubectl -n app get deploy my-app-00112-deployment                
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
my-app-00112-deployment   28/28   28           28          40m

This seems correct. But when we look at the Endpoints, it is a much different story:

% kubectl -n app get endpoints my-app-00112-private
NAME                      ENDPOINTS                                                            AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.101.10:9091 + 5907 more...   40m

Nearly 6000 endpoints - at about 6 ports per pod (5 of which come from Knative, basically), that is roughly 1000 pods, the over-capacity limit for the Endpoints API. In fact, we DO see this annotation:

Annotations:  endpoints.kubernetes.io/over-capacity: truncated
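(As an aside, the annotation can be read directly with jsonpath - the dots inside the annotation key need escaping; the namespace and object name are the ones from above:)

# print the value of the over-capacity annotation, if present
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/over-capacity}'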

But that is supposed to come down once the pod count falls back below 1000. Since this Service uses a selector, and the Endpoints object is (I think) managed by the Endpoints / EndpointSlices controllers, it comes right back if it is deleted, effectively forcing a fresh reconciliation from EndpointSlices to Endpoints. And that is exactly what happens.

% kubectl -n app delete endpoints my-app-00112-private
endpoints "my-app-00112-private" deleted
% kubectl -n app get endpoints my-app-00112-private     
NAME                      ENDPOINTS                                                           AGE
my-app-00112-private   10.32.0.70:9091,10.32.10.180:9091,10.32.160.111:9091 + 81 more...   4s

So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.
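One rough way to quantify the drift is to count unique addresses on both sides; in a healthy state the two numbers should roughly match. This is just a sketch - it assumes the slices carry the standard kubernetes.io/service-name label and that each endpoint has a single address:

# unique pod IPs across all EndpointSlices for the Service
kubectl -n app get endpointslices \
  -l kubernetes.io/service-name=my-app-00112-private \
  -o jsonpath='{range .items[*].endpoints[*]}{.addresses[0]}{"\n"}{end}' | sort -u | wc -l

# unique pod IPs in the Endpoints object for the same Service
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}' | sort -u | wc -l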

What did you expect to happen?

Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.

How can we reproduce it (as minimally and precisely as possible)?

It is very difficult for me to provide clear reproduction details, but it happens with large Knative services, which are essentially autoscaled Deployments backing a Service.
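As a rough sketch of what I mean (no Knative involved, placeholder names, and the cluster obviously needs capacity for the pod count): create a selector-backed Service, scale the Deployment past the 1000-address Endpoints cap, then scale back in and watch whether the Endpoints object ever shrinks again.

# placeholder Deployment + Service with a selector
kubectl create deployment churn-test --image=nginx --replicas=30
kubectl expose deployment churn-test --port=80

# scale well past the 1000-address Endpoints cap, then back in
kubectl scale deployment churn-test --replicas=1200
kubectl rollout status deployment churn-test
kubectl scale deployment churn-test --replicas=30

# both the over-capacity annotation and the address count should come back down
kubectl get endpoints churn-test -o jsonpath='{.metadata.annotations}'
kubectl get endpoints churn-test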

Anything else we need to know?

I see audit logs where endpoint-controller/kube-system is making frequent use of the io.k8s.core.v1.endpoints.update permission on that Endpoints resource. But then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid series of calls to update the endpoints - anywhere from 500ms to 5s apart, about 15-20 times in a minute.
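A cheaper way to observe the same thing without audit logs is to watch the object's resourceVersion and see whether writes simply stop while the Deployment keeps churning (a sketch; the column names are arbitrary):

# if the controller has given up, resourceVersion stops changing even while Pods come and go
kubectl -n app get endpoints my-app-00112-private -w \
  -o custom-columns=NAME:.metadata.name,RV:.metadata.resourceVersion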

Kubernetes version

Details
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.13-gke.1119000

Cloud provider

Details

GKE

OS version

Details

Google COS

Install tools

Details

Container runtime (CRI) and version (if applicable)

Details

Related plugins (CNI, CSI, ...) and versions (if applicable)

Details


Labels

kind/bug, kind/regression, needs-triage, priority/important-soon, sig/network
