What happened?
#125675 was merged into 1.31 and backported, regressing the following releases:
- 1.31.0+
- 1.30.3+
- 1.29.7+
- 1.28.12+
After an upgrade to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This appears to happen only with very large Services, those with 1000-2000 backing Pods. The Endpoints object eventually gets "stuck" with a huge number of target IPs: old Pod IPs are never removed from it, even long after the Pods are gone.
These are Knative Services, and they scale frequently and across a wide range of replica counts. Knative does still appear to read from the Endpoints API.
Deleting the Endpoints object causes it to be recreated, which instantly resolves the problems for downstream consumers that rely on the Endpoints API.
I can show the behavior, but it is difficult to reproduce without mimicking the scale and churn (scale out and scale in) of a real service.
Here I have a service, my-app, which is in the failed state where Endpoints are not being updated. Note: the Service we are concerned about here is my-app-00112-private.
% kubectl -n app get endpointslices | grep my-app-00
my-app-00112-4vk84 IPv4 8012,8112,8012 10.32.2.22,10.32.5.21,10.32.31.20 + 997 more... 40m
my-app-00112-private-5662t IPv4 8012,8022,8112 + 3 more... 10.32.86.67,10.32.86.68 34m
my-app-00112-private-9mdgr IPv4 8012,8022,8112 + 3 more... 10.32.210.18,10.32.210.12,10.32.210.32 + 3 more... 36m
my-app-00112-private-kzkxb IPv4 8012,8022,8112 + 3 more... 10.32.222.22,10.32.222.21,10.32.222.29 36m
my-app-00112-private-mrpt4 IPv4 8012,8022,8112 + 3 more... 10.32.217.26,10.32.86.63,10.32.86.64 + 12 more... 36m
my-app-00112-private-qnd6m IPv4 8012,8022,8112 + 3 more... 10.32.139.22,10.32.139.16,10.32.139.20 37m
my-app-00112-private-swrhv IPv4 8012,8022,8112 + 3 more... 10.32.85.54,10.32.85.57,10.32.85.55 34m
my-app-00112-private-xm2w7 IPv4 8012,8022,8112 + 3 more... 10.32.10.180,10.32.96.167,10.32.200.143 + 4 more... 40m
my-app-00112-private-zlp44 IPv4 8012,8022,8112 + 3 more... 10.32.217.21,10.32.217.29,10.32.217.16 + 6 more... 36m
And if we look at the scale of that Deployment:
% kubectl -n app get deploy my-app-00112-deployment
NAME READY UP-TO-DATE AVAILABLE AGE
my-app-00112-deployment 28/28 28 28 40m
This seems correct. But when we look at the Endpoints, it is a much different story:
% kubectl -n app get endpoints my-app-00112-private
NAME ENDPOINTS AGE
my-app-00112-private 10.32.0.70:9091,10.32.10.180:9091,10.32.101.10:9091 + 5907 more... 40m
Nearly 6000 endpoints; at about 6 ports per pod (5 of them added by Knative), that is roughly 1000 pods, which is the over-capacity limit for the Endpoints API. In fact, we do see this annotation:
Annotations: endpoints.kubernetes.io/over-capacity: truncated
But that annotation is supposed to go away once the Pod count falls back below 1000. Since this Service uses a selector and is, I believe, managed by the Endpoints / EndpointSlice controllers, the Endpoints object comes right back if it is deleted, effectively forcing a fresh reconciliation from EndpointSlices to Endpoints. And that is exactly what happens.
% kubectl -n app delete endpoints my-app-00112-private
endpoints "my-app-00112-private" deleted
% kubectl -n app get endpoints my-app-00112-private
NAME ENDPOINTS AGE
my-app-00112-private 10.32.0.70:9091,10.32.10.180:9091,10.32.160.111:9091 + 81 more... 4s
So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.
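For anyone checking whether a Service is in the same state, this is roughly what I look at: the over-capacity annotation and the last-change-trigger-time annotation on the Endpoints object. Both annotation keys are the standard ones written by the endpoints controller; the namespace and object name are specific to my cluster.
# prints "truncated" while the object is in the state above
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/over-capacity}{"\n"}'
# shows when the endpoints controller last reacted to a change; in the stuck
# state I would expect this to lag far behind the most recent Pod churn
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/last-change-trigger-time}{"\n"}'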
What did you expect to happen?
Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.
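A rough way to see the divergence is to compare unique Pod addresses in the EndpointSlices against the Endpoints object. The kubernetes.io/service-name label is the standard label tying EndpointSlices to their Service; the namespace and names here are from my cluster, and the Endpoints count is capped by the 1000-address truncation.
# unique addresses across the Service's EndpointSlices
kubectl -n app get endpointslices -l kubernetes.io/service-name=my-app-00112-private \
  -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | tr ' ' '\n' | sort -u | wc -l
# unique addresses in the Endpoints object
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sort -u | wc -l
When the bug is active, the second number stays stuck on stale addresses while the first tracks the actual Pods.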
How can we reproduce it (as minimally and precisely as possible)?
It is very difficult for me to provide a minimal reproduction. It happens with large Knative services, which are essentially autoscaled Deployments behind a Service.
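A sketch of what I would try in order to mimic the churn is below. The image, names, ports, and replica counts are placeholders rather than what Knative actually deploys; the point is a Service with a selector over a very large Deployment that is scaled out and in repeatedly.
# a large Deployment plus a Service that selects it
kubectl create deployment churn-app --image=registry.k8s.io/pause:3.9 --replicas=1200
kubectl expose deployment churn-app --name=churn-app-private --port=8012
# scale out and in repeatedly to mimic autoscaler churn
for i in $(seq 1 20); do
  kubectl scale deployment churn-app --replicas=1500
  sleep 120
  kubectl scale deployment churn-app --replicas=200
  sleep 120
done
# afterwards, compare the Endpoints object against the EndpointSlices for stale Pod IPs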
Anything else we need to know?
I see audit logs where endpoint-controller/kube-system is making frequent use of the permission io.k8s.core.v1.endpoints.update on that Endpoints resource. But then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid burst of calls to update the Endpoints: anywhere from 500ms to 5s apart, roughly 15-20 times in a minute.
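For reference, this is roughly the kind of Cloud Logging query I used to watch those audit entries and see them stop. The methodName is the one quoted above; the audit-log field names are the standard ones I assume GKE uses, and the resourceName substring is from my cluster.
gcloud logging read \
  'protoPayload.methodName="io.k8s.core.v1.endpoints.update" AND protoPayload.resourceName:"endpoints/my-app-00112-private"' \
  --limit=100 \
  --format='value(timestamp,protoPayload.authenticationInfo.principalEmail)'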
Kubernetes version
Details
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.13-gke.1119000
Cloud provider
Details
GKE
OS version
Details
Google COS