What happened?
#125675 was merged into 1.31 and backported, regressing the following releases:
- 1.31.0+
- 1.30.3+
- 1.29.7+
- 1.28.12+
After an upgrade to 1.28 (1.28.13), we have had significant problems with Services that use a selector to target Pods in a Deployment. This appears to happen only with very large Services, those with 1000-2000 backing Pods. The Endpoints object eventually gets "stuck" with a huge number of target IPs: old Pod IPs are never removed from it, even long after the Pods are gone.
These are Knative Services, and they scale frequently and across a wide range of replica counts. Knative does still appear to read from the Endpoints API.
Deleting the Endpoints object causes it to be recreated, which instantly resolves the problems for downstream consumers that rely on the Endpoints API.
I can show the behavior, but it is difficult to reproduce without mimicking the scale and churn (scale out and scale in) of a real service.
Here I have a service, my-app, which is in the failed state where Endpoints are not being updated. Note: the Service we are concerned about here is my-app-00112-private.
% kubectl -n app get endpointslices | grep my-app-00
my-app-00112-4vk84 IPv4 8012,8112,8012 10.32.2.22,10.32.5.21,10.32.31.20 + 997 more... 40m
my-app-00112-private-5662t IPv4 8012,8022,8112 + 3 more... 10.32.86.67,10.32.86.68 34m
my-app-00112-private-9mdgr IPv4 8012,8022,8112 + 3 more... 10.32.210.18,10.32.210.12,10.32.210.32 + 3 more... 36m
my-app-00112-private-kzkxb IPv4 8012,8022,8112 + 3 more... 10.32.222.22,10.32.222.21,10.32.222.29 36m
my-app-00112-private-mrpt4 IPv4 8012,8022,8112 + 3 more... 10.32.217.26,10.32.86.63,10.32.86.64 + 12 more... 36m
my-app-00112-private-qnd6m IPv4 8012,8022,8112 + 3 more... 10.32.139.22,10.32.139.16,10.32.139.20 37m
my-app-00112-private-swrhv IPv4 8012,8022,8112 + 3 more... 10.32.85.54,10.32.85.57,10.32.85.55 34m
my-app-00112-private-xm2w7 IPv4 8012,8022,8112 + 3 more... 10.32.10.180,10.32.96.167,10.32.200.143 + 4 more... 40m
my-app-00112-private-zlp44 IPv4 8012,8022,8112 + 3 more... 10.32.217.21,10.32.217.29,10.32.217.16 + 6 more... 36m
And if we look at the scale of that Deployment:
% kubectl -n app get deploy my-app-00112-deployment
NAME READY UP-TO-DATE AVAILABLE AGE
my-app-00112-deployment 28/28 28 28 40m
This seems correct. But when we look at the Endpoints, it is a much different story:
% kubectl -n app get endpoints my-app-00112-private
NAME ENDPOINTS AGE
my-app-00112-private 10.32.0.70:9091,10.32.10.180:9091,10.32.101.10:9091 + 5907 more... 40m
Nearly 6000 endpoints; at about 6 ports per pod (5 of them added by Knative), that is roughly 1000 pods, which is the over-capacity limit for the Endpoints API. In fact, we do see this annotation:
Annotations: endpoints.kubernetes.io/over-capacity: truncated
But that annotation is supposed to go away once the Pod count falls back below 1000. Since this Service uses a selector and is, I believe, managed by the Endpoints / EndpointSlice controllers, the Endpoints object comes right back if it is deleted, effectively forcing a fresh reconciliation from EndpointSlices to Endpoints. And that is exactly what happens.
% kubectl -n app delete endpoints my-app-00112-private
endpoints "my-app-00112-private" deleted
% kubectl -n app get endpoints my-app-00112-private
NAME ENDPOINTS AGE
my-app-00112-private 10.32.0.70:9091,10.32.10.180:9091,10.32.160.111:9091 + 81 more... 4s
So it does seem like EndpointSlices -> Endpoints reconciliation is broken in some fashion, under these conditions.
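For anyone checking whether a Service is in the same state, this is roughly what I look at: the over-capacity annotation and the last-change-trigger-time annotation on the Endpoints object. Both annotation keys are the standard ones written by the endpoints controller; the namespace and object name are specific to my cluster.
# prints "truncated" while the object is in the state above
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/over-capacity}{"\n"}'
# shows when the endpoints controller last reacted to a change; in the stuck
# state I would expect this to lag far behind the most recent Pod churn
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.metadata.annotations.endpoints\.kubernetes\.io/last-change-trigger-time}{"\n"}'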
What did you expect to happen?
Endpoints should be updated when EndpointSlices are changed, even on scale-in operations where Pods are removed.
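A rough way to see the divergence is to compare unique Pod addresses in the EndpointSlices against the Endpoints object. The kubernetes.io/service-name label is the standard label tying EndpointSlices to their Service; the namespace and names here are from my cluster, and the Endpoints count is capped by the 1000-address truncation.
# unique addresses across the Service's EndpointSlices
kubectl -n app get endpointslices -l kubernetes.io/service-name=my-app-00112-private \
  -o jsonpath='{.items[*].endpoints[*].addresses[*]}' | tr ' ' '\n' | sort -u | wc -l
# unique addresses in the Endpoints object
kubectl -n app get endpoints my-app-00112-private \
  -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sort -u | wc -l
When the bug is active, the second number stays stuck on stale addresses while the first tracks the actual Pods.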
How can we reproduce it (as minimally and precisely as possible)?
It is very difficult for me to provide a minimal reproduction. It happens with large Knative services, which are essentially autoscaled Deployments behind a Service.
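A sketch of what I would try in order to mimic the churn is below. The image, names, ports, and replica counts are placeholders rather than what Knative actually deploys; the point is a Service with a selector over a very large Deployment that is scaled out and in repeatedly.
# a large Deployment plus a Service that selects it
kubectl create deployment churn-app --image=registry.k8s.io/pause:3.9 --replicas=1200
kubectl expose deployment churn-app --name=churn-app-private --port=8012
# scale out and in repeatedly to mimic autoscaler churn
for i in $(seq 1 20); do
  kubectl scale deployment churn-app --replicas=1500
  sleep 120
  kubectl scale deployment churn-app --replicas=200
  sleep 120
done
# afterwards, compare the Endpoints object against the EndpointSlices for stale Pod IPs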
Anything else we need to know?
I see audit logs where endpoint-controller/kube-system is making frequent use of the permission io.k8s.core.v1.endpoints.update on that Endpoints resource. But then the audit logs abruptly stop, which seems to indicate that the controller is no longer even attempting to update the Endpoints resource. This appears to happen after a particularly rapid burst of calls to update the Endpoints: anywhere from 500ms to 5s apart, roughly 15-20 times in a minute.
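For reference, this is roughly the kind of Cloud Logging query I used to watch those audit entries and see them stop. The methodName is the one quoted above; the audit-log field names are the standard ones I assume GKE uses, and the resourceName substring is from my cluster.
gcloud logging read \
  'protoPayload.methodName="io.k8s.core.v1.endpoints.update" AND protoPayload.resourceName:"endpoints/my-app-00112-private"' \
  --limit=100 \
  --format='value(timestamp,protoPayload.authenticationInfo.principalEmail)'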
Kubernetes version
Details
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.13-gke.1119000
Cloud provider
Details
GKE
OS version
Details
Google COS