Labels: kind/bug, needs-triage, sig/network
Description
What happened?
Issue #125638 was supposed to fix the problem where Endpoints stay out of sync, but the endpoints controller is still stuck retrying:
I0807 14:01:51.613700 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.624576 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.645704 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.686942 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.768648 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.808043 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test2-qa" err="endpoints informer cache is out of date, resource version 10168250766 already processed for endpoints test1/test2-qa"
I0807 14:01:51.930345 2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
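For reference, here is a minimal sketch (assuming local kubeconfig access and the standard kubernetes Python client; the namespace/name and resource version come from the logs above) that prints the live resourceVersion of one of the affected Endpoints objects, so it can be compared against the version the controller claims to have already processed:

from kubernetes import client, config

# Assumes kubeconfig-based access to the affected cluster.
config.load_kube_config()
core = client.CoreV1Api()

# Namespace/name taken from the controller logs above.
ep = core.read_namespaced_endpoints(name="test-qa", namespace="test1")
print(f"live resourceVersion: {ep.metadata.resource_version}")
# The controller claims version 10168236546 was "already processed";
# if the live version differs, the informer cache never caught up.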
I also wrote a small script that reports the Endpoints that are out of sync with their EndpointSlices:
from datetime import datetime
import json

from kubernetes.client import CoreV1Api, DiscoveryV1Api
from hubspot_kube_utils.client import build_kube_client


def extract_ips_from_endpoint(endpoint):
    """Collect all ready and not-ready IPs from a v1 Endpoints object."""
    ips = set()
    if endpoint.subsets:
        for subset in endpoint.subsets:
            if subset.addresses:
                ips.update(addr.ip for addr in subset.addresses)
            if subset.not_ready_addresses:
                ips.update(addr.ip for addr in subset.not_ready_addresses)
    return ips


def extract_ips_from_endpoint_slice(endpoint_slice):
    """Collect all IPs from a discovery.k8s.io/v1 EndpointSlice."""
    if not endpoint_slice.endpoints:
        return set()
    return set(address for endpoint in endpoint_slice.endpoints
               for address in (endpoint.addresses or []))


def compare_endpoints_and_slices(core_client, discovery_client):
    """Return the services whose Endpoints and EndpointSlices disagree."""
    all_mismatches = []
    try:
        namespaces = core_client.list_namespace()
    except Exception as e:
        print(f"Error listing namespaces: {e}")
        return all_mismatches
    for ns in namespaces.items:
        namespace = ns.metadata.name
        print(f"Processing namespace: {namespace}")
        try:
            endpoints = core_client.list_namespaced_endpoints(namespace)
        except Exception as e:
            print(f"Error listing endpoints in namespace {namespace}: {e}")
            continue
        for endpoint in endpoints.items:
            name = endpoint.metadata.name
            try:
                # EndpointSlices are linked to their Service by this label.
                slices = discovery_client.list_namespaced_endpoint_slice(
                    namespace,
                    label_selector=f"kubernetes.io/service-name={name}")
            except Exception as e:
                print(f"Error listing endpoint slices for service {name} "
                      f"in namespace {namespace}: {e}")
                continue
            endpoint_ips = extract_ips_from_endpoint(endpoint)
            slice_ips = set()
            for endpoint_slice in slices.items:
                slice_ips.update(extract_ips_from_endpoint_slice(endpoint_slice))
            if endpoint_ips != slice_ips:
                all_mismatches.append({
                    "namespace": namespace,
                    "service_name": name,
                    "endpoint_ips": list(endpoint_ips),
                    "slice_ips": list(slice_ips),
                    "missing_in_endpoint": list(slice_ips - endpoint_ips),
                    "missing_in_slice": list(endpoint_ips - slice_ips),
                })
        print(f"Completed processing namespace: {namespace}")
        print("---")
    return all_mismatches


def save_to_json(data, cluster_name):
    """Dump the mismatch report to a timestamped JSON file."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{cluster_name}_mismatches_{timestamp}.json"
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"Mismatch data for cluster {cluster_name} saved to {filename}")


def main():
    clusters = ["test"]
    all_cluster_mismatches = {}
    for cluster_name in clusters:
        print(f"Processing cluster: {cluster_name}")
        try:
            kube_client = build_kube_client(host="TEST", token="TOKEN")
            core_client = CoreV1Api(kube_client)
            discovery_client = DiscoveryV1Api(kube_client)
            mismatches = compare_endpoints_and_slices(core_client, discovery_client)
            all_cluster_mismatches[cluster_name] = mismatches
            save_to_json(mismatches, cluster_name)
            print(f"Completed processing cluster: {cluster_name}")
            print(f"Total mismatches found in this cluster: {len(mismatches)}")
        except Exception as e:
            print(f"Error processing cluster {cluster_name}: {e}")


if __name__ == "__main__":
    main()
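For reference, each mismatch entry in the resulting JSON report has this shape (a hypothetical example; the IPs are illustrative, not from our cluster):

# One element of all_mismatches, as built by compare_endpoints_and_slices:
{
    "namespace": "test1",
    "service_name": "test-qa",
    "endpoint_ips": ["10.0.1.5"],
    "slice_ips": ["10.0.1.5", "10.0.2.9"],
    "missing_in_endpoint": ["10.0.2.9"],  # present in slices, absent from Endpoints
    "missing_in_slice": []
}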
What did you expect to happen?
I expect the Endpoints to eventually sync and reflect the most up-to-date information.
How can we reproduce it (as minimally and precisely as possible)?
I just deployed the newer patch release to our cluster, and since then Endpoints are never updated once their status goes out of sync.
Anything else we need to know?
No response
Kubernetes version
Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7
Cloud provider
OS version
almalinux-9
Install tools
Container runtime (CRI) and version (if applicable)
cri-o
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response