Still seeing the issue for endpoints staying out of sync #126578

@kedar700

Description

What happened?

Issue #125638 was supposed to have fixed the problem of Endpoints staying out of sync, but we are still seeing it:

I0807 14:01:51.613700       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.624576       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.645704       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.686942       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.768648       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
I0807 14:01:51.808043       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test2-qa" err="endpoints informer cache is out of date, resource version 10168250766 already processed for endpoints test1/test2-qa"
I0807 14:01:51.930345       2 endpoints_controller.go:348] "Error syncing endpoints, retrying" service="test1/test-qa" err="endpoints informer cache is out of date, resource version 10168236546 already processed for endpoints test1/test-qa"
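
For reference, here is a minimal sketch (assuming the stock kubernetes Python client and a kubeconfig context pointing at this cluster) of reading the live resourceVersion of one of the affected Endpoints objects, to compare against the "already processed" version in the log:

from kubernetes import client, config

# Assumption: the current kubeconfig context points at the cluster above.
config.load_kube_config()
core_v1 = client.CoreV1Api()

# Read the Endpoints object named in the controller log and print its current
# resourceVersion for comparison with the version the controller claims to
# have already processed.
ep = core_v1.read_namespaced_endpoints(name="test-qa", namespace="test1")
print(ep.metadata.resource_version)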

I also wrote a small script that compares each Endpoints object with its corresponding EndpointSlices and reports the ones that are out of sync:

from kubernetes.client import CoreV1Api, DiscoveryV1Api
from hubspot_kube_utils.client import build_kube_client
import json
from datetime import datetime

def extract_ips_from_endpoint(endpoint):
    """Collect ready and not-ready addresses from a v1 Endpoints object."""
    ips = set()
    if endpoint.subsets:
        for subset in endpoint.subsets:
            if subset.addresses:
                ips.update(addr.ip for addr in subset.addresses)
            if subset.not_ready_addresses:
                ips.update(addr.ip for addr in subset.not_ready_addresses)
    return ips

def extract_ips_from_endpoint_slice(endpoint_slice):
    """Collect all addresses from a discovery.k8s.io/v1 EndpointSlice."""
    if not endpoint_slice.endpoints:
        return set()
    return set(address for endpoint in endpoint_slice.endpoints
               for address in (endpoint.addresses or []))

def compare_endpoints_and_slices(core_client, discovery_client):
    """Return one record per service whose Endpoints and EndpointSlices disagree."""
    all_mismatches = []

    try:
        namespaces = core_client.list_namespace()
    except Exception as e:
        print(f"Error listing namespaces: {e}")
        return all_mismatches

    for ns in namespaces.items:
        namespace = ns.metadata.name
        print(f"Processing namespace: {namespace}")

        try:
            endpoints = core_client.list_namespaced_endpoints(namespace)
        except Exception as e:
            print(f"Error listing endpoints in namespace {namespace}: {e}")
            continue

        for endpoint in endpoints.items:
            name = endpoint.metadata.name

            try:
                slices = discovery_client.list_namespaced_endpoint_slice(namespace, label_selector=f"kubernetes.io/service-name={name}")
            except Exception as e:
                print(f"Error listing endpoint slices for service {name} in namespace {namespace}: {e}")
                continue

            endpoint_ips = extract_ips_from_endpoint(endpoint)
            slice_ips = set()

            for endpoint_slice in slices.items:
                slice_ips.update(extract_ips_from_endpoint_slice(endpoint_slice))

            if endpoint_ips != slice_ips:
                mismatch = {
                    "namespace": namespace,
                    "service_name": name,
                    "endpoint_ips": list(endpoint_ips),
                    "slice_ips": list(slice_ips),
                    "missing_in_endpoint": list(slice_ips - endpoint_ips),
                    "missing_in_slice": list(endpoint_ips - slice_ips)
                }
                all_mismatches.append(mismatch)

        print(f"Completed processing namespace: {namespace}")
        print("---")

    return all_mismatches

def save_to_json(data, cluster_name):
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{cluster_name}_mismatches_{timestamp}.json"

    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)

    print(f"Mismatch data for cluster {cluster_name} saved to {filename}")

def main():
    clusters = ["test"]
    all_cluster_mismatches = {}

    for cluster_name in clusters:
        print(f"Processing cluster: {cluster_name}")

        try:
            kube_client = build_kube_client(host="TEST",
                              token="TOKEN")

            core_client = CoreV1Api(kube_client)
            discovery_client = DiscoveryV1Api(kube_client)

            mismatches = compare_endpoints_and_slices(core_client, discovery_client)

            all_cluster_mismatches[cluster_name] = mismatches

            save_to_json(mismatches, cluster_name)

            print(f"Completed processing cluster: {cluster_name}")
            print(f"Total mismatches found in this cluster: {len(mismatches)}")
        except Exception as e:
            print(f"Error processing cluster {cluster_name}: {e}")


if __name__ == "__main__":
    main()
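
Note: build_kube_client is an internal helper. With only the stock kubernetes Python client, the same two API clients can be built from a kubeconfig, roughly like this (a sketch, assuming the current context points at the affected cluster):

from kubernetes import config
from kubernetes.client import ApiClient, CoreV1Api, DiscoveryV1Api

# Load credentials from the local kubeconfig instead of the internal helper.
config.load_kube_config()
api_client = ApiClient()
core_client = CoreV1Api(api_client)
discovery_client = DiscoveryV1Api(api_client)
mismatches = compare_endpoints_and_slices(core_client, discovery_client)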

What did you expect to happen?

I expect the Endpoints objects to eventually sync and reflect the most up-to-date information.

How can we reproduce it (as minimally and precisely as possible)?

I deployed the newer patch release to our cluster, and Endpoints objects whose status goes out of sync are never updated again.
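
One way to exercise the sync path while watching for the error above (a rough sketch, not our exact setup; it assumes a test Deployment named "web" in namespace "test1" selected by a Service of the same name) is to churn the pod set and then run the comparison script from above:

import time

from kubernetes import client, config

# Assumption: kubeconfig points at a test cluster with Deployment "web" in
# namespace "test1" backing a Service of the same name.
config.load_kube_config()
apps_v1 = client.AppsV1Api()

# Scale the Deployment up and down to force frequent Endpoints/EndpointSlice
# updates, then run compare_endpoints_and_slices() to look for drift.
for replicas in (5, 1, 5, 1):
    apps_v1.patch_namespaced_deployment_scale(
        name="web", namespace="test1",
        body={"spec": {"replicas": replicas}})
    time.sleep(30)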

Anything else we need to know?

No response

Kubernetes version

Client Version: v1.29.7
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.7

Cloud provider


OS version

almalinux-9

Install tools


Container runtime (CRI) and version (if applicable)

cri-o

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

Labels

kind/bug, needs-triage, sig/network
