Proxy trying to connect to stale endpoint IP address #6842

@mkrutik

Bug Report

From time to time we run into connection problems. After enabling debug logs, we noticed that the proxy was trying to connect to outdated IP addresses.

What is the issue?

The proxy is trying to connect to the stale IP address of the endpoint.

How can it be reproduced?

We are still trying to reproduce it, but so far without success. One thing we've observed that could be a trigger for this problem is a rapid, large scale-down of the target pods, from ~10 instances to 2.
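
While attempting to reproduce this, one way to spot the condition is to compare the addresses that appear in the proxy debug logs against the addresses that are currently live for the service. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the sample log lines are invented for illustration (they are not real proxy output), and the set of live endpoint IPs is assumed to be collected separately.

```python
import ipaddress
import re

# Matches dotted-quad candidates; validity is checked separately below.
_IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ipv4(text):
    """Return the set of valid IPv4 addresses found anywhere in text."""
    found = set()
    for candidate in _IPV4_CANDIDATE.findall(text):
        try:
            found.add(str(ipaddress.IPv4Address(candidate)))
        except ipaddress.AddressValueError:
            # Looks like an address (e.g. "999.1.1.1") but is not a valid one.
            continue
    return found


def stale_addresses(log_lines, live_endpoint_ips, ignore=()):
    """Addresses seen in the logs that are neither live endpoints nor ignored."""
    seen = set()
    for line in log_lines:
        seen |= extract_ipv4(line)
    return seen - set(live_endpoint_ips) - set(ignore)


if __name__ == "__main__":
    # Invented sample lines, NOT actual proxy log output.
    sample = [
        "connecting to 10.0.0.5:8080",
        "connected to 10.0.0.7:8080 via 10.208.6.184",
    ]
    # Assumption: 10.0.0.7 is the only live endpoint; the service IP is ignored.
    print(stale_addresses(sample, {"10.0.0.7"}, ignore={"10.208.6.184"}))
```

Any address this reports is only a candidate: it may belong to another service or to the control plane, so each hit still needs to be checked by hand.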

Logs, error output, etc

Destination service IP: 10.208.6.184

Logs related to this service IP (unfortunately they are in JSON format, since I exported them from GCP).
I also have all the TRACE and DEBUG logs for the ~20 minutes after the problem occurred (~60K entries). I can share them too if you need them!
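
For anyone digging through the JSON export, the sketch below shows one possible way to narrow ~60K entries down to those that mention the service IP. It assumes the export is a single JSON array of entry objects, which may not match every export format. Because it walks every nested string, it does not depend on any particular field name. The helper names (`ip_pattern`, `mentions`, `filter_entries`, `load_entries`) are hypothetical.

```python
import json
import re


def ip_pattern(ip):
    """Regex matching ip only as a whole address, not inside a longer one."""
    # Reject a preceding digit or dot, and a following digit or ".digit".
    return re.compile(r"(?<![\d.])" + re.escape(ip) + r"(?!\.?\d)")


def mentions(value, pattern):
    """True if pattern matches any string nested anywhere inside value."""
    if isinstance(value, str):
        return pattern.search(value) is not None
    if isinstance(value, dict):
        return any(mentions(v, pattern) for v in value.values())
    if isinstance(value, list):
        return any(mentions(v, pattern) for v in value)
    return False


def filter_entries(entries, ip):
    """Keep only the entries that mention ip somewhere in their content."""
    pattern = ip_pattern(ip)
    return [entry for entry in entries if mentions(entry, pattern)]


def load_entries(path):
    """Load an export. Assumption: the file holds one JSON array of objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

With the export saved locally, `filter_entries(load_entries(path), "10.208.6.184")` would return only the entries that reference the destination service IP.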

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days

linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.20.8-gke.2100
  • Cluster Environment: GKE with kube-proxy
  • Linkerd version: stable-2.10.2

Possible solution

Additional context

⚠️ This has happened several times a month, on different pods.
