linkerd-proxy using stale endpoints #11480

@bruecktech

What is the issue?

Since updating to 2.13.5, we have seen sporadic issues where the linkerd-proxy appears to connect to endpoints that have not existed for days.
We think this is triggered by a failure to connect to linkerd-destination (which is a separate problem).
Restarting linkerd-destination resolves the issue.
We also compared the endpoints from linkerd diagnostics endpoints myservice... with those from kubectl get endpoints myservice, and they appear to match. We therefore think the stale data is held by the proxies rather than by linkerd-destination.
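
The comparison described above can be automated. The following is a minimal sketch, not something we ran during the incident: it assumes the JSON shape of kubectl get endpoints myservice -o json (subsets[].addresses[].ip), and the function names and the 10.250.1.x sample addresses are made up for illustration.

```python
import json


def endpoint_ips(endpoints_json):
    """Collect ready endpoint IPs from `kubectl get endpoints <name> -o json` output."""
    doc = json.loads(endpoints_json)
    ips = set()
    # `subsets` is absent when the Service has no endpoints at all.
    for subset in doc.get("subsets") or []:
        # Only `addresses` (ready) is read; `notReadyAddresses` is ignored.
        for address in subset.get("addresses") or []:
            ips.add(address["ip"])
    return ips


def stale_addresses(proxy_ips, endpoints_json):
    """Return the IPs a proxy still uses that Kubernetes no longer lists."""
    return set(proxy_ips) - endpoint_ips(endpoints_json)


if __name__ == "__main__":
    # Hypothetical Endpoints object; the IPs in it are invented for the demo.
    sample = json.dumps(
        {"kind": "Endpoints", "subsets": [{"addresses": [{"ip": "10.250.1.10"}, {"ip": "10.250.1.11"}]}]}
    )
    # 10.250.162.250 is the endpoint address from the proxy logs in this issue.
    print(sorted(stale_addresses({"10.250.162.250", "10.250.1.10"}, sample)))
```

Any address this returns is one the proxy still holds although Kubernetes no longer lists it, which is the symptom described here.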

The affected proxies did not report any pending endpoints after the incident, but we currently don't have the data to show what this looked like during the incident:

❯ curl -s localhost:8000/metrics | grep endpoints | grep myservice
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="prod",parent_name="myservice",parent_port="80",parent_section_name="",backend_group="",backend_kind="default",backend_namespace="",backend_name="service",backend_port="",backend_section_name="",endpoint_state="pending"} 0
outbound_http_balancer_endpoints{parent_group="core",parent_kind="Service",parent_namespace="prod",parent_name="myservice",parent_port="80",parent_section_name="",backend_group="",backend_kind="default",backend_namespace="",backend_name="service",backend_port="",backend_section_name="",endpoint_state="ready"} 419
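
To capture these numbers during the next incident, the metric could be scraped periodically and summarized per state. A rough sketch, assuming the Prometheus text format shown above; the function name is hypothetical, the sample lines are abbreviated from the output above, and the label parsing is deliberately simple (it does not handle escaped quotes inside label values).

```python
import re
from collections import defaultdict

# Matches one sample line of the metric and captures its label set and value.
SAMPLE_RE = re.compile(r'^outbound_http_balancer_endpoints\{(.*)\}\s+(\S+)')
# Matches individual key="value" label pairs.
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def balancer_endpoints(metrics_text, parent_name):
    """Sum outbound_http_balancer_endpoints per endpoint_state for one parent."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # comments, other metrics, blank lines
        labels = dict(LABEL_RE.findall(match.group(1)))
        if labels.get("parent_name") == parent_name:
            totals[labels.get("endpoint_state", "")] += float(match.group(2))
    return dict(totals)


if __name__ == "__main__":
    # Abbreviated versions of the two lines above (most labels dropped).
    sample = "\n".join([
        'outbound_http_balancer_endpoints{parent_name="myservice",endpoint_state="pending"} 0',
        'outbound_http_balancer_endpoints{parent_name="myservice",endpoint_state="ready"} 419',
    ])
    print(balancer_endpoints(sample, "myservice"))
```

Feeding it the body of curl -s localhost:8000/metrics at regular intervals would show whether the ready count diverges from the number of addresses Kubernetes reports.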

How can it be reproduced?

Not clear. We updated to 2.13.5 and have since had 3 incidents over the course of a few days.

Logs, error output, etc

{
		"message": "HTTP/1.1 request failed",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"name": "rescue",
					"client": {
						"addr": "10.250.154.125:56774"
					}
				}
			],
			"level": "INFO",
			"fields": {
				"error": "logical service myservice.prod.svc.cluster.local:80: Service.myservice:80: endpoint 10.250.162.250:80: operation was canceled: connection was not ready"
			},
			"timestamp": "[104343.356432s]",
			"target": "linkerd_app_core::errors::respond"
		}
}
{
		"message": "Unexpected error",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"name": "rescue",
					"client": {
						"addr": "10.250.154.125:56774"
					}
				}
			],
			"level": "WARN",
			"fields": {
				"error": "logical service myservice.prod.svc.cluster.local:80: Service.prod.myservice:80: endpoint 10.250.162.250:80: operation was canceled: connection was not ready"
			},
			"timestamp": "[104343.356447s]",
			"target": "linkerd_app_outbound::http::server"
		}
}
{
		"message": "Service failed",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"ns": "prod",
					"port": "80",
					"name": "service"
				},
				{
					"name": "endpoint",
					"addr": "10.250.162.250:80"
				}
			],
			"level": "WARN",
			"fields": {
				"error": "channel closed"
			},
			"timestamp": "[104343.925284s]",
			"target": "linkerd_reconnect"
		}
}
{
		"message": "Failed to connect",
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "172.20.207.194:80"
				},
				{
					"ns": "prod",
					"port": "80",
					"name": "service"
				},
				{
					"name": "endpoint",
					"addr": "10.250.162.250:80"
				}
			],
			"level": "WARN",
			"fields": {
				"error": "Connection refused (os error 111)"
			},
			"timestamp": "[104344.409821s]",
			"target": "linkerd_reconnect"
		}
}
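
To see which endpoint addresses a proxy keeps retrying, records like the ones above can be tallied by address. This is only a sketch under assumptions: it expects one well-formed JSON object per line in the shape shown above (message at the top level, spans under attributes), which is how our log pipeline renders the records and may differ from the proxy's raw output. The function name is hypothetical.

```python
import json
from collections import Counter


def failing_endpoints(log_lines):
    """Count 'Failed to connect' records per endpoint address."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON or truncated lines
        if not isinstance(record, dict):
            continue
        if record.get("message") != "Failed to connect":
            continue
        # The endpoint address lives in the span named "endpoint".
        for span in record.get("attributes", {}).get("spans", []):
            if span.get("name") == "endpoint" and "addr" in span:
                counts[span["addr"]] += 1
    return counts


if __name__ == "__main__":
    # One record in the shape shown above, reduced to the fields that are read.
    record = {
        "message": "Failed to connect",
        "attributes": {
            "spans": [{"name": "outbound"}, {"name": "endpoint", "addr": "10.250.162.250:80"}],
            "fields": {"error": "Connection refused (os error 111)"},
        },
    }
    print(failing_endpoints([json.dumps(record)]))
```

Addresses with high counts that are absent from kubectl get endpoints myservice are the candidates for stale endpoints.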

output of linkerd check -o short

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2023-11-08T07:32:15Z
    see https://linkerd.io/2.14/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

linkerd-webhooks-and-apisvc-tls
-------------------------------
‼ proxy-injector cert is valid for at least 60 days
    certificate will expire on 2023-10-18T13:43:33Z
    see https://linkerd.io/2.14/checks/#l5d-proxy-injector-webhook-cert-not-expiring-soon for hints
‼ sp-validator cert is valid for at least 60 days
    certificate will expire on 2023-10-21T08:56:13Z
    see https://linkerd.io/2.14/checks/#l5d-sp-validator-webhook-cert-not-expiring-soon for hints
‼ policy-validator cert is valid for at least 60 days
    certificate will expire on 2023-11-08T09:31:39Z
    see https://linkerd.io/2.14/checks/#l5d-policy-validator-webhook-cert-not-expiring-soon for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 2.13.5 but the latest stable version is 2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running stable-2.13.5 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-5b5ddcf5d4-45glv (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-94j2x (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-g95bm (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-gxvtz (v2.207.0)
	* linkerd-destination-5b5ddcf5d4-jmfn8 (v2.207.0)
	* linkerd-identity-9559b4d7f-96kv7 (v2.207.0)
	* linkerd-identity-9559b4d7f-gqddq (v2.207.0)
	* linkerd-identity-9559b4d7f-gx4bz (v2.207.0)
	* linkerd-identity-9559b4d7f-sfkb7 (v2.207.0)
	* linkerd-identity-9559b4d7f-sl7ck (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-b6w99 (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-bmgqm (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-cf2ss (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-ffzlj (v2.207.0)
	* linkerd-proxy-injector-6688d4487f-n8hxk (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7846s (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7d6kn (v2.207.0)
	* linkerd-sp-validator-dbcc64849-7pqw7 (v2.207.0)
	* linkerd-sp-validator-dbcc64849-92jzx (v2.207.0)
	* linkerd-sp-validator-dbcc64849-jkws8 (v2.207.0)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    linkerd-destination-5b5ddcf5d4-45glv running v2.207.0 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-cli-version for hints

linkerd-viz
-----------
‼ linkerd-viz pods are injected
    could not find proxy container for prometheus-scrape-1-5585795fbd-l4sn5 pod
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-injection for hints
‼ viz extension pods are running
    container "linkerd-proxy" in pod "prometheus-scrape-1-5585795fbd-l4sn5" is not ready
    see https://linkerd.io/2.14/checks/#l5d-viz-pods-running for hints
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* grafana-864b6b8ddb-jxlpk (v2.207.0)
	* metrics-api-5484cdf977-llg6t (v2.207.0)
	* tap-58654c968b-7q5hm (v2.207.0)
	* tap-injector-55597d88c7-xd7wp (v2.207.0)
	* web-cbdb85945-b5s27 (v2.207.0)
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    grafana-864b6b8ddb-jxlpk running v2.207.0 but cli running stable-2.14.1
    see https://linkerd.io/2.14/checks/#l5d-viz-proxy-cli-version for hints

linkerd-smi
-----------
‼ Linkerd extension command linkerd-smi exists
    exec: "linkerd-smi": executable file not found in $PATH
    see https://linkerd.io/2.14/checks/#extensions for hints

Status check results are √

Environment

  • EKS
  • Kubernetes 1.24
  • Linkerd 2.13.5

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe
