
linkerd-destination seemingly serving incorrect Pod IPs causing connection errors #8956

@dwilliams782

Description


What is the issue?

Slack thread here.

One of our application's proxies is full of failed to connect errors. Examples:

[   584.018861s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  2104.773832s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  1785.088002s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

Debug logs aren’t much use:

DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering

offer-price-index-service-ie-66bbbfcff-f8kbg [  1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true

The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so pods are constantly being replaced as our preemptible nodes recycle.
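For context, that endpoint churn can be watched directly; a minimal sketch, assuming the Service and namespace names from the logs above:

    # watch Endpoints membership change as preemptible nodes recycle DaemonSet pods
    kubectl get endpoints otel-agent -n open-telemetry -o wide -w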

I caught a failed request in Tap, where all other requests complete in ~1ms:

[Tap screenshot: the failed request alongside the ~1ms requests]

I then found this log entry:

2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed

10.225.13.4 belonged to the pod otel-agent-6kg56 until yesterday (21/07) at 13:52, when the IP was reassigned to the pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to that DaemonSet.
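To cross-check this kind of stale-endpoint suspicion, you can compare the cluster's view of an IP with what the destination controller advertises; a sketch, assuming a reasonably recent CLI:

    # which pod currently owns the IP the proxy is still dialing?
    kubectl get pods -A -o wide | grep 10.225.13.4

    # which endpoints the destination controller currently advertises for the authority
    linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317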

I run three replicas of linkerd-destination, which at the time of writing were:

linkerd-destination-8f9d689fc-lxl5f       4/4     Running   0          3d22h
linkerd-destination-8f9d689fc-mddv5       4/4     Running   0          21h
linkerd-destination-8f9d689fc-rn8dn       4/4     Running   0          21h

I searched the last two days of linkerd-destination logs for any entries relating to the stale otel-agent pod (with the query sketched after the entries) and found two:

2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
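For reference, the log search was roughly the following; linkerd.io/control-plane-component is the standard Linkerd control-plane label, but treat the exact selector as an assumption:

    kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination \
      -c destination --since=48h | grep otel-agent-6kg56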

zqxhr had since been deleted, but lxl5f was still alive. My theory is that this linkerd-destination pod entered an unhealthy state while still passing readiness checks, and kept serving "bad information" to the linkerd-proxy containers requesting destination information.

I then killed the linkerd-destination pod that logged these errors at 13:10, and the error has not come back since.
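For anyone hitting the same symptoms: restarting only the suspect replica was enough, rather than rolling the whole deployment. The pod name below is the one from my cluster:

    kubectl delete pod linkerd-destination-8f9d689fc-lxl5f -n linkerd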

How can it be reproduced?

Not confirmed, but presumably: remove pods from a DaemonSet and check whether the destination controller's endpoint state updates, as sketched below.
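A rough, unverified sketch of that loop; the app=otel-agent label and the one-minute settle time are assumptions:

    # churn the DaemonSet pods, then diff the destination controller's view
    # against the live Endpoints object
    for i in $(seq 1 10); do
      kubectl delete pod -n open-telemetry -l app=otel-agent --wait=false
      sleep 60
      linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
      kubectl get endpoints otel-agent -n open-telemetry \
        -o jsonpath='{.subsets[*].addresses[*].ip}'; echo
    done

Any IP that shows up in the diagnostics output but not in the Endpoints object would reproduce the stale state described above.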

Logs, error output, etc

See section above.

output of linkerd check -o short

I have a mismatched CLI version, so ignore the version warnings:

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 22.6.2 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 22.7.1 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
	* linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
	* linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
	* nginx-67888f898b-x9mz4 (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
	* prometheus-6cd858c97b-rh9xr (edge-22.7.1)
	* tap-68b97d8d8b-szrc9 (edge-22.7.1)
	* tap-68b97d8d8b-tvj55 (edge-22.7.1)
	* tap-68b97d8d8b-xhs69 (edge-22.7.1)
	* tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
	* web-6576dcfdb-g4mhd (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response
