What is the issue?
Slack thread here.
One of our application's proxies is full of "failed to connect" errors. Examples:
[ 584.018861s] WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 2104.773832s] WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s
[ 1785.088002s] WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
Debug logs aren’t much use:
DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering
offer-price-index-service-ie-66bbbfcff-f8kbg [ 1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true
The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so pods are constantly switching with our preemptible nodes.
I caught a failed request in Tap, while all other requests complete in ~1ms:
I then found this log entry:
2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed
10.225.13.4 belonged to the pod otel-agent-6kg56 until yesterday (21/07) at 13:52; that IP was then reassigned to the pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to that DaemonSet.
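As a quick way to confirm which pod currently holds a given IP, and what the Service actually advertises, something along these lines can be used (IP, Service, and namespace taken from this report; assumes `kubectl` access to the cluster):

```shell
# Find the pod that currently owns the IP the proxy is still dialing
# (status.podIP is a supported pod field selector).
kubectl get pods --all-namespaces -o wide \
  --field-selector status.podIP=10.225.13.4

# Cross-check against the endpoints the Service currently advertises:
kubectl get endpoints otel-agent -n open-telemetry \
  -o jsonpath='{range .subsets[*].addresses[*]}{.ip}{"\n"}{end}'
```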
I run three replicas of linkerd-destination, which at the time of writing had:
linkerd-destination-8f9d689fc-lxl5f 4/4 Running 0 3d22h
linkerd-destination-8f9d689fc-mddv5 4/4 Running 0 21h
linkerd-destination-8f9d689fc-rn8dn 4/4 Running 0 21h
I searched the last two days of logs from linkerd-destination pods for any logs relating to the otel-agent pod that was no longer relevant and found two log entries:
2022-07-21 14:51:44
linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
zqxhr had since been deleted, but lxl5f is still alive. My theory is that this linkerd-destination pod entered an unhealthy state while still passing its readiness checks, and kept serving "bad information" to the linkerd-proxy containers requesting destination information.
I then killed the linkerd-destination pod that produced these errors at 13:10 and have not seen the error come back since.
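To check what the destination controller itself is resolving for this authority (as opposed to what the Kubernetes Endpoints object says), the CLI ships a diagnostics subcommand; a sketch, assuming a release that includes `linkerd diagnostics endpoints`:

```shell
# Ask the destination controller which endpoints it currently resolves
# for the authority the proxy logs reference:
linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317

# Compare against the Endpoints object maintained by Kubernetes:
kubectl get endpoints otel-agent -n open-telemetry -o wide
```

A stale IP in the first output but not the second would point at the destination controller rather than the cluster's endpoint bookkeeping.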
How can it be reproduced?
Not confirmed, but try removing pods from a DaemonSet and checking whether the destination service updates its endpoints?
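That idea could be sketched as a script like the following (namespace and authority taken from this report; the `app=otel-agent` label selector and the 30s settle time are assumptions to adjust for your cluster):

```shell
#!/usr/bin/env bash
set -euo pipefail

NS=open-telemetry
AUTHORITY=otel-agent.${NS}.svc.cluster.local:4317

# Pick one DaemonSet pod and record its IP before deleting it.
POD=$(kubectl get pods -n "$NS" -l app=otel-agent \
  -o jsonpath='{.items[0].metadata.name}')
IP=$(kubectl get pod -n "$NS" "$POD" -o jsonpath='{.status.podIP}')
echo "deleting $POD ($IP)"
kubectl delete pod -n "$NS" "$POD"

# Give the control plane a moment, then check whether the destination
# controller still advertises the deleted pod's IP.
sleep 30
if linkerd diagnostics endpoints "$AUTHORITY" | grep -q "$IP"; then
  echo "stale endpoint still advertised"
else
  echo "endpoint removed as expected"
fi
```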
Logs, error output, etc
See section above.
output of linkerd check -o short
I have a mismatched CLI version, so ignore the version warnings:
Linkerd core checks
===================
linkerd-version
---------------
‼ cli is up-to-date
is running version 22.6.2 but the latest edge version is 22.7.2
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 22.7.1 but the latest edge version is 22.7.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
control plane running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
* linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
* linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
* linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
* linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
* linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
* linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
* linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
* nginx-67888f898b-x9mz4 (edge-22.7.1)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints
Linkerd extensions checks
=========================
linkerd-viz
-----------
‼ viz extension proxies are up-to-date
some proxies are not running the current version:
* metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
* prometheus-6cd858c97b-rh9xr (edge-22.7.1)
* tap-68b97d8d8b-szrc9 (edge-22.7.1)
* tap-68b97d8d8b-tvj55 (edge-22.7.1)
* tap-68b97d8d8b-xhs69 (edge-22.7.1)
* tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
* web-6576dcfdb-g4mhd (edge-22.7.1)
see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
No response
