
linkerd-destination seemingly serving incorrect Pod IPs causing connection errors #8956

@dwilliams782

Description


What is the issue?

Slack thread here.

One of our application's proxies is full of failed to connect errors. Examples:

[   584.018861s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.17.10:4317}: linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  2104.773832s]  WARN ThreadId(01) linkerd_reconnect: Failed to connect error=connect timed out after 1s

 [  1785.088002s]  WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.19.9:4317}: linkerd_reconnect: Failed to connect error=Connection refused (os error 111)

Debug logs aren’t much use:

DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Recovering

offer-price-index-service-ie-66bbbfcff-f8kbg [  1851.149239s] DEBUG ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.18:4317}: linkerd_reconnect: Disconnected backoff=true

The service otel-agent.open-telemetry.svc.cluster.local:4317 points to a DaemonSet, so pods are constantly being replaced as our preemptible nodes recycle.
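For context, that endpoint churn can be watched directly; a minimal sketch, assuming the Service and namespace names from the logs above:

    # watch Endpoints membership change as preemptible nodes recycle DaemonSet pods
    kubectl get endpoints otel-agent -n open-telemetry -o wide -w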

I caught a failed request in Tap, where all other requests complete in ~1ms:

[Tap screenshot: the failed request alongside the ~1ms requests]

I then found this log entry:

2022-07-22 12:48:54
WARN ThreadId(01) outbound:accept{client.addr=10.225.11.11:35658}:proxy{addr=10.224.28.33:4317}:http{v=h2}:logical{dst=otel-agent.open-telemetry.svc.cluster.local:4317}:concrete{addr=otel-agent.open-telemetry.svc.cluster.local:4317}:endpoint{server.addr=10.225.13.4:4317}: linkerd_reconnect: Service failed error=channel closed

10.225.13.4 belonged to the pod otel-agent-6kg56 until yesterday (21/07) at 13:52, when the IP was reassigned to the pod tap-injector-7c4b6dc5f9-7wsxn. For some reason, the linkerd-proxy is still sending requests to IPs that no longer belong to that DaemonSet.
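To cross-check this kind of stale-endpoint suspicion, you can compare the cluster's view of an IP with what the destination controller advertises; a sketch, assuming a reasonably recent CLI:

    # which pod currently owns the IP the proxy is still dialing?
    kubectl get pods -A -o wide | grep 10.225.13.4

    # which endpoints the destination controller currently advertises for the authority
    linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317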

I run three replicas of linkerd-destination, which at the time of writing were:

linkerd-destination-8f9d689fc-lxl5f       4/4     Running   0          3d22h
linkerd-destination-8f9d689fc-mddv5       4/4     Running   0          21h
linkerd-destination-8f9d689fc-rn8dn       4/4     Running   0          21h

I searched the last two days of linkerd-destination logs for any entries relating to the stale otel-agent pod (with the query sketched after the entries) and found two:

2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-lxl5f time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
2022-07-21 14:51:44
 linkerd-destination-8f9d689fc-zqxhr time="2022-07-21T13:51:44Z" level=error msg="Unable to create new address:unable to fetch pod open-telemetry/otel-agent-6kg56: pod \"otel-agent-6kg56\" not found" addr=":8086" component=service-publisher ns=open-telemetry port=4317 svc=otel-agent
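For reference, the log search was roughly the following; linkerd.io/control-plane-component is the standard Linkerd control-plane label, but treat the exact selector as an assumption:

    kubectl logs -n linkerd -l linkerd.io/control-plane-component=destination \
      -c destination --since=48h | grep otel-agent-6kg56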

zqxhr had since been deleted, but lxl5f was still alive. My theory is that this linkerd-destination pod entered an unhealthy state while still passing readiness checks, and kept serving "bad information" to the linkerd-proxy containers requesting destination information.

I then killed the linkerd-destination pod that logged these errors at 13:10, and the error has not come back since.
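For anyone hitting the same symptoms: restarting only the suspect replica was enough, rather than rolling the whole deployment. The pod name below is the one from my cluster:

    kubectl delete pod linkerd-destination-8f9d689fc-lxl5f -n linkerd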

How can it be reproduced?

Not confirmed, but presumably: remove pods from a DaemonSet and check whether the destination controller's endpoint state updates, as sketched below.
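A rough, unverified sketch of that loop; the app=otel-agent label and the one-minute settle time are assumptions:

    # churn the DaemonSet pods, then diff the destination controller's view
    # against the live Endpoints object
    for i in $(seq 1 10); do
      kubectl delete pod -n open-telemetry -l app=otel-agent --wait=false
      sleep 60
      linkerd diagnostics endpoints otel-agent.open-telemetry.svc.cluster.local:4317
      kubectl get endpoints otel-agent -n open-telemetry \
        -o jsonpath='{.subsets[*].addresses[*].ip}'; echo
    done

Any IP that shows up in the diagnostics output but not in the Endpoints object would reproduce the stale state described above.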

Logs, error output, etc

See section above.

output of linkerd check -o short

I have a mismatched CLI version, so ignore the version warnings:

Linkerd core checks
===================

linkerd-version
---------------
‼ cli is up-to-date
    is running version 22.6.2 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 22.7.1 but the latest edge version is 22.7.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints
‼ control plane and cli versions match
    control plane running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* alpine-curl-6cd54fb65d-j6z7p (edge-22.7.1)
	* linkerd-destination-8f9d689fc-jrkr2 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-mddv5 (edge-22.7.1)
	* linkerd-destination-8f9d689fc-rn8dn (edge-22.7.1)
	* linkerd-identity-55c8d467bf-hr479 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-r4bd4 (edge-22.7.1)
	* linkerd-identity-55c8d467bf-zvnvl (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-5rnj4 (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-r692g (edge-22.7.1)
	* linkerd-proxy-injector-84c59f7bf9-wjh44 (edge-22.7.1)
	* nginx-67888f898b-x9mz4 (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
‼ control plane proxies and cli versions match
    alpine-curl-6cd54fb65d-j6z7p running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-cp-proxy-cli-version for hints

Linkerd extensions checks
=========================

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
	* metrics-api-794fbbf94b-z7b5r (edge-22.7.1)
	* prometheus-6cd858c97b-rh9xr (edge-22.7.1)
	* tap-68b97d8d8b-szrc9 (edge-22.7.1)
	* tap-68b97d8d8b-tvj55 (edge-22.7.1)
	* tap-68b97d8d8b-xhs69 (edge-22.7.1)
	* tap-injector-7c4b6dc5f9-7wsxn (edge-22.7.1)
	* web-6576dcfdb-g4mhd (edge-22.7.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-794fbbf94b-z7b5r running edge-22.7.1 but cli running edge-22.6.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints

Status check results are √

Environment

Kubernetes version: 1.22
Host: GKE
Linkerd version: edge-22.7.1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

No response
