Description
Bug Report
We observe that Pods running in proxy ingress mode send traffic to the target Pods of the wrong Service.
It took us 8 hours of debugging today to figure out that a production incident was actually caused by the buggy behaviour of linkerd2 described below.
What is the issue?
How can it be reproduced?
We can reproduce it with the following setup:
- Pod A (10.40.39.46) is acting as ingress gateway and runs in proxy ingress mode with the `linkerd.io/inject: ingress` annotation.
- A Deployment with 2 Pods B1 (10.40.0.78) and B2 (10.40.33.132) is serving HTTP traffic, exposed as Service B (`gh-mgmt-service-api.nx-devices`) with ClusterIP 172.20.53.184.
- A Deployment with 2 Pods C1 (10.40.0.33) and C2 (10.40.39.15) is serving HTTP traffic, exposed as Service C (`gh-commander-api.nx-devices`) with ClusterIP 172.20.225.244.
Now, we exec into Pod A and from within that Pod we execute the following statement (a), which works as expected:

```
curl http://172.20.53.184:8000/ -H "l5d-dst-override: gh-mgmt-service-api.nx-devices.svc.cluster.local:8000"
```

Then, still from within Pod A, we execute the following statement (b):

```
curl http://172.20.225.244:8000/ -H "l5d-dst-override: gh-commander-api.nx-devices.svc.cluster.local:8000"
```

This request (b) is actually forwarded to Pod B1 or B2 instead of C1 or C2. Tapping into the traffic of Pods B1 and B2 revealed the following request headers for (b):
- `host: 172.20.225.244:8000` (the ClusterIP address of Service C)
- `l5d-dst-canonical: gh-commander-api.nx-devices:8000` (the DNS name and port of Service C)
While both headers are set correctly, the traffic arrives at the wrong Pods (B1/B2 instead of C1/C2). This is very confusing, and we have to assume it is caused by the outbound linkerd-proxy on Pod A, which somehow uses the wrong IPv4 address as the target for the request. Note that we double-checked everything several times: all Pods and Services are configured correctly, and there is no overlap of backend Pods (Service Endpoints, i.e. IPv4 addresses) between Services B and C.
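The no-overlap claim can be checked mechanically. Here is a minimal sketch in Python, using the Pod IPs listed in the setup above (the variable names are illustrative, not actual Kubernetes objects):

```python
# Endpoint IPs of Services B and C, as listed in the setup above.
service_b_endpoints = {"10.40.0.78", "10.40.33.132"}  # Pods B1, B2
service_c_endpoints = {"10.40.0.33", "10.40.39.15"}   # Pods C1, C2

# The two Services share no backend Pods.
assert service_b_endpoints.isdisjoint(service_c_endpoints)

# Yet the peer address observed in the proxy log for request (b),
# which was addressed to Service C, is:
observed_peer = "10.40.0.78"
print(observed_peer in service_b_endpoints)  # True: a Service B Pod
print(observed_peer in service_c_endpoints)  # False
```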
Logs, error output, etc
Here is a small (redacted) excerpt from the ingress' (Pod A) linkerd-proxy log. I do not have much experience with linkerd-proxy logs, but the mere fact that IP addresses from Pods of both Services (B and C) appear in the same log message looks suspicious to me:
{"log":"{\"timestamp\":\"[ 344.440286s]\",\"level\":\"DEBUG\",\"fields\":{\"message\":\"Upgrading request\",\"version\":\"HTTP/1.1\",\"absolute_form\":false},\"target\":\"linkerd_proxy_http::orig_proto\",\"spans\":[{\"name\":\"outbound\"},{\"client.addr\":\"10.40.39.46:45722\",\"target.addr\":\"10.40.39.15:8000\",\"name\":\"accept\"},{\"v\":\"1.x\",\"name\":\"http\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"target\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"logical\"},{\"addr\":\"gh-mgmt-service-api.nx-devices.svc.cluster.local:8000\",\"name\":\"concrete\"},{\"peer.addr\":\"10.40.0.78:8000\",\"name\":\"endpoint\"},{\"name\":\"orig-proto-upgrade\"}],\"threadId\":\"ThreadId(3)\"}\n","stream":"stdout","time":"2021-03-30T21:36:46.129499937Z"}
{"log":"{\"timestamp\":\"[ 344.440245s]\",\"level\":\"DEBUG\",\"fields\":{\"headers\":\"{\\\"host\\\": \\\"gh-commander-api.nx-devices.svc.cluster.local:8000\\\", \\\"user-agent\\\": \\\"Wget\\\", \\\"authorization\\\": \\\"Basic redacted\\\", \\\"l5d-dst-canonical\\\": \\\"gh-commander-api.nx-devices.svc.cluster.local:8000\\\"}\"},\"target\":\"linkerd_proxy_http::client\",\"spans\":[{\"name\":\"outbound\"},{\"client.addr\":\"10.40.39.46:45722\",\"target.addr\":\"10.40.39.15:8000\",\"name\":\"accept\"},{\"v\":\"1.x\",\"name\":\"http\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"target\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"logical\"},{\"addr\":\"gh-mgmt-service-api.nx-devices.svc.cluster.local:8000\",\"name\":\"concrete\"},{\"peer.addr\":\"10.40.0.78:8000\",\"name\":\"endpoint\"},{\"name\":\"orig-proto-upgrade\"}],\"threadId\":\"ThreadId(3)\"}\n","stream":"stdout","time":"2021-03-30T21:36:46.129476157Z"}
{"log":"{\"timestamp\":\"[ 344.440199s]\",\"level\":\"DEBUG\",\"fields\":{\"method\":\"GET\",\"uri\":\"http://gh-commander-api.nx-devices.svc.cluster.local:8000/settings/latest/m:RVIQO0\",\"version\":\"HTTP/1.1\"},\"target\":\"linkerd_proxy_http::client\",\"spans\":[{\"name\":\"outbound\"},{\"client.addr\":\"10.40.39.46:45722\",\"target.addr\":\"10.40.39.15:8000\",\"name\":\"accept\"},{\"v\":\"1.x\",\"name\":\"http\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"target\"},{\"dst\":\"gh-commander-api.nx-devices.svc.cluster.local:8000\",\"name\":\"logical\"},{\"addr\":\"gh-mgmt-service-api.nx-devices.svc.cluster.local:8000\",\"name\":\"concrete\"},{\"peer.addr\":\"10.40.0.78:8000\",\"name\":\"endpoint\"},{\"name\":\"orig-proto-upgrade\"}],\"threadId\":\"ThreadId(3)\"}\n","stream":"stdout","time":"2021-03-30T21:36:46.129423316Z"}

linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is issued by the trust anchor
linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ sp-validator webhook has valid cert
linkerd-api
-----------
√ control plane pods are ready
√ can initialize the client
√ can query the control plane API
linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date
control-plane-version
---------------------
√ control plane is up-to-date
√ control plane and cli versions match
Status check results are √
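The mismatch visible in the proxy log records above can be extracted mechanically. Below is a sketch over a simplified reconstruction of one record's span list (the real records are JSON-in-JSON docker log lines; the field values are copied verbatim from the log, only the structure is simplified):

```python
# Simplified reconstruction of the "spans" array from one proxy log record above.
record = {
    "spans": [
        {"name": "outbound"},
        {"name": "accept", "client.addr": "10.40.39.46:45722",
         "target.addr": "10.40.39.15:8000"},
        {"name": "http", "v": "1.x"},
        {"name": "target", "dst": "gh-commander-api.nx-devices.svc.cluster.local:8000"},
        {"name": "logical", "dst": "gh-commander-api.nx-devices.svc.cluster.local:8000"},
        {"name": "concrete", "addr": "gh-mgmt-service-api.nx-devices.svc.cluster.local:8000"},
        {"name": "endpoint", "peer.addr": "10.40.0.78:8000"},
    ]
}

# Index the spans by name and compare the routing stages.
spans = {s["name"]: s for s in record["spans"]}
print("logical :", spans["logical"]["dst"])         # Service C, as requested
print("concrete:", spans["concrete"]["addr"])       # Service B (!)
print("endpoint:", spans["endpoint"]["peer.addr"])  # 10.40.0.78, Pod B1
```

The `logical` destination (Service C) and the `concrete` target (Service B) disagree within a single record, which is exactly the wrong-backend behaviour described above.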
Environment
- Kubernetes Version: 1.19
- Cluster Environment: EKS
- Host OS: Amazon Linux 2
- Linkerd version: 2.10.0
Possible solution
As a workaround we no longer set the l5d-dst-override header on the ingress Pod A, which resolved our production incident. Hopefully, someone with more know-how about the inner workings of the DNS-name-to-IPv4 translation in the linkerd-proxy can shed more light on the root cause of this behaviour.
Additional context
This issue has already been triaged on Slack.
Ingress Gateway: Gloo-Edge version 1.6.17.
There are no TrafficSplit resources in that cluster:
❯ kubectl get -A trafficsplit
No resources found