Linkerd proxy fails to reconnect to restarted DaemonSet pod. #10590

@Steffen911

Description

What is the issue?

We run the OpenTelemetry Collector as a DaemonSet and send traces from our pods to the collector via the node IP address. When we restart the collector, each pod receives a new Pod IP, but the node IP stays the same. Some proxies keep trying to connect to the old Pod IP and therefore fail to reconnect.

Example:
Emojiservice is supposed to send traces to the Collector service. The collector runs as a DaemonSet and exposes port 4317 (gRPC). We inject the NodeIP via the downwards API into the emojiservice to make it send traces to 10.167.0.1:4317. This is resolved to the PodIP of the Collector 10.169.10.1:4317.

Now we restart the DaemonSet, and the collector's Pod IP changes to 10.169.11.1. However, I still see debug log entries from the emojiservice sidecar trying to connect to the old IP (see log example).

I believe it is related to #8956, but I don't know how, or whether I can use the diagnostics command for IP-based connections.

How can it be reproduced?

Call instances of a DaemonSet using the node IP address from the K8s downward API. Restart the DaemonSet's pods so they are assigned new Pod IPs. Observe that connections made via the node IP still target the old Pod IP and fail to be re-established.
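For reference, the node IP injection looks roughly like this (a minimal sketch; the container name, endpoint env var, and namespace are illustrative and may differ per OpenTelemetry SDK):

```yaml
# Hypothetical pod spec fragment: the application reads the node IP via
# the downward API and uses it as the OTLP gRPC endpoint.
containers:
  - name: emojiservice
    env:
      - name: NODE_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP   # node IP, stable across collector restarts
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://$(NODE_IP):4317" # resolves to the collector pod on this node
```

Restarting the collector (e.g. `kubectl rollout restart daemonset/<collector>`) assigns new Pod IPs, after which the stale connections become visible in the proxy debug logs.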

Logs, error output, etc

{
	"id": "AQAAAYcDrhY0uLmpXQAAAABBWWNEcmgya0FBQjMxcXIxZEpFSnJBQTQ",
	"content": {
		"timestamp": "2023-03-21T10:19:13.332Z",
		"tags": [
			"short_image:proxy",
			"kube_container_name:linkerd-proxy",
			"image_tag:stable-2.12.4",
			"pod_phase:running",
			"source:proxy",
			"kube_ownerref_kind:replicaset",
			"container_name:linkerd-proxy",
			"cloud_provider:gcp"
		],
		"attributes": {
			"threadId": "ThreadId(1)",
			"spans": [
				{
					"name": "outbound"
				},
				{
					"name": "proxy",
					"addr": "10.167.0.1:4317"
				}
			],
			"level": "DEBUG",
			"fields": {
				"server": {
					"addr": "10.169.10.1:4317"
				},
				"message": "Connecting"
			},
			"timestamp": "[ 75611.397246s]",
			"target": "linkerd_proxy_transport::connect"
		}
	}
}

output of linkerd check -o short

» linkerd check -o short
Status check results are √

Environment

  • Kubernetes Version: v1.24.10-gke.2300
  • Environment: GKE
  • Linkerd Version: stable-2.12.4

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

None
