What is the issue?
Linkerd-proxy performs a series of DNS queries against linkerd-identity-headless and linkerd-dst-headless, depending on how it is started. These are SRV queries, but they lack the _port-name._port-protocol prefix mandated by the SRV specification (RFC 2782).
Several Kubernetes DNS services (CoreDNS, kube-dns) publish SRV records without the prefix, but I came across one that does not (CloudDNS in GKE). This causes linkerd2-proxy to hang on startup.
This would not be an issue if the proxy fell back to querying an A record, as intended here, but that code path is never executed.
I worked around the issue by manually adding SRV records pointing at the service name, so I understand that the DNS library eventually resolves them to an IP.
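For reference, RFC 2782 defines SRV lookup names as _service._proto.name. A minimal sketch of the two name shapes involved (the helper function is illustrative, and I'm using the default cluster.local suffix rather than my cluster's custom domain):

```python
# Sketch: the SRV name shape RFC 2782 specifies vs. the bare name
# the proxy actually queries (service name taken from the logs above).

def rfc2782_srv_name(service: str, proto: str, target: str) -> str:
    """Build an RFC 2782 SRV query name: _service._proto.target."""
    return f"_{service}._{proto}.{target}"

target = "linkerd-identity-headless.linkerd.svc.cluster.local"

# What the proxy sends today: an SRV query for the bare service name.
bare_query = target

# What RFC 2782 prescribes (and what CloudDNS publishes): a prefixed name.
prefixed_query = rfc2782_srv_name("grpc", "tcp", target)

print(bare_query)
print(prefixed_query)
```

CoreDNS and kube-dns answer the bare-name SRV query anyway, which is why this only surfaces on CloudDNS.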
How can it be reproduced?
It can be reproduced by trying to install linkerd in GKE with CloudDNS.
Logs, error output, etc
I straced the linkerd-proxy startup and could see the following:
👇 THE QUERY
1202 sendto(16, "\tK\1\0\0\1\0\0\0\0\0\0\31linkerd-identity-headless\7linkerd\3svc\4core\nkubernetes\tonesignal\3lan\0\0!\0\1", 85, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.169.254")}, 16) = 85
1202 write(1, "[ 0.022706s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: service.ready=false delay\n", 280) = 280
1202 write(1, "[ 0.022793s] TRACE ThreadId(02) tower::buffer::worker: worker polling for next message\n", 91) = 91
1202 write(1, "[ 0.022861s] TRACE ThreadId(02) tower::buffer::worker: resuming buffered request\n", 85) = 85
1202 write(1, "[ 0.022927s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: resumed=true worker received request; waiting for service readiness\n", 322) = 322
1202 write(1, "[ 0.022996s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: service.ready=false delay\n", 280) = 280
1202 epoll_wait(12, [{EPOLLOUT, {u32=16777217, u64=16777217}}], 1024, 4080) = 1
1202 epoll_wait(12, [{EPOLLIN|EPOLLOUT, {u32=16777217, u64=16777217}}], 1024, 4080) = 1
1202 write(1, "[ 0.024793s] TRACE ThreadId(02) tower::buffer::worker: worker polling for next message\n", 91) = 91
1202 write(1, "[ 0.024905s] TRACE ThreadId(02) tower::buffer::worker: resuming buffered request\n", 85) = 85
1202 write(1, "[ 0.024977s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: resumed=true worker received request; waiting for service readiness\n", 322) = 322
👇 THE RESPONSE
1202 recvfrom(16, "\tK\201\200\0\1\0\0\0\1\0\0\31linkerd-identity-headless\7linkerd\3svc\4core\nkubernetes\tonesignal\3lan\0\0!\0\1\3002\0\6\0\1\0\0\1,\0T\16ns-gcp-private\rgoogledomains\3com\0\24cloud-dns-hostmaster\6google\300~\0\0\0\1\0\0T`\0\0\16\20\0\3\364\200\0\0\1,", 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.169.254")}, [128->16]) = 181
1202 write(1, "[ 0.025136s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_proto::rr::record_data: reading SOA \n", 281) = 281
1202 write(1, "[ 0.025224s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_proto::udp::udp_client_stream: received message id: 2379 \n", 302) = 302
1202 write(1, "[ 0.025298s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: mio::poll: deregistering event source from poller \n", 285) = 285
1202 epoll_ctl(14, EPOLL_CTL_DEL, 16, NULL) = 0
1202 close(16) = 0
👇 THE LOGGING
1202 write(1, "[ 0.025484s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_resolver::error: Nameserver responded with No Error and no records \n", 312) = 312
1202 write(1, "[ 0.025563s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: linkerd_proxy_resolve::recover: recovering err=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN\n", 393) = 393
1202 write(1, "[ 0.025633s] WARN ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN\n", 421) = 421
After this, it retries and again logs Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN
output of linkerd check -o short
Linkerd core checks
===================
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on 2022-04-21T21:36:29Z
see https://linkerd.io/2.11/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
Status check results are √
(I use cert-manager)
Environment
- Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6-gke.300", GitCommit:"df413ee6225aa3fc539e18ca3464a48d723bd3ea", GitTreeState:"clean", BuildDate:"2022-01-24T09:29:08Z", GoVersion:"go1.16.12b7", Compiler:"gc", Platform:"linux/amd64"}
- Cluster Environment: GKE
- Linkerd Version: 2.11.1
Possible solution
I think the fallback mechanism should be more robust:
- First, SRV for service-headless
- Then, SRV for _grpc._tcp.service-headless.....
- Then, A for service-headless
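A minimal sketch of that fallback order, with a stubbed resolver standing in for the behavior seen in the logs (the zone contents, port number, and return shapes are hypothetical; the real proxy implements resolution in Rust on top of trust-dns):

```python
# Sketch of the proposed resolution order. The stub resolver mimics
# CloudDNS: no SRV record for the bare service name, but the RFC
# 2782-prefixed SRV name and a plain A record both resolve.
from typing import Optional

SERVICE = "linkerd-identity-headless.linkerd.svc.cluster.local"

# Hypothetical zone contents modeling the CloudDNS behavior.
FAKE_ZONE = {
    ("SRV", f"_grpc._tcp.{SERVICE}"): ["0 0 8080 " + SERVICE],
    ("A", SERVICE): ["10.8.0.12"],
}

def lookup(rtype: str, name: str) -> list:
    """Stub resolver: NOERROR with an empty answer when the name is absent."""
    return FAKE_ZONE.get((rtype, name), [])

def resolve_with_fallback(service: str) -> Optional[tuple]:
    """Try each (rtype, name) pair in order; return the first non-empty answer."""
    attempts = [
        ("SRV", service),                  # 1. bare SRV (works on CoreDNS/kube-dns)
        ("SRV", f"_grpc._tcp.{service}"),  # 2. prefixed SRV (works on CloudDNS)
        ("A", service),                    # 3. plain A lookup as the last resort
    ]
    for rtype, name in attempts:
        answers = lookup(rtype, name)
        if answers:
            return rtype, name, answers
    return None

print(resolve_with_fallback(SERVICE))
```

With this ordering, clusters whose DNS answers the bare SRV query keep their current behavior, while CloudDNS clusters succeed on the second or third attempt instead of looping forever.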
Additional context
No response
Would you like to work on fixing this bug?
maybe