linkerd-proxy DNS queries do not fall back to A records when SRV is not found #8296

@magec

Description

What is the issue?

Linkerd-proxy performs a series of DNS queries against linkerd-identity-headless and linkerd-dst-headless, depending on how it is started. These are SRV queries, but they are issued without the _port-name._port-protocol prefix that the Kubernetes DNS specification mandates.

Several Kubernetes DNS services (CoreDNS, kube-dns) also answer SRV queries without the prefix, but I came across one that does not: Cloud DNS in GKE. This causes linkerd2-proxy to hang on startup.

This should not be an issue if the proxy fell back to an A-record query, as is intended here, but that code path is never executed.

I worked around the issue by manually adding SRV records pointing to the service name, so I understand that the DNS library eventually resolves them to an IP.
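For reference, the record I added was equivalent to something like this in zone-file notation (Cloud DNS uses its own API/UI; the TTL and the priority/weight values here are illustrative, only the port 8080 comes from the logs below, and an analogous record was added for linkerd-dst-headless):

```
; Hand-added SRV record for the bare (unprefixed) name, targeting the service name itself.
; TTL, priority and weight are placeholder values.
linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. 30 IN SRV 0 0 8080 linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan.
```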

How can it be reproduced?

It can be reproduced by trying to install linkerd in GKE with CloudDNS.

Logs, error output, etc

I straced the linkerd-proxy startup and could see the following:

👇 THE QUERY
1202  sendto(16, "\tK\1\0\0\1\0\0\0\0\0\0\31linkerd-identity-headless\7linkerd\3svc\4core\nkubernetes\tonesignal\3lan\0\0!\0\1", 85, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.169.254")}, 16) = 85     

1202  write(1, "[     0.022706s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: service.ready=false delay\n", 280) = 280
1202  write(1, "[     0.022793s] TRACE ThreadId(02) tower::buffer::worker: worker polling for next message\n", 91) = 91
1202  write(1, "[     0.022861s] TRACE ThreadId(02) tower::buffer::worker: resuming buffered request\n", 85) = 85
1202  write(1, "[     0.022927s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: resumed=true worker received request; waiting for service readiness\n", 322) = 322
1202  write(1, "[     0.022996s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: service.ready=false delay\n", 280) = 280
1202  epoll_wait(12, [{EPOLLOUT, {u32=16777217, u64=16777217}}], 1024, 4080) = 1
1202  epoll_wait(12, [{EPOLLIN|EPOLLOUT, {u32=16777217, u64=16777217}}], 1024, 4080) = 1
1202  write(1, "[     0.024793s] TRACE ThreadId(02) tower::buffer::worker: worker polling for next message\n", 91) = 91
1202  write(1, "[     0.024905s] TRACE ThreadId(02) tower::buffer::worker: resuming buffered request\n", 85) = 85
1202  write(1, "[     0.024977s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: tower::buffer::worker: resumed=true worker received request; waiting for service readiness\n", 322) = 322

👇  THE RESPONSE
1202  recvfrom(16, "\tK\201\200\0\1\0\0\0\1\0\0\31linkerd-identity-headless\7linkerd\3svc\4core\nkubernetes\tonesignal\3lan\0\0!\0\1\3002\0\6\0\1\0\0\1,\0T\16ns-gcp-private\rgoogledomains\3com\0\24cloud-dns-hostmaster\6google\300~\0\0\0\1\0\0T`\0\0\16\20\0\3\364\200\0\0\1,", 2048, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("169.254.169.254")}, [128->16]) = 181 

1202  write(1, "[     0.025136s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_proto::rr::record_data: reading SOA    \n", 281) = 281
1202  write(1, "[     0.025224s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_proto::udp::udp_client_stream: received message id: 2379    \n", 302) = 302
1202  write(1, "[     0.025298s] TRACE ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: mio::poll: deregistering event source from poller    \n", 285) = 285
1202  epoll_ctl(14, EPOLL_CTL_DEL, 16, NULL) = 0
1202  close(16)                         = 0

👇  THE LOGGING
1202  write(1, "[     0.025484s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: trust_dns_resolver::error: Nameserver responded with No Error and no records    \n", 312) = 312 

1202  write(1, "[     0.025563s] DEBUG ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: linkerd_proxy_resolve::recover: recovering err=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN\n", 393) = 393
1202  write(1, "[     0.025633s]  WARN ThreadId(02) identity:identity{server.addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}:controller{addr=linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan:8080}: linkerd_app_core::control: Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN\n", 421) = 421

After this, the proxy retries, and again logs Failed to resolve control-plane component error=no record found for name: linkerd-identity-headless.linkerd.svc.core.kubernetes.onesignal.lan. type: SRV class: IN

output of linkerd check -o short

Linkerd core checks
===================

linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
    issuer certificate will expire on 2022-04-21T21:36:29Z
    see https://linkerd.io/2.11/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints

Status check results are √

(I use cert-manager)

Environment

  • Kubernetes Version: Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.6-gke.300", GitCommit:"df413ee6225aa3fc539e18ca3464a48d723bd3ea", GitTreeState:"clean", BuildDate:"2022-01-24T09:29:08Z", GoVersion:"go1.16.12b7", Compiler:"gc", Platform:"linux/amd64"}
  • Cluster Environment: GKE
  • Linkerd Version: 2.11.1

Possible solution

I think the fallback mechanism should be more robust, trying in order:

- First, SRV for service-headless
- Then, SRV for _grpc._tcp.service-headless.....
- Finally, A for service-headless
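To illustrate the proposed order, here is a minimal sketch in Python. The resolver is a stand-in (FakeResolver and resolve_with_fallback are hypothetical names, not the linkerd2-proxy API); it simulates a DNS service that, like Cloud DNS here, serves no bare SRV name but does serve A records:

```python
def resolve_with_fallback(service, port_name, protocol, resolver):
    """Try SRV for the bare name, then the Kubernetes-prefixed SRV name,
    then fall back to a plain A lookup."""
    # 1. SRV for service-headless
    records = resolver.srv(service)
    if records:
        return records
    # 2. SRV for _port-name._port-protocol.service-headless
    prefixed = "_{}._{}.{}".format(port_name, protocol, service)
    records = resolver.srv(prefixed)
    if records:
        return records
    # 3. A for service-headless
    return resolver.a(service)


class FakeResolver:
    """Simulates a DNS backend; maps names to record lists, empty if absent."""
    def __init__(self, srv_records, a_records):
        self._srv = srv_records
        self._a = a_records

    def srv(self, name):
        return self._srv.get(name, [])

    def a(self, name):
        return self._a.get(name, [])


if __name__ == "__main__":
    name = "linkerd-identity-headless.linkerd.svc.cluster.local"
    # No SRV records at all, as with Cloud DNS: falls through to the A lookup.
    resolver = FakeResolver(srv_records={}, a_records={name: ["10.8.0.12"]})
    print(resolve_with_fallback(name, "grpc", "tcp", resolver))
```

With this order, a backend that only serves the prefixed SRV form (or only A records) still resolves, instead of the proxy hanging on the first failed query.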

Additional context

No response

Would you like to work on fixing this bug?

maybe
