linkerd keeps talking with Terminating pods #3854

@falcoriss

Bug Report

What is the issue?

When one of my nodes crashes, linkerd appears to keep sending traffic to pods on the failed node.

How can it be reproduced?

  1. Install linkerd and emojivoto on a kind (procedure described here for kind) or kubeadm cluster (tested on 1.12, 1.15, 1.17) with at least 3 nodes. The number of masters doesn't matter
  2. Install and configure metallb to provide an external IP to the emojivoto app
  3. Mesh emojivoto
  4. Scale all emojivoto deployments to 2 replicas
  5. Pause the container of a node that hosts a replica of an "emoji" pod
  6. Intermittently, the web UI fails to display emojis and stays stuck on "Loading emojis"

I usually have to run the test a few times before it hits, deleting all pods and trying again. When it does occur, it doesn't matter how long I wait: it never comes back up until I unpause the container (i.e. the node).

Logs, error output, etc

Here are screenshots of the tests I did, with the relevant information.

In the following screenshot, you can see that the endpoint list for the service has been updated since I paused the first node: there is one valid endpoint for each service. However, the web container still cannot reach the emoji service:
loading_emoji_stuck

I then went into the web container to check whether anything was still going to the pod in Terminating state, and that appears to be the case. The Terminating pod has IP 10.244.1.27, and each time I get a failed access (Loading emojis), a line shows up in tcpdump for a connection to that Terminating pod (which is on the paused, hence unreachable, node):
linkerd_tcpdump
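
The comparison described above (current service endpoints versus the destinations seen in tcpdump) can be sketched in Python. This is only an illustrative helper, not part of linkerd or kubectl: the function name `stale_destinations` and the ready endpoint address 10.244.2.30 are made up for the example, while 10.244.1.27 is the Terminating pod's IP from the capture.

```python
def stale_destinations(ready_endpoints, observed_destinations):
    """Return observed destination IPs that are NOT current ready endpoints.

    ready_endpoints: IPs listed in the service's endpoints (e.g. copied from
        `kubectl get endpoints`).
    observed_destinations: destination IPs seen on the wire (e.g. copied from
        tcpdump output), possibly with repeats.
    The result keeps first-seen order and drops duplicates.
    """
    ready = set(ready_endpoints)
    stale = []
    for ip in observed_destinations:
        if ip not in ready and ip not in stale:
            stale.append(ip)
    return stale


if __name__ == "__main__":
    # 10.244.2.30 is a hypothetical ready endpoint; 10.244.1.27 is the
    # Terminating pod from this report.
    ready = ["10.244.2.30"]
    seen = ["10.244.2.30", "10.244.1.27", "10.244.1.27"]
    print(stale_destinations(ready, seen))
```

Any address returned here is a destination the proxy still dials even though Kubernetes no longer lists it as an endpoint, which is the symptom reported in this issue.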

The node status is correct:

# k get nodes
NAME                 STATUS     ROLES    AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION      CONTAINER-RUNTIME
kind-control-plane   Ready      master   165m   v1.17.0   172.17.0.2    <none>        Ubuntu 19.10   4.15.0-72-generic   containerd://1.3.2
kind-worker          Ready      <none>   165m   v1.17.0   172.17.0.3    <none>        Ubuntu 19.10   4.15.0-72-generic   containerd://1.3.2
kind-worker2         NotReady   <none>   165m   v1.17.0   172.17.0.6    <none>        Ubuntu 19.10   4.15.0-72-generic   containerd://1.3.2
kind-worker3         Ready      <none>   165m   v1.17.0   172.17.0.5    <none>        Ubuntu 19.10   4.15.0-72-generic   containerd://1.3.2
kind-worker4         Ready      <none>   165m   v1.17.0   172.17.0.4    <none>        Ubuntu 19.10   4.15.0-72-generic   containerd://1.3.2

I get these logs from the web service:

# kubetail web
Will tail 6 logs...
web-7f7b69d467-vlqv2 web-svc
web-7f7b69d467-vlqv2 linkerd-proxy
web-7f7b69d467-vlqv2 linkerd-init
web-7f7b69d467-w9sbf web-svc
web-7f7b69d467-w9sbf linkerd-proxy
web-7f7b69d467-w9sbf linkerd-init
Error from server: Get https://172.17.0.6:10250/containerLogs/emojivoto/web-7f7b69d467-w9sbf/web-svc?follow=true&sinceSeconds=10: net/http: TLS handshake timeout
Error from server: Get https://172.17.0.6:10250/containerLogs/emojivoto/web-7f7b69d467-w9sbf/linkerd-proxy?follow=true&sinceSeconds=10: net/http: TLS handshake timeout
Error from server: Get https://172.17.0.6:10250/containerLogs/emojivoto/web-7f7b69d467-w9sbf/linkerd-init?follow=true&sinceSeconds=10: net/http: TLS handshake timeout
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2516.257199s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:38 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept:[*/*] Accept-Encoding:[gzip, deflate] Accept-Language:[fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3] Cache-Control:[no-cache] Pragma:[no-cache] Referer:[http://172.17.255.0/] User-Agent:[Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0]] {} <nil> 0 [] false 172.17.255.0 map[] map[] <nil> map[] 127.0.0.1:33898 /api/list <nil> <nil> <nil> 0xc00047e0f0}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2523.423009s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:45 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept-Encoding:[gzip] L5d-Dst-Canonical:[web-svc.emojivoto.svc.cluster.local:80] User-Agent:[Go-http-client/1.1] X-B3-Sampled:[1] X-B3-Spanid:[2229d22041321333] X-B3-Traceid:[e21697090568f270b4deb8464d8c96fa]] {} <nil> 0 [] false web-svc.emojivoto:80 map[] map[] <nil> map[] 127.0.0.1:34270 /api/list <nil> <nil> <nil> 0xc00057f4d0}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2527.429689s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:49 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept-Encoding:[gzip] L5d-Dst-Canonical:[web-svc.emojivoto.svc.cluster.local:80] User-Agent:[Go-http-client/1.1] X-B3-Sampled:[1] X-B3-Spanid:[47a066af9f620523] X-B3-Traceid:[bc0252ad9edb2ac49bfdb09c01181548]] {} <nil> 0 [] false web-svc.emojivoto:80 map[] map[] <nil> map[] 127.0.0.1:34270 /api/list <nil> <nil> <nil> 0xc00047ebd0}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:53 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept-Encoding:[gzip] L5d-Dst-Canonical:[web-svc.emojivoto.svc.cluster.local:80] User-Agent:[Go-http-client/1.1] X-B3-Sampled:[1] X-B3-Spanid:[918e8fcc5cc3e902] X-B3-Traceid:[5ab67ff9b2b31ef1f90d44da10360d5e]] {} <nil> 0 [] false web-svc.emojivoto:80 map[] map[] <nil> map[] 127.0.0.1:34270 /api/list <nil> <nil> <nil> 0xc00057ff20}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2531.435677s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2535.453758s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:57 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept-Encoding:[gzip] L5d-Dst-Canonical:[web-svc.emojivoto.svc.cluster.local:80] User-Agent:[Go-http-client/1.1] X-B3-Sampled:[1] X-B3-Spanid:[b605245bbbf3dbf2] X-B3-Traceid:[ce0959706dc28605ef27acfdffe97274]] {} <nil> 0 [] false web-svc.emojivoto:80 map[] map[] <nil> map[] 127.0.0.1:34270 /api/list <nil> <nil> <nil> 0xc000435830}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:12:21 Error serving request [&{GET /api/list HTTP/1.1 1 1 map[Accept:[*/*] Accept-Encoding:[gzip, deflate] Accept-Language:[fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3] Cache-Control:[no-cache] Pragma:[no-cache] Referer:[http://172.17.255.0/] User-Agent:[Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0]] {} <nil> 0 [] false 172.17.255.0 map[] map[] <nil> map[] 127.0.0.1:33898 /api/list <nil> <nil> <nil> 0xc00057e6c0}]: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: missing content-type field 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2559.210677s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2598.756060s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline 
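
To see how often the proxy hits its dispatch deadline, the warnings can be pulled out of a log dump like the one above. The following is a minimal sketch that assumes the line format shown in these logs (kubetail prefix, then `WARN [ <uptime>s]`); the names `DEADLINE_RE` and `deadline_warnings` are hypothetical and not part of any linkerd tooling.

```python
import re

# Matches kubetail-prefixed proxy warnings such as:
# [web-... linkerd-proxy] WARN [  2516.257199s] ... dispatch deadline
DEADLINE_RE = re.compile(
    r"\[(?P<pod>\S+) linkerd-proxy\] WARN \[\s*(?P<uptime>\d+\.\d+)s\] "
    r".*dispatch deadline"
)


def deadline_warnings(lines):
    """Return (pod, proxy_uptime_seconds) for each dispatch-deadline warning."""
    found = []
    for line in lines:
        match = DEADLINE_RE.search(line)
        if match:
            found.append((match.group("pod"), float(match.group("uptime"))))
    return found


if __name__ == "__main__":
    sample = [
        "[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2516.257199s] "
        "linkerd2_app_core::errors request aborted because it reached "
        "the configured dispatch deadline",
        "[web-7f7b69d467-vlqv2 web-svc] 2019/12/19 16:11:38 Error serving request",
        "[web-7f7b69d467-vlqv2 linkerd-proxy] WARN [  2523.423009s] "
        "linkerd2_app_core::errors request aborted because it reached "
        "the configured dispatch deadline",
    ]
    for pod, uptime in deadline_warnings(sample):
        print(pod, uptime)
```

Lines from the application container (`web-svc`) are ignored, so only the proxy's own aborts are counted.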

linkerd check output

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
× no unschedulable pods
    linkerd-controller-76cfcb4fb7-88kxm: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    linkerd-destination-5679b85b-w6tmn: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    linkerd-identity-d967f7bbf-z4dlf: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    linkerd-proxy-injector-694db6cb6b-rk7bm: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    linkerd-sp-validator-549796cf47-sw5px: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    linkerd-tap-5d6c94b654-4677v: 0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 3 node(s) had taints that the pod didn't tolerate.
    see https://linkerd.io/checks/#l5d-existence-unschedulable-pods for hints

Status check results are ×

This is expected: I installed linkerd in HA mode with 3 replicas, so the cluster doesn't have enough nodes to schedule the new pods.

# kr get pods                                                                                                                                              
NAME                                      READY   STATUS        RESTARTS   AGE
linkerd-controller-76cfcb4fb7-88kxm       0/3     Pending       0          63m
linkerd-controller-76cfcb4fb7-9pl5h       3/3     Running       0          106m
linkerd-controller-76cfcb4fb7-fl6bk       3/3     Terminating   0          106m
linkerd-controller-76cfcb4fb7-nzvb5       3/3     Running       0          106m
linkerd-destination-5679b85b-g2qs5        2/2     Running       0          106m
linkerd-destination-5679b85b-lql7z        2/2     Terminating   0          106m
linkerd-destination-5679b85b-w6tmn        0/2     Pending       0          63m
linkerd-destination-5679b85b-wllzw        2/2     Running       0          106m
linkerd-grafana-696c95f57d-gxxjx          2/2     Running       0          63m
linkerd-grafana-696c95f57d-tmbkh          2/2     Terminating   0          106m
linkerd-identity-d967f7bbf-hwb7l          2/2     Running       0          106m
linkerd-identity-d967f7bbf-jsk4v          2/2     Running       0          106m
linkerd-identity-d967f7bbf-q9ht5          2/2     Terminating   0          106m
linkerd-identity-d967f7bbf-z4dlf          0/2     Pending       0          63m
linkerd-prometheus-5596877b8-4k2l4        2/2     Running       0          63m
linkerd-prometheus-5596877b8-p7nt9        2/2     Terminating   0          106m
linkerd-proxy-injector-694db6cb6b-8fh57   2/2     Running       0          106m
linkerd-proxy-injector-694db6cb6b-gqcf7   2/2     Running       0          106m
linkerd-proxy-injector-694db6cb6b-nqp6p   2/2     Terminating   0          106m
linkerd-proxy-injector-694db6cb6b-rk7bm   0/2     Pending       0          63m
linkerd-sp-validator-549796cf47-bqxjk     2/2     Terminating   0          106m
linkerd-sp-validator-549796cf47-dc2dh     2/2     Running       0          106m
linkerd-sp-validator-549796cf47-fmplr     2/2     Running       0          106m
linkerd-sp-validator-549796cf47-sw5px     0/2     Pending       0          63m
linkerd-tap-5d6c94b654-2ssdb              2/2     Running       0          106m
linkerd-tap-5d6c94b654-4677v              0/2     Pending       0          63m
linkerd-tap-5d6c94b654-cjvj7              2/2     Running       0          106m
linkerd-tap-5d6c94b654-vqbfr              2/2     Terminating   0          106m
linkerd-web-7f57dc5547-lhrqs              2/2     Running       0          63m
linkerd-web-7f57dc5547-v9bmw              2/2     Terminating   0          106m
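
A listing like the one above is easier to read once tallied by status. The sketch below is a hypothetical helper (the name `status_counts` is invented here) that assumes the default `kubectl get pods` column layout with a `STATUS` header and whitespace-separated columns.

```python
from collections import Counter


def status_counts(table_text):
    """Count pods per STATUS value in `kubectl get pods` style output."""
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return Counter()
    # Locate the STATUS column from the header instead of hard-coding it.
    status_col = lines[0].split().index("STATUS")
    return Counter(line.split()[status_col] for line in lines[1:])


if __name__ == "__main__":
    # Three rows copied from the listing in this report.
    sample = """\
NAME                                      READY   STATUS        RESTARTS   AGE
linkerd-controller-76cfcb4fb7-88kxm       0/3     Pending       0          63m
linkerd-controller-76cfcb4fb7-9pl5h       3/3     Running       0          106m
linkerd-controller-76cfcb4fb7-fl6bk       3/3     Terminating   0          106m
"""
    for status, count in sorted(status_counts(sample).items()):
        print(status, count)
```

Splitting on whitespace works here because none of the columns in this listing contain spaces; output formats with multi-word columns would need a different parser.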

Environment

  • Kubernetes Version: 1.17
  • Cluster Environment: Kind
  • Host OS: Ubuntu 19.10
  • Linkerd version: 2.6.1

Possible solution

Additional context
