Bug Report
What is the issue?
During a node outage on a node where some of the Linkerd control plane components ("linkerd-destination" and "linkerd-identity") were running, it looks like Linkerd keeps sending traffic to pods on the failed node. In my cluster Linkerd is installed in HA mode.
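A quick way to see which node is hosting linkerd-destination and linkerd-identity at the time (assuming the default `linkerd` control plane namespace):
```
# Show each control-plane pod and the node it is scheduled on
kubectl get pods -n linkerd -o wide | grep -E 'destination|identity'
```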
How can it be reproduced?
- Install Linkerd in HA mode
- Mesh two apps, where app A talks to app B or vice versa
- Stop the node where the Linkerd component "linkerd-destination" was running (the goal is to simulate the node outage that happened to us and led us to this bug; see the command sketch after this list). During this time the pod sits for the 5-minute eviction timeout before being rescheduled onto a new node.
- Randomly, one or two of the app pods will fail to make calls to the other app
- This issue happens only when there is an ungraceful shutdown of a node
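A rough sketch of these steps as commands; the `demo` namespace, the `app-a`/`app-b` deployments, and the AKS scale-set names are placeholders, not taken from the actual cluster:
```
# 1. Install Linkerd in HA mode
linkerd install --ha | kubectl apply -f -

# 2. Mesh two apps (app-a calls app-b) by enabling injection on their namespace
kubectl annotate namespace demo linkerd.io/inject=enabled
kubectl -n demo rollout restart deploy app-a app-b

# 3. Find the node running linkerd-destination and stop it ungracefully
kubectl get pods -n linkerd -o wide | grep destination
# On AKS this can be simulated by deallocating the VMSS instance backing that node
az vmss deallocate --resource-group <rg> --name <nodepool-vmss> --instance-ids <id>

# 4. Watch app-a's proxy logs; some pods start failing as in the logs below
kubectl -n demo logs deploy/app-a -c linkerd-proxy -f
```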
linkerd-proxy logs
```
2020-06-18T20:48:30.798501272Z [ 14263.555872710s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
2020-06-18T20:48:31.317647919Z [ 14264.75155158s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:31.317687819Z [ 14264.75275658s] WARN outbound:accept{peer.addr=172.21.14.116:54590}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:32.322513181Z [ 14265.80191420s] WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
```
```
2020-06-18T22:36:56.281636047Z [ 110.160896474s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:56.785530039Z [ 110.664675566s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.288315626Z [ 111.167593054s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.79501223Z [ 111.674226057s] WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
```
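The failing endpoint above (172.21.1.17) is presumably the pod on the stopped node. One way to check whether Kubernetes itself still lists that address, or whether the proxy is holding on to a stale endpoint from the destination service, is to compare against the Endpoints object (service and namespace taken from the log lines above):
```
# Addresses Kubernetes currently advertises for the service
kubectl -n commandproxy get endpoints commandproxy-svc -o wide

# Pods behind the service and the nodes they are scheduled on
kubectl -n commandproxy get pods -o wide
```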
linkerd check output
kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API
kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version
linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ controller pod is running
√ can initialize the client
√ can query the control plane API
linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ control plane PodSecurityPolicies exist
linkerd-identity
----------------
√ certificate config is valid
√ trust roots are using supported crypto algorithm
√ trust roots are within their validity period
√ trust roots are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust root
linkerd-api
-----------
√ control plane pods are ready
√ control plane self-check
√ [kubernetes] control plane can talk to Kubernetes
√ [prometheus] control plane can talk to Prometheus
√ tap api service is running
linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
is running version 2.7.1 but the latest stable version is 2.8.1
see https://linkerd.io/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 2.7.1 but the latest stable version is 2.8.1
see https://linkerd.io/checks/#l5d-version-control for hints
√ control plane and cli versions match
linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system
Status check results are √
Environment
- Kubernetes Version: 1.16.9
- Cluster Environment: AKS
- Linkerd version: stable-2.7.1
Possible solution
Additional context
There is a similar open GitHub issue; I am not 100 percent sure it is the same problem.