
Linkerd fails during node outage #4674

@Abrishges

Bug Report

What is the issue?

During an outage of a node where some Linkerd components ("linkerd-destination" and "linkerd-identity") were running, Linkerd appears to keep sending traffic to pods on the failed node. Linkerd is installed in HA mode in my cluster.

How can it be reproduced?

  1. Install Linkerd in HA mode.
  2. Mesh two apps, where app A talks to app B or vice versa.
  3. Stop the node where one of the Linkerd components, "linkerd-destination", is running. (The goal is to simulate a node outage, which happened to us and led us to this bug.) The pod waits out the 5-minute eviction timeout and is then rescheduled on a new node.
  4. Randomly, one or two of the app pods fail to make calls to the other app.
  5. This issue happens only when nodes shut down ungracefully.
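
For step 3, it helps to know which node currently hosts the "linkerd-destination" pod. Below is a minimal sketch, assuming the control plane runs in the `linkerd` namespace (an assumption, adjust as needed) and that the pod list was obtained with `kubectl get pods -n linkerd -o json`. The helper name `nodes_running` is hypothetical.

```python
import json


def nodes_running(pods_json, name_prefix):
    """Map pod name -> node name for pods whose name starts with name_prefix.

    pods_json is the text printed by `kubectl get pods -o json`; the
    namespace to query (e.g. `linkerd`) is an assumption of this sketch.
    """
    pods = json.loads(pods_json)
    return {
        item["metadata"]["name"]: item["spec"].get("nodeName")
        for item in pods.get("items", [])
        if item["metadata"]["name"].startswith(name_prefix)
    }
```

Calling `nodes_running(output, "linkerd-destination")` on that output lists the candidate nodes to stop. A pod that has not been scheduled yet maps to `None`.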

linkerd-proxy logs

2020-06-18T20:48:30.798501272Z [ 14263.555872710s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: request timed out
2020-06-18T20:48:31.317647919Z [ 14264.75155158s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:31.317687819Z [ 14264.75275658s]  WARN outbound:accept{peer.addr=172.21.14.116:54590}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast
2020-06-18T20:48:32.322513181Z [ 14265.80191420s]  WARN outbound:accept{peer.addr=172.21.14.116:52074}:source{target.addr=172.17.119.128:80}: linkerd2_app_core::errors: Failed to proxy request: Service in fail-fast

2020-06-18T22:36:56.281636047Z [   110.160896474s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:56.785530039Z [   110.664675566s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.288315626Z [   111.167593054s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
2020-06-18T22:36:57.79501223Z [   111.674226057s]  WARN outbound:accept{peer.addr=172.21.14.119:50512}:source{target.addr=172.17.140.43:80}:logical{addr=commandproxy-svc.commandproxy:80}:profile:balance{addr=commandproxy-svc.commandproxy.svc.cluster.local:80}:endpoint{peer.addr=172.21.1.17:80}: rustls::session: Sending fatal alert BadCertificate
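
To check whether the failing addresses belong to pods on the stopped node, the warnings can be tallied per address. This is a rough sketch that assumes the proxy log format shown in the lines above; other proxy versions may log differently. The function name `tally_warnings` is hypothetical.

```python
import re
from collections import Counter

# Patterns derived only from the WARN lines in this report.
TARGET_RE = re.compile(r"target\.addr=([0-9.]+:\d+)")
ENDPOINT_RE = re.compile(r"endpoint\{peer\.addr=([0-9.]+:\d+)\}")


def tally_warnings(lines):
    """Count WARN lines per (address, message) pair."""
    counts = Counter()
    for line in lines:
        if " WARN " not in line:
            continue
        # Prefer the concrete endpoint address when the line carries one,
        # otherwise fall back to the original target address.
        match = ENDPOINT_RE.search(line) or TARGET_RE.search(line)
        if match is None:
            continue
        # The message is whatever follows the last ": " separator.
        message = line.rsplit(": ", 1)[-1].strip()
        counts[(match.group(1), message)] += 1
    return counts
```

The resulting addresses can then be compared against the pod IPs reported by `kubectl get pods -o wide` to see which ones were on the failed node.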

linkerd check output

kubernetes-api
--------------
√ can initialize the client                                                         
√ can query the Kubernetes API                                                      
                                                                                    
kubernetes-version                                                                  
------------------                                                                  
√ is running the minimum Kubernetes API version                                     
√ is running the minimum kubectl version                                            
                                                                                    
linkerd-existence                                                                   
-----------------                                                                   
√ 'linkerd-config' config map exists                                                
√ heartbeat ServiceAccount exist                                                    
√ control plane replica sets are ready                                              
√ no unschedulable pods                                                             
√ controller pod is running                                                         
√ can initialize the client                                                         
√ can query the control plane API                                                   
                                                                                    
linkerd-config                                                                      
--------------                                                                      
√ control plane Namespace exists                                                    
√ control plane ClusterRoles exist                                                  
√ control plane ClusterRoleBindings exist                                           
√ control plane ServiceAccounts exist                                               
√ control plane CustomResourceDefinitions exist                                     
√ control plane MutatingWebhookConfigurations exist                                 
√ control plane ValidatingWebhookConfigurations exist                               
√ control plane PodSecurityPolicies exist                                           
                                                                                    
linkerd-identity                                                                    
----------------                                                                    
√ certificate config is valid                                                       
√ trust roots are using supported crypto algorithm                                  
√ trust roots are within their validity period                                      
√ trust roots are valid for at least 60 days                                        
√ issuer cert is using supported crypto algorithm                                   
√ issuer cert is within its validity period                                         
√ issuer cert is valid for at least 60 days                                         
√ issuer cert is issued by the trust root                                           
                                                                                    
linkerd-api                                                                         
-----------                                                                         
√ control plane pods are ready                                                      
√ control plane self-check                                                          
√ [kubernetes] control plane can talk to Kubernetes                                 
√ [prometheus] control plane can talk to Prometheus                                 
√ tap api service is running                                                        
                                                                                    
linkerd-version                                                                     
---------------                                                                     
√ can determine the latest version                                                  
‼ cli is up-to-date                                                                 
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-cli for hints                        
                                                                                    
control-plane-version                                                               
---------------------                                                               
‼ control plane is up-to-date                                                       
    is running version 2.7.1 but the latest stable version is 2.8.1                 
    see https://linkerd.io/checks/#l5d-version-control for hints                    
√ control plane and cli versions match                                              
                                                                                    
linkerd-ha-checks                                                                   
-----------------                                                                   
√ pod injection disabled on kube-system                                             
                                                                                    
Status check results are √ 

Environment

  • Kubernetes Version: 1.16.9
  • Cluster Environment: AKS
  • Linkerd version: stable-2.7.1

Possible solution

Additional context

There is a similar open GitHub issue; I am not 100 percent sure they are the same.
