Closed
Description
What is the issue?
We run canary deployments with Argo Rollouts via SMI, so there is a lot of EndpointSlice churn from frequent release trains and their 10-15 minute canary rollouts (~300-400 meshed pods).
Occasionally the destination service goes into failfast mode. This is a problem during a canary deployment because the IPs of old pods that Argo Rollouts has already torn down keep receiving traffic even though they should not. To recover, we have to restart the destination deployment.
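One way to confirm the stale-IP symptom, as a rough sketch only: compare the addresses the destination service is still handing out (for example, collected by hand from `linkerd diagnostics endpoints`) with the pod IPs that currently back the service. The function name, its inputs, and the sample IPs below are hypothetical and not part of Linkerd; gathering the two lists is left out.

```python
def stale_endpoints(served, live):
    """Return addresses that are still served but no longer back a live pod.

    served: iterable of IP strings the destination service returns (assumed input)
    live:   iterable of IP strings of pods that currently exist (assumed input)
    """
    live_set = set(live)
    # Deduplicate the served list, keep only addresses with no live pod behind them.
    return sorted(addr for addr in set(served) if addr not in live_set)


# Hypothetical sample data: 10.0.0.1 belonged to a pod that was already torn down.
served = ["10.0.0.1", "10.0.0.2"]
live = ["10.0.0.2", "10.0.0.3"]
print(stale_endpoints(served, live))
```

A non-empty result while a rollout is finishing would match the behavior described above.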
How can it be reproduced?
This happens intermittently under load. We run 5 instances of each control plane component.
Logs, error output, etc
The following are all logs from destination pods:
[ 18093.017333s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed
[951475.788267s] WARN ThreadId(01) linkerd_reconnect: Failed to connect error=Connection refused (os error 111)
[950833.887787s] WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
[950842.896953s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s
[ 18093.017325s] WARN ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.243.33:34358}: linkerd_app_inbound::http::server: Unexpected error error=client 10.30.243.33:34358: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
[ 3.751136s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 14118.108714s] WARN ThreadId(01) inbound: linkerd_app_core::serve: Server failed to accept connection error=failed to obtain peer address: Transport endpoint is not connected (os error 107) error.sources=[Transport endpoint is not connected (os error 107)]
[ 18093.017344s] INFO ThreadId(01) inbound:server{port=8090}:rescue{client.addr=10.30.156.57:57528}: linkerd_app_core::errors::respond: gRPC request failed error=client 10.30.156.57:57528: server: 10.30.205.39:8090: server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed error.sources=[server 10.30.205.39:8090: service linkerd-policy.linkerd.svc.cluster.local:8090: operation was canceled: connection closed, operation was canceled: connection closed, connection closed]
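For anyone triaging similar logs, a minimal sketch of how the failfast entries could be pulled out of the proxy output and ordered by proxy uptime. The regular expression is written against the two failfast lines shown above and is an assumption about the log format, not a documented Linkerd schema; `failfast_events` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [950833.887787s] WARN ... controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s
FAILFAST = re.compile(
    r"^\[\s*(?P<uptime>\d+\.\d+)s\]\s+WARN\s+.*?"
    r"controller\{addr=(?P<addr>[^}]+)\}.*?"
    r"Service entering failfast after (?P<timeout>\d+)s"
)


def failfast_events(lines):
    """Return (uptime_seconds, controller_addr, timeout_seconds) tuples, sorted by uptime."""
    events = []
    for line in lines:
        m = FAILFAST.search(line)
        if m:
            events.append((float(m["uptime"]), m["addr"], int(m["timeout"])))
    return sorted(events)


sample = [
    "[950842.896953s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}: linkerd_stack::failfast: Service entering failfast after 10s",
    "[950833.887787s] WARN ThreadId(01) inbound:server{port=8086}:controller{addr=localhost:8086}: linkerd_stack::failfast: Service entering failfast after 1s",
    "[ 18093.017333s] WARN ThreadId(01) linkerd_reconnect: Service failed error=channel closed",
]
for event in failfast_events(sample):
    print(event)
```

Sorting by the uptime field makes it easier to see which controller address entered failfast first; lines that are not failfast warnings are ignored.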
output of linkerd check -o short
linkerd-identity
----------------
‼ issuer cert is valid for at least 60 days
issuer certificate will expire on --------------
see https://linkerd.io/2.13/checks/#l5d-identity-issuer-cert-not-expiring-soon for hints
linkerd-version
---------------
‼ cli is up-to-date
is running version 2.13.4 but the latest stable version is 2.13.6
see https://linkerd.io/2.13/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane and cli versions match
control plane running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-cp-proxy-cli-version for hints
linkerd-viz
-----------
‼ viz extension proxies and cli versions match
linkerd-destination-789df68585-288sg running stable-2.13.6 but cli running stable-2.13.4
see https://linkerd.io/2.13/checks/#l5d-viz-proxy-cli-version for hints
Status check results are √
Environment
- Kubernetes: v1.24.16-eks
- linkerd: 2.13.6
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
None