Description
What is the issue?
After installing Linkerd 2.18, or the corresponding edge-25.4.4 release, we started seeing very high error rates on our GRPCRoutes. Most requests fail because our clients treat the responses they receive as trailers-only, even though the servers return proper responses.
This is not present on 25.4.3.
If we remove the retry annotations from the GRPCRoutes (Gateway API CR), the problem disappears 🤔
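For reference, a minimal sketch of the kind of GRPCRoute that triggers the problem. The annotation is the one we actually use; the route name, namespace, Service name, and port are hypothetical placeholders, not our real configuration.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: example-route          # hypothetical name
  namespace: example           # hypothetical namespace
  annotations:
    # Removing this annotation makes the errors disappear.
    retry.linkerd.io/grpc: unavailable,resource-exhausted
spec:
  parentRefs:
    - group: core
      kind: Service
      name: example-server     # hypothetical Service
      port: 8080               # hypothetical port
  rules:
    - backendRefs:
        - name: example-server
          port: 8080
```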
Original message from the #linkerd channel in CNCF slack (link to message):
Hey, I'm in the middle of debugging a very curious bug introduced by Linkerd in edge-25.4.4 (the 2.18 release; also present in BEL). A random but very significant share of requests (anywhere from 10% to 90%) fail when the client receives the response, with the following error message logged by our gRPC client of choice, Connect RPC with the gRPC transport:
protocol error: missing output message for unary method
The error is thrown from here: https://github.com/connectrpc/connect-es/blob/9f8d28a82d9c8d58b28ccba334d7e7f138aa330b/packages/connect/src/protocol-grpc/transport.ts#L153, which is the code path for a trailers-only response. We've checked the server and it returns everything as normal, so the problem is intermittent and happening in Linkerd. The problem is not present in edge-25.4.3. Even more curiously, the bug completely disappears when we remove the following annotation:
retry.linkerd.io/grpc: unavailable,resource-exhausted
on the GRPCRoute.
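To illustrate why a trailers-only response surfaces as this error, here is a simplified sketch of the check a unary gRPC client performs. This is not the connect-es implementation; the `UnaryResponse` shape and the `readUnary` function are hypothetical and only model the idea that a successful status with no message in the body is a protocol violation for a unary method.

```typescript
// Hypothetical, simplified model of a unary gRPC response as seen by a client.
interface UnaryResponse {
  // Messages decoded from the response body (empty for a trailers-only response).
  messages: Uint8Array[];
  // Trailing metadata, including "grpc-status".
  trailers: Record<string, string>;
}

// Returns the single response message, or throws if the response is invalid.
function readUnary(res: UnaryResponse): Uint8Array {
  const status = res.trailers["grpc-status"];
  if (status !== "0") {
    // A missing or non-zero status is an ordinary RPC failure.
    throw new Error(`rpc failed with grpc-status ${status}`);
  }
  if (res.messages.length === 0) {
    // Status says OK but the body carried no message: this is the case we hit.
    throw new Error("protocol error: missing output message for unary method");
  }
  if (res.messages.length > 1) {
    throw new Error("protocol error: received extra output message for unary method");
  }
  return res.messages[0];
}
```

Under this model, a response that arrives at the client with an OK status but without its message would produce exactly the error we see, even though the server sent a complete response.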
The only strange thing we observed was that the Response Length logged by linkerd-viz was 0B for both the successful and unsuccessful requests. (I'll post more in thread).
I'm still working on a minimal reproduction, but wanted to hear if anyone had any experiences similar to this or tips.
How can it be reproduced?
I've created a minimal reproduction here: https://github.com/FredrikAugust/linkerd-grpc-route-retry-bug-repro. You can start this by running task setup. The client pods will log the errors.
It should be noted that both of the reproductions there use the ConnectRPC gRPC implementations. I'm unsure if that is relevant, but given that it worked on the last minor version I believe there is something strange happening here.
Logs, error output, etc
output of linkerd check -o short
linkerd-version
---------------
‼ cli is up-to-date
is running version 25.4.4 but the latest edge version is 25.5.4
see https://linkerd.io/2/checks/#l5d-version-cli for hints
control-plane-version
---------------------
‼ control plane is up-to-date
is running version 25.4.4 but the latest edge version is 25.5.4
see https://linkerd.io/2/checks/#l5d-version-control for hints
linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
some proxies are not running the current version:
* linkerd-destination-69c5f9b548-klgts (edge-25.4.4)
* linkerd-identity-78dd4d74f7-28b57 (edge-25.4.4)
* linkerd-proxy-injector-c5646799c-jjrr7 (edge-25.4.4)
see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints
Status check results are √
Environment
- Present in both local (k3d/k3s) cluster and in production (GKE)
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.5+k3s1
WARNING: version difference between client (1.29) and server (1.31) exceeds the supported minor version skew of +/-1
Possible solution
No response
Additional context
No response
Would you like to work on fixing this bug?
maybe

