Retries on GRPCRoutes cause spontaneous failures #14050

@FredrikAugust

What is the issue?

After installing Linkerd 2.18, or the corresponding edge-25.4.4 release, we started seeing very high error rates on our GRPCRoutes. Most requests fail because our clients see the responses they receive as trailers-only, even though the servers return proper responses.

The problem is not present on edge-25.4.3.

If we remove the retry annotations from the GRPCRoutes (Gateway API CR), the problem disappears 🤔

Original message from the #linkerd channel in the CNCF Slack:

Hey, I'm in the middle of debugging a very curious bug introduced by Linkerd in edge-25.4.4 (the 2.18 release; also present in BEL). A random but very significant share of requests (anywhere from 10% to 90%) fail when the response reaches the client. Our gRPC client of choice, Connect RPC with the gRPC transport, logs the following error:
protocol error: missing output message for unary method
It is thrown from here: https://github.com/connectrpc/connect-es/blob/9f8d28a82d9c8d58b28ccba334d7e7f138aa330b/packages/connect/src/protocol-grpc/transport.ts#L153. The error is raised when the response is trailers-only. We've checked the server and it returns everything as normal, so the problem is intermittent and happening in Linkerd. It is not present in edge-25.4.3. Even more curiously, the bug disappears completely when we remove the following annotation:
retry.linkerd.io/grpc: unavailable,resource-exhausted
on the GRPCRoute.
The only strange thing we observed was that the Response Length logged by linkerd-viz was 0B for both the successful and unsuccessful requests. (I'll post more in thread).
I'm still working on a minimal reproduction, but wanted to hear if anyone had any experiences similar to this or tips.
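
For readers unfamiliar with the error, here is a minimal sketch of the kind of check that produces it. This is not the connect-es source: the `Frame` type and `readUnary` function are hypothetical names made up for illustration, and the real client works on streams rather than arrays. It only shows why a response that carries trailers with an OK status but no message body ends up as "missing output message for unary method".

```typescript
// Hypothetical model of what a gRPC client sees for one unary call.
// These names are illustrative assumptions, not the connect-es API.
type Frame =
  | { kind: "message"; data: Uint8Array }
  | { kind: "trailers"; grpcStatus: number };

// A unary response must contain exactly one message followed by trailers.
function readUnary(frames: Frame[]): Uint8Array {
  let message: Uint8Array | undefined;
  let status: number | undefined;
  for (const frame of frames) {
    if (frame.kind === "message") {
      if (message !== undefined) {
        throw new Error("protocol error: received extra output message for unary method");
      }
      message = frame.data;
    } else {
      status = frame.grpcStatus;
    }
  }
  if (status === undefined) {
    throw new Error("protocol error: missing trailers");
  }
  if (status !== 0) {
    // A genuine server-side failure surfaces as a status error instead.
    throw new Error(`rpc failed with grpc-status ${status}`);
  }
  if (message === undefined) {
    // Trailers-only response with an OK status: the case reported in this issue.
    throw new Error("protocol error: missing output message for unary method");
  }
  return message;
}
```

In this sketch, a response consisting of only `{ kind: "trailers", grpcStatus: 0 }` throws the error above, while the same trailers preceded by a single message frame return that message.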

How can it be reproduced?

I've created a minimal reproduction here: https://github.com/FredrikAugust/linkerd-grpc-route-retry-bug-repro. You can start this by running task setup. The client pods will log the errors.

It should be noted that both of the reproductions there use the ConnectRPC gRPC implementations. I'm unsure if that is relevant, but given that it worked on the last minor version I believe there is something strange happening here.

Logs, error output, etc

(Two screenshots were attached to the original issue; they are not reproduced here.)

output of linkerd check -o short

linkerd-version
---------------
‼ cli is up-to-date
    is running version 25.4.4 but the latest edge version is 25.5.4
    see https://linkerd.io/2/checks/#l5d-version-cli for hints

control-plane-version
---------------------
‼ control plane is up-to-date
    is running version 25.4.4 but the latest edge version is 25.5.4
    see https://linkerd.io/2/checks/#l5d-version-control for hints

linkerd-control-plane-proxy
---------------------------
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
	* linkerd-destination-69c5f9b548-klgts (edge-25.4.4)
	* linkerd-identity-78dd4d74f7-28b57 (edge-25.4.4)
	* linkerd-proxy-injector-c5646799c-jjrr7 (edge-25.4.4)
    see https://linkerd.io/2/checks/#l5d-cp-proxy-version for hints

Status check results are √

Environment

  • Present in both local (k3d/k3s) cluster and in production (GKE)
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.31.5+k3s1
WARNING: version difference between client (1.29) and server (1.31) exceeds the supported minor version skew of +/-1

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

maybe
