Skip to content

gRPC traffic is not retried because there is no content-length header #7141

@cpretzer

Description

@cpretzer

What is the issue?

While testing gRPC retries using the emojivoto application, I found that the gRPC requests between the web and voting services are not retryable because the gRPC protocol doesn't have a content-length header, which is part of the logic that is used to determine whether a POST request is retryable.

How can it be reproduced?

  1. Deploy linkerd 2.11 to a cluster (assumes the CLI is installed): linkerd install | kubectl apply -f -
  2. Deploy emojivoto: kubectl apply -f https://run.linkerd.io/emojivoto.yml
  3. Scale vote-bot to 0 replicas, because we want to target only the VoteDoughnut route
  4. Deploy the voting ServiceProfile with isRetryable enabled for all routes: kubectl apply -f https://raw.githubusercontent.com/BuoyantIO/emojivoto/main/training/service-profiles/voting-svc-profile.yml
  5. Set the proxy-log-level to linkerd=trace,info for the web pod: docs
  6. Port forward to the web service: kubectl port-forward svc/web -n emojivoto 8080:80
  7. Make a request to vote for the doughnut: curl http://localhost:8080/api/vote?choice=:doughnut:
  8. View the proxy logs for the web pod and you will see something like the output below

Logs, error output, etc

2021-10-21T17:33:24.352503923Z [   564.043062s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service ready=true ok=true
2021-10-21T17:33:24.352518037Z [   564.043072s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://voting-svc.emojivoto:8080/emojivoto.v1.VotingService/VoteDoughnut, version: HTTP/2.0, headers: {"content-type": "application/grpc", "user-agent": "grpc-go/1.29.1", "te": "trailers", "grpc-trace-bin": "AACG+V5sxnQNK5cC/n1VpjOjAU3DqwK1qCUqAgA"}, body: BoxBody }
2021-10-21T17:33:24.352667528Z [   564.043231s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_service_profiles::http::route_request: Using configured route condition=All([Method(POST), Path(^/emojivoto\.v1\.VotingService/VoteDoughnut$)])
2021-10-21T17:33:24.352675121Z [   564.043242s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_retry: retryable=true
2021-10-21T17:33:24.352679109Z [   564.043246s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}: linkerd_app_core::retry: not retryable req.has_body=true req.content_length=None
2021-10-21T17:33:24.352696058Z [   564.043270s] TRACE ThreadId(01) outbound:accept{client.addr=10.42.0.91:41620}:server{orig_dst=10.43.129.115:8080}:profile:http{v=h2}:logical{dst=voting-svc.emojivoto.svc.cluster.local:8080}:concrete{addr=voting-svc.emojivoto.svc.cluster.local:8080}:endpoint{server.addr=10.42.0.75:8080}: linkerd_reconnect: Ready

linkerd check output

Linkerd core checks
===================

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version
√ is running the minimum kubectl version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all node podCIDRs

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
√ cli is up-to-date

control-plane-version
---------------------
√ can retrieve the control plane version
√ control plane is up-to-date
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
√ control plane proxies are up-to-date
√ control plane proxies and cli versions match

Status check results are √

Linkerd extensions checks
=========================

linkerd-viz
-----------
√ linkerd-viz Namespace exists
√ linkerd-viz ClusterRoles exist
√ linkerd-viz ClusterRoleBindings exist
√ tap API server has valid cert
√ tap API server cert is valid for at least 60 days
√ tap API service is running
√ linkerd-viz pods are injected
√ viz extension pods are running
√ viz extension proxies are healthy
√ viz extension proxies are up-to-date
√ viz extension proxies and cli versions match
√ prometheus is installed and configured correctly
√ can initialize the client
√ viz extension self-check

Status check results are √

Environment

  • Kubernetes Version: v1.21.3+k3s1
  • Cluster Environment: k3d
  • Host OS: PopOS
  • Linkerd version: 2.11

Possible solution

I'm still researching the gRPC protocol, so no proposed solution. After discussing with the Linkerd maintainers, there are a few things to take into consideration for a solution:

  • The current implementation is for POST requests, which do have a a content-length header, so the solution should preserve support for POST bodies
  • gRPC can be Unary or Streaming, which should be taken into consideration
  • The gRPC protocol includes a Message-Length field as part of a DATA frame, but there may be a more reliable way to cover both H1 POST requests and gRPC requests

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions