Skip to content

retry gRPC requests are immediately terminated by trailers#1706

Merged
hawkw merged 47 commits intomainfrom
eliza/grpc-retry-thing
Jun 22, 2022
Merged

retry gRPC requests are immediately terminated by trailers#1706
hawkw merged 47 commits intomainfrom
eliza/grpc-retry-thing

Conversation

@hawkw
Copy link
Contributor

@hawkw hawkw commented May 24, 2022

Currently, gRPC responses are classified for metrics based on the value
of the grpc-status header in either the initial HEADERS frame or
in the TRAILERS frame that terminates the response body. However, this
is not the case for retries. Because we don't want to buffer the
entire response body before forwarding it to the client (introducing
latency, and potentially breaking client behavior in the case of
long-running streams), we make retry decisions for gRPC based only on
grpc-status in the initial HEADERS frame.

Unfortunately, some gRPC implementations (such as Akka) will send the
grpc-status header in a TRAILERS frame even when the body is empty
(see linkerd/linkerd2#7701). In that case, Linkerd won't retry the
request, even though it wouldn't actually have to wait to buffer a long
body before making a retry decision.

This branch resolves this issue by changing the retry implementation so
that the proxy will wait for a single additional frame after the first
HEADERS frame before making a retry decision. If the next frame is a
TRAILERS frame that terminates the response stream, we can include the
value of the grpc-status in that TRAILERS frame in the retry
decision. Otherwise, if the next frame is a DATA frame, we buffer only
a single frame, give up on trying to retry, and forward that frame to
the client, followed by the rest of the body.

Unfortunately, this change is a bit complicated, as it was necessary to
introduce the ability for retries to modify the response body as well as
the request body. This then implies that we need to be able to erase
response body types as well, when they may differ based on whether or
not retries are enabled for a route. I tried a number of different ways
of structuring this change, such as passing an additional Layer into
the Retry service that wraps the service constructed for retries, but
this was kind of a mess. Ultimately, I felt like the cleanest way of
doing this was changing the linkerd_retry::PrepareRequest trait to a
PrepareRetry trait that can also modify response bodies.

I'm open to suggestions for a nicer way of structuing this change. I
also considered the alternative of giving up on tower::retry entirely
and writing our own retry service from scratch, but I wanted to try to
avoid that if it was possible.

Depends on #1705 and #1726.

Fixes linkerd/linkerd2#7701.

hawkw added a commit that referenced this pull request May 31, 2022
The `linkerd-http-box` crate currently contains `BoxRequest` and
`BoxResponse` middleware for boxing request/response bodies, and an
`EraseRequest` middleware for erasing the type of request bodies, for
use in a stack which must be able to serve *multiple* request body
types. However, there is no corresponding `EraseResponse` middleware.

This branch adds an `EraseResponse` middleware in `linkerd-http-box`.
This is necessary for PR #1706, but I factored this out into its own
branch as it's a fairly self-contained change.
@hawkw hawkw force-pushed the eliza/grpc-retry-thing branch from e75348e to 38b6036 Compare May 31, 2022 17:51
hawkw added a commit that referenced this pull request Jun 1, 2022
Currently, `ServiceExt::oneshot` is used to take a `Service` by value,
poll it until it's ready, and then call that service. This composes well
with `Service` middleware, but it does not compose with our `Proxy`
trait. If a `Service` must be wrapped in a `Proxy` call, it is
impossible to oneshot it.

This branch adds a new `Proxy::proxy_oneshot` method to `Proxy` that
takes a `Proxy` and a `Service` by value, and returns a `Future` that
drives the `Service` to readiness and then calls it through that
`Proxy`.

Currently, this is unused, but it will be used in PR #1706 (from which
this change was factored out).
hawkw added a commit that referenced this pull request Jun 1, 2022
This branch adds a `Proxy::into_service` method that composes a `Proxy`
with a `Service`, returning a new `Service` that calls the inner
`Service` through that `Proxy`.

This negates the need for `Proxy::proxy_oneshot`, so this also
closes #1725.

Currently, this is unused, but it will be used in PR #1706 (from which
this change was factored out).
@hawkw hawkw marked this pull request as ready for review June 2, 2022 19:44
@hawkw hawkw requested a review from a team as a code owner June 2, 2022 19:44
@hawkw hawkw changed the title [WIP] retry gRPC requests that fail in trailers retry: retry gRPC requests are immediately terminated by trailers Jun 2, 2022
@hawkw hawkw changed the title retry: retry gRPC requests are immediately terminated by trailers retry gRPC requests are immediately terminated by trailers Jun 2, 2022
hawkw added a commit that referenced this pull request Jun 3, 2022
The `linkerd-http-box` crate currently contains `BoxRequest` and
`BoxResponse` middleware for boxing request/response bodies, and an
`EraseRequest` middleware for erasing the type of request bodies, for
use in a stack which must be able to serve *multiple* request body
types. However, there is no corresponding `EraseResponse` middleware.

This branch adds an `EraseResponse` middleware in `linkerd-http-box`.
This is necessary for PR #1706, but I factored this out into its own
branch as it's a fairly self-contained change.
hawkw added a commit that referenced this pull request Jun 3, 2022
…1726)

This branch adds a `Proxy::into_service` method that composes a `Proxy`
with a `Service`, returning a new `Service` that calls the inner
`Service` through that `Proxy`.

This negates the need for `Proxy::proxy_oneshot`, so this also
closes #1725.

Currently, this is unused, but it will be used in PR #1706 (from which
this change was factored out).

Signed-off-by: Eliza Weisman <[email protected]>
hawkw added 14 commits June 3, 2022 12:14
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
Signed-off-by: Eliza Weisman <[email protected]>
@hawkw hawkw force-pushed the eliza/grpc-retry-thing branch from f972039 to 9e9d408 Compare June 3, 2022 19:23
olix0r added a commit that referenced this pull request Jun 22, 2022
In preparation for #1706, this change cleans up the type signatures used
by the integration test server.

Signed-off-by: Oliver Gould <[email protected]>
olix0r added a commit that referenced this pull request Jun 22, 2022
In preparation for #1706, this change cleans up the type signatures used
by the integration test server.

Signed-off-by: Oliver Gould <[email protected]>
olix0r added a commit that referenced this pull request Jun 22, 2022
In preparation for #1706, this change moves the `http_retry::ReplayBody`
type into a distinct module. This helps to make a cleaner diff in the
aforementioned PR.

No functional changes.

Signed-off-by: Oliver Gould <[email protected]>
hawkw pushed a commit that referenced this pull request Jun 22, 2022
In preparation for #1706, this change moves the `http_retry::ReplayBody`
type into a distinct module. This helps to make a cleaner diff in the
aforementioned PR.

No functional changes.

Signed-off-by: Oliver Gould <[email protected]>
Copy link
Member

@olix0r olix0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good to me! Some questions about error handling

@hawkw hawkw requested a review from olix0r June 22, 2022 18:00
Signed-off-by: Eliza Weisman <[email protected]>
@hawkw
Copy link
Contributor Author

hawkw commented Jun 22, 2022

@olix0r I've made the error handling changes you suggested --- thanks for that, good catch! Also, I did a bit of cleaning up the WithTrailers implementation to make it clearer that it can't eat additional data frames.

Copy link
Member

@olix0r olix0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Comment on lines +97 to +102
// XXX(eliza): the documentation for the `http::Body` trait says
// that `poll_trailers` should only be called after `poll_data`
// returns `None`...but, in practice, I'm fairly sure that this just
// means that it *will not return `Ready`* until there are no data
// frames left, which is fine for us here, because we `now_or_never`
// it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe worth a clarification on http_body...

@hawkw hawkw merged commit 7f817b5 into main Jun 22, 2022
@hawkw hawkw deleted the eliza/grpc-retry-thing branch June 22, 2022 22:08
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Jun 30, 2022
This release updates the proxy's service discovery module to avoid
redundant load balancer updates that could cause unnecessary connection
churn.

This release also includes improvements to the proxy's retry handling of
gRPC requests. The proxy would not retry requests when a response's
status code was emitted in a TRAILERS frame. This has been fixed.

This release also includes a number of internal changes that set up for
per-route authorization. There should be no user-facing impact at this
point except for the introduction of additional metrics labels.

---

* build(deps): bump mio from 0.8.3 to 0.8.4 (linkerd/linkerd2-proxy#1760)
* build(deps): bump quote from 1.0.18 to 1.0.19 (linkerd/linkerd2-proxy#1761)
* build(deps): bump tower-service from 0.3.1 to 0.3.2 (linkerd/linkerd2-proxy#1762)
* build(deps): bump proc-macro2 from 1.0.39 to 1.0.40 (linkerd/linkerd2-proxy#1763)
* build(deps): bump syn from 1.0.96 to 1.0.98 (linkerd/linkerd2-proxy#1764)
* build(deps): bump prettyplease from 0.1.12 to 0.1.14 (linkerd/linkerd2-proxy#1766)
* build(deps): bump anyhow from 1.0.57 to 1.0.58 (linkerd/linkerd2-proxy#1767)
* dev: Update build settings (linkerd/linkerd2-proxy#1765)
* Dedupe discovery updates (linkerd/linkerd2-proxy#1759)
* build(deps): bump quote from 1.0.19 to 1.0.20 (linkerd/linkerd2-proxy#1768)
* deny: Remove tokio-util from exceptions (linkerd/linkerd2-proxy#1769)
* dev: Update memory contraints (linkerd/linkerd2-proxy#1770)
* Reorganize `server-policy` to set up for routes (linkerd/linkerd2-proxy#1771)
* inbound: Rename policy-enforcement layers (linkerd/linkerd2-proxy#1772)
* ci: Split fuzzer logic into a script (linkerd/linkerd2-proxy#1773)
* build(deps): bump prettyplease from 0.1.14 to 0.1.15 (linkerd/linkerd2-proxy#1775)
* build(deps): bump indexmap from 1.9.0 to 1.9.1 (linkerd/linkerd2-proxy#1776)
* integration: Cleanup test server (linkerd/linkerd2-proxy#1777)
* http-retry: Move the ReplayBody type into a module (linkerd/linkerd2-proxy#1778)
* inbound: Add route authorization labels (linkerd/linkerd2-proxy#1774)
* Rename HTTPRoutePermit to HttpRoutePermit (linkerd/linkerd2-proxy#1779)
* retry gRPC requests are immediately terminated by trailers (linkerd/linkerd2-proxy#1706)
* inbound: Record policy metrics for opaque-transport connections (linkerd/linkerd2-proxy#1780)
* build(deps): bump tj-actions/changed-files from 23 to 23.1 (linkerd/linkerd2-proxy#1782)
* build(deps): bump derive_arbitrary from 1.1.2 to 1.1.3 (linkerd/linkerd2-proxy#1783)
* build(deps): bump arbitrary from 1.1.2 to 1.1.3 (linkerd/linkerd2-proxy#1784)
* inbound: Record TCP metrics for forwarded TLS connections (linkerd/linkerd2-proxy#1785)
* inbound: Cleanup in preparation for route policies #1781 (linkerd/linkerd2-proxy#1786)
* Add HTTP route matchers to support the Gateway API (linkerd/linkerd2-proxy#1787)
* build(deps): bump unicode-normalization from 0.1.19 to 0.1.20 (linkerd/linkerd2-proxy#1789)
* build(deps): bump linked-hash-map from 0.5.4 to 0.5.6 (linkerd/linkerd2-proxy#1790)
* build(deps): bump smallvec from 1.8.0 to 1.8.1 (linkerd/linkerd2-proxy#1791)
* build(deps): bump jemalloc-sys from 0.5.0+5.3.0 to 0.5.1+5.3.0-patched (linkerd/linkerd2-proxy#1792)
* Introduce per-route authorization policies (linkerd/linkerd2-proxy#1781)
* inbound: Add a header-modification route filter (linkerd/linkerd2-proxy#1793)
* docs: update justfile man page link (linkerd/linkerd2-proxy#1794)

Signed-off-by: Oliver Gould <[email protected]>
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Jun 30, 2022
This release updates the proxy's service discovery module to avoid
redundant load balancer updates that could cause unnecessary connection
churn.

This release also includes improvements to the proxy's retry handling of
gRPC requests. The proxy would not retry requests when a response's
status code was emitted in a TRAILERS frame. This has been fixed.

This release also includes a number of internal changes that set up for
per-route authorization. There should be no user-facing impact at this
point except for the introduction of additional metrics labels.

---

* build(deps): bump mio from 0.8.3 to 0.8.4 (linkerd/linkerd2-proxy#1760)
* build(deps): bump quote from 1.0.18 to 1.0.19 (linkerd/linkerd2-proxy#1761)
* build(deps): bump tower-service from 0.3.1 to 0.3.2 (linkerd/linkerd2-proxy#1762)
* build(deps): bump proc-macro2 from 1.0.39 to 1.0.40 (linkerd/linkerd2-proxy#1763)
* build(deps): bump syn from 1.0.96 to 1.0.98 (linkerd/linkerd2-proxy#1764)
* build(deps): bump prettyplease from 0.1.12 to 0.1.14 (linkerd/linkerd2-proxy#1766)
* build(deps): bump anyhow from 1.0.57 to 1.0.58 (linkerd/linkerd2-proxy#1767)
* dev: Update build settings (linkerd/linkerd2-proxy#1765)
* Dedupe discovery updates (linkerd/linkerd2-proxy#1759)
* build(deps): bump quote from 1.0.19 to 1.0.20 (linkerd/linkerd2-proxy#1768)
* deny: Remove tokio-util from exceptions (linkerd/linkerd2-proxy#1769)
* dev: Update memory contraints (linkerd/linkerd2-proxy#1770)
* Reorganize `server-policy` to set up for routes (linkerd/linkerd2-proxy#1771)
* inbound: Rename policy-enforcement layers (linkerd/linkerd2-proxy#1772)
* ci: Split fuzzer logic into a script (linkerd/linkerd2-proxy#1773)
* build(deps): bump prettyplease from 0.1.14 to 0.1.15 (linkerd/linkerd2-proxy#1775)
* build(deps): bump indexmap from 1.9.0 to 1.9.1 (linkerd/linkerd2-proxy#1776)
* integration: Cleanup test server (linkerd/linkerd2-proxy#1777)
* http-retry: Move the ReplayBody type into a module (linkerd/linkerd2-proxy#1778)
* inbound: Add route authorization labels (linkerd/linkerd2-proxy#1774)
* Rename HTTPRoutePermit to HttpRoutePermit (linkerd/linkerd2-proxy#1779)
* retry gRPC requests are immediately terminated by trailers (linkerd/linkerd2-proxy#1706)
* inbound: Record policy metrics for opaque-transport connections (linkerd/linkerd2-proxy#1780)
* build(deps): bump tj-actions/changed-files from 23 to 23.1 (linkerd/linkerd2-proxy#1782)
* build(deps): bump derive_arbitrary from 1.1.2 to 1.1.3 (linkerd/linkerd2-proxy#1783)
* build(deps): bump arbitrary from 1.1.2 to 1.1.3 (linkerd/linkerd2-proxy#1784)
* inbound: Record TCP metrics for forwarded TLS connections (linkerd/linkerd2-proxy#1785)
* inbound: Cleanup in preparation for route policies #1781 (linkerd/linkerd2-proxy#1786)
* Add HTTP route matchers to support the Gateway API (linkerd/linkerd2-proxy#1787)
* build(deps): bump unicode-normalization from 0.1.19 to 0.1.20 (linkerd/linkerd2-proxy#1789)
* build(deps): bump linked-hash-map from 0.5.4 to 0.5.6 (linkerd/linkerd2-proxy#1790)
* build(deps): bump smallvec from 1.8.0 to 1.8.1 (linkerd/linkerd2-proxy#1791)
* build(deps): bump jemalloc-sys from 0.5.0+5.3.0 to 0.5.1+5.3.0-patched (linkerd/linkerd2-proxy#1792)
* Introduce per-route authorization policies (linkerd/linkerd2-proxy#1781)
* inbound: Add a header-modification route filter (linkerd/linkerd2-proxy#1793)
* docs: update justfile man page link (linkerd/linkerd2-proxy#1794)

Signed-off-by: Oliver Gould <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Retries for gRPC services are not activated when the grpc-status field is present only in the response trailer

2 participants