HTTP 2 connection draining is non-graceful for low-volume listeners #14350
Title: HTTP 2 connection draining is non-graceful for low-volume listeners
Description:
When a listener enters a draining state, due to either a hot-restart or an LDS update, it should not accept new requests but should let existing requests finish gracefully. There is a drain-time limit that, when reached, terminates any open connections non-gracefully.
For HTTP 2, putting a connection into a draining state should be equivalent to Envoy sending a GOAWAY frame on the connection. This signals that no new streams should be created on the connection, but existing streams are allowed to finish. Assuming a well-behaved client (one that respects the GOAWAY), this should mirror the desired behavior for connection draining. However, Envoy does not send a GOAWAY proactively when the drain-time begins. Instead, Envoy issues a GOAWAY only after the next request made on the connection has completed. Represented visually, this would look like:
With an appropriately long drain-time for your traffic and a sufficiently busy listener, this delayed GOAWAY does not generally cause issues. However, if the listener is processing a low volume of long requests, it is possible to end up in the following scenario:
In this scenario the request does not begin until near the end of the drain-time window. Because the GOAWAY is not sent until the request ends, the drain-time expires while the request is still in flight. The request is then interrupted as the connection is closed non-gracefully, where non-graceful means 1) without a GOAWAY and 2) with in-flight requests.
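The two timelines described above can be sketched as a toy model. This is not Envoy code: the names `DrainOutcome` and `simulate_drain` are hypothetical, all times are seconds after the drain begins, and the model assumes a single connection, a single request issued after drain start, and a well-behaved client that moves to a new connection once it has seen a GOAWAY.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DrainOutcome:
    # Seconds after drain start at which GOAWAY is sent; None if never sent.
    goaway_sent_at: Optional[float]
    # True if the request is still in flight when the drain-time expires.
    request_interrupted: bool


def simulate_drain(drain_time_s, request_start_s, request_duration_s,
                   proactive_goaway):
    """Toy model of one draining HTTP 2 connection (hypothetical helper)."""
    if proactive_goaway:
        # Expected behavior: GOAWAY at drain start, so a well-behaved client
        # issues the request on a new connection and nothing is interrupted.
        return DrainOutcome(goaway_sent_at=0.0, request_interrupted=False)

    if request_start_s >= drain_time_s:
        # The connection was already closed at drain-time with no request on
        # it, so no GOAWAY was ever triggered and nothing was in flight.
        return DrainOutcome(goaway_sent_at=None, request_interrupted=False)

    request_end_s = request_start_s + request_duration_s
    if request_end_s <= drain_time_s:
        # Observed behavior, busy listener: GOAWAY follows the completed request.
        return DrainOutcome(goaway_sent_at=request_end_s,
                            request_interrupted=False)

    # Observed behavior, low-volume listener: the drain-time expires first.
    return DrainOutcome(goaway_sent_at=None, request_interrupted=True)
```

With the numbers from the repro steps, `simulate_drain(20, 10, 30, proactive_goaway=False)` reports an interrupted request and no GOAWAY, while the same call with `proactive_goaway=True` reports a GOAWAY at time 0 and no interruption.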
An interrupted request is logged into the access log with the DC flag and will return a 503 response. If the downstream is another Envoy instance, then the downstream will have an access log with a UC flag.
I would expect Envoy to issue a GOAWAY (NO_ERROR error-code) at the beginning of the drain-period, without the external trigger of a request, as this already matches the desired behavior of Envoy's connection-draining. I suspect the current implementation is an artifact of how connections are closed for persistent HTTP 1.
Repro steps:
This has been reproduced in our internal integration-test suite, but can be reproduced readily with the following:
- Configure a drain-time of 20 seconds
- Configure a parent-shutdown time of 25 seconds
- Start Envoy
- Create a client (either H1 or H2) and generate some traffic to ensure established connections
- Begin a reload or perform an LDS update
- Issue a request over the same connection that:
  - Begins 10s after the reload/LDS-update was initiated
  - Lasts for 30s (upstream service sleeps 30s before responding)
- Observe non-graceful connection termination
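
For the upstream service in the steps above, a minimal stand-in could look like the following sketch. It uses only the Python standard library. The helper name `make_slow_server`, the port 8080, and the response body are assumptions for illustration, not part of the original test suite.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_server(port=8080, delay_s=30.0):
    """Build an HTTP/1.1 upstream that sleeps delay_s seconds before replying."""

    class SlowHandler(BaseHTTPRequestHandler):
        protocol_version = "HTTP/1.1"

        def do_GET(self):
            # Hold the request open to mimic a long-running upstream call.
            time.sleep(delay_s)
            body = b"done\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            # Suppress the default per-request logging to stderr.
            pass

    return ThreadingHTTPServer(("127.0.0.1", port), SlowHandler)
```

Calling `make_slow_server().serve_forever()` starts the upstream on 127.0.0.1:8080. The Envoy cluster for the listener under test would then point at that address.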
All tests were done using a concurrency of 1 to ensure a single listener/connection.
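
The timing and concurrency settings above map onto Envoy's command-line flags `--drain-time-s`, `--parent-shutdown-time-s`, `--concurrency`, and `--restart-epoch`. The sketch below only builds the argument list. The helper name `envoy_argv` and the config path `envoy.yaml` are hypothetical, and it assumes an `envoy` binary on the `PATH`.

```python
def envoy_argv(config_path, restart_epoch=0, drain_time_s=20,
               parent_shutdown_time_s=25, concurrency=1):
    """Return an Envoy command line matching the repro settings (sketch)."""
    return [
        "envoy",
        "-c", config_path,
        # 0 for the first process; incremented by one for each hot-restart.
        "--restart-epoch", str(restart_epoch),
        "--drain-time-s", str(drain_time_s),
        "--parent-shutdown-time-s", str(parent_shutdown_time_s),
        # A single worker thread, so a single listener/connection.
        "--concurrency", str(concurrency),
    ]
```

For example, `subprocess.Popen(envoy_argv("envoy.yaml"))` would start the initial process, and a second `subprocess.Popen(envoy_argv("envoy.yaml", restart_epoch=1))` would begin the reload that puts the first process into its drain period.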

