HTTP2 proactive GOAWAY on drain-start #16201
murray-stripe wants to merge 116 commits into envoyproxy:main from
Conversation
Signed-off-by: John Murray <[email protected]>
antoniovicente
left a comment
Thanks for the improvements to the drain sequence. Here's some early feedback on your WIP changes.
murray-stripe
left a comment
@antoniovicente Thank you for the review!! 😄
I've added replies to all of your comments to either add a follow-up action (I'll be working through these) or to answer a question or start a discussion.
@murray-stripe sorry I've lost the thread on this. Are you ready to have this looked at again? Can you fix CI? /wait
@mattklein123 This change is ready for re-review. I'm just currently dealing with a test timeout that I can't repro locally; I'm assuming it's a flake. Update: it was a flake, feel free to proceed with a review.
mattklein123
left a comment
Thanks, generally LGTM. Flushing out some more comments.
This change is sufficiently scary that I would appreciate a smoke test. Can you please deploy this change at Stripe in some test capacity to make sure there are no obvious crashes/issues?
/wait
  enum DrainStrategy {
-   // Gradually discourage connections over the course of the drain period.
+   // Gradually discourage connections over the beginning course of the drain period, discouraging
+   // all connections by the time 25% of the drain-time has expired.
Ping on this. Can you make this comment more robust to explain how you came up with 25% and/or link to the arch overview docs if it is explained there.
  additional consideration for draining. "Draining" is the state in which components may discourage
  new connections and signal on existing connections the intent to terminate. "Completed Draining"
  is the final state of the drain-process in which components will refuse new connections and
  terminate any remaining connections (read more at :ref:`Draining<arch_overview_draining>`).
nit: should this just be part of the other draining docs? Seems closely related? Just a section there?
  for "gRPC config stream closed" is now reduced to debug when the status is ``Ok`` or has been
  retriable (``DeadlineExceeded``, ``ResourceExhausted``, or ``Unavailable``) for less than 30
  seconds.
* grpc: gRPC async client can be cached and shared across filter instances in the same thread, this feature is turned off by default, can be turned on by setting runtime guard ``envoy.reloadable_features.enable_grpc_async_client_cache`` to true.
* http: connection draining is now proactive and does not require traffic to trigger graceful draining. This
ref link to arch overview / enum.
  // Register callback for drain-close events.
  if (use_proactive_draining_) {
    start_drain_cb_ = drain_close_.addOnDrainCloseCb([this](std::chrono::milliseconds drain_delay) {
      // de-register callback since we only want this to fire once
nit: all comments start with capital, end with period, proper grammar, etc. please check the diff.
  void ConnectionManagerImpl::createStartDrainTimer(std::chrono::milliseconds drain_delay) {
    if (!codec_) {
      stats_.named_.downstream_cx_drain_close_.inc();
      doConnectionClose(Network::ConnectionCloseType::FlushWrite, absl::nullopt, "");
IMO handling this case isn't worth it. I would just let it get handled by the common code / spread out naturally.
      start_drain_cb_.reset();

    // create timer to _begin_ draining
    stats_.named_.downstream_cx_drain_close_.inc();
Is there a chance that this can get incremented twice? I see it's still incremented in the legacy path. If not, please add comments in both places explaining why.
    // prevent any new streams.
    connection_manager_.startDrainSequence();
    connection_manager_.stats_.named_.downstream_cx_drain_close_.inc();
    ENVOY_STREAM_LOG(debug, "drain closing connection", *this);
Why was this removed but the stat wasn't? related to my comment above about possible double stat increments.
  // Network::DrainDecision
  // TODO(junr03): hook up draining to listener state management.
  bool drainClose() const override { return false; }
  Common::CallbackHandlePtr addOnDrainCloseCb(DrainCloseCb) const override { return nullptr; }
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Haven't forgotten about this PR, but will have to circle back around in Q1 due to internal reprioritization in order to provide a sufficient "bake" in Stripe's environment. I have done some initial testing, but would like to do some more sustained testing to provide better assurance of the safety of the change. |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
@murray-stripe any plans to continue this work? I'd be happy to test this in my environment too; we're dealing with the same problem.
…reamble (#17026)" (#37465) This reverts commit 96dd735. envoyproxy/envoy@96dd735 was a precursor to envoyproxy/envoy#16201 which never landed. Unwinding the change envoyproxy/envoy-mobile#176 Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: John Murray [email protected]
Commit Message:
Change the default behavior of draining from a pull-based to a push-based model, proactively notifying interested parties when a drain sequence has begun. For low-volume listeners/connections, this allows for a graceful termination by proactively sending GOAWAY signals. This change resolves #14350.
This change relies on the concept of "drain trees" (introduced in #17026), where the drain-manager exists as either a parent or child drain-manager. This allows drain actions to cascade downward and also allows for independent "sub-tree" draining (e.g. draining a specific Listener or specific FilterChain).
Additional Description:
Risk Level: moderate
Testing: Additional test coverage added for the HTTP ConnectionManagerImpl as well as the DrainManagerImpl. Many tests updated to reflect changes in behavior (push vs pull).
Docs Changes: Updated comments on DrainStrategy to convey some specifics of timing in how draining is initiated.
Release Notes:
Fixes #14350