HTTP2 proactive GOAWAY on drain-start #16201
murray-stripe wants to merge 116 commits into envoyproxy:main from
Conversation
Signed-off-by: John Murray <[email protected]>
antoniovicente
left a comment
Thanks for the improvements to the drain sequence. Here's some early feedback on your WIP changes.
murray-stripe
left a comment
@antoniovicente Thank you for the review!! 😄
I've added replies to all of your comments to either add a follow-up action (I'll be working through these) or to answer a question or start a discussion.
@murray-stripe sorry I've lost the thread on this. Are you ready to have this looked at again? Can you fix CI? /wait
@mattklein123 This change is ready for re-review. I'm just currently dealing with a test timeout that I can't repro locally; I'm assuming it's a flake. Update: it was a flake, feel free to proceed with a review.
mattklein123
left a comment
Thanks, generally LGTM. Flushing out some more comments.
This change is sufficiently scary that I would appreciate a smoke test. Can you please deploy this change at Stripe in some test capacity to make sure there are no obvious crashes/issues?
/wait
  enum DrainStrategy {
-   // Gradually discourage connections over the course of the drain period.
+   // Gradually discourage connections over the beginning course of the drain period, discouraging
+   // all connections by the time 25% of the drain-time has expired.
Ping on this. Can you make this comment more robust to explain how you came up with 25% and/or link to the arch overview docs if it is explained there.
  additional consideration for draining. "Draining" is the state in which components may discourage
  new connections and signal on existing connections the intent to terminate. "Completed Draining"
  is the final state of the drain-process in which components will refuse new connections and
  terminate any remaining connections (read more at :ref:`Draining<arch_overview_draining>`).
nit: should this just be part of the other draining docs? Seems closely related? Just a section there?
  for "gRPC config stream closed" is now reduced to debug when the status is ``Ok`` or has been
  retriable (``DeadlineExceeded``, ``ResourceExhausted``, or ``Unavailable``) for less than 30
  seconds.
* grpc: gRPC async client can be cached and shared across filter instances in the same thread, this feature is turned off by default, can be turned on by setting runtime guard ``envoy.reloadable_features.enable_grpc_async_client_cache`` to true.
* http: connection draining is now proactive and does not require traffic to trigger graceful draining. This
ref link to arch overview / enum.
  // Register callback for drain-close events.
  if (use_proactive_draining_) {
    start_drain_cb_ = drain_close_.addOnDrainCloseCb([this](std::chrono::milliseconds drain_delay) {
      // de-register callback since we only want this to fire once
nit: all comments start with capital, end with period, proper grammar, etc. please check the diff.
  void ConnectionManagerImpl::createStartDrainTimer(std::chrono::milliseconds drain_delay) {
    if (!codec_) {
      stats_.named_.downstream_cx_drain_close_.inc();
      doConnectionClose(Network::ConnectionCloseType::FlushWrite, absl::nullopt, "");
IMO handling this case isn't worth it. I would just let it get handled by the common code / spread out naturally.
      start_drain_cb_.reset();

    // create timer to _begin_ draining
    stats_.named_.downstream_cx_drain_close_.inc();
Is there a chance that this can get incremented twice? I see it's still incremented in the legacy path. If not, please add comments in both places explaining why.
    // prevent any new streams.
    connection_manager_.startDrainSequence();
    connection_manager_.stats_.named_.downstream_cx_drain_close_.inc();
    ENVOY_STREAM_LOG(debug, "drain closing connection", *this);
Why was this removed but the stat wasn't? related to my comment above about possible double stat increments.
  // Network::DrainDecision
  // TODO(junr03): hook up draining to listener state management.
  bool drainClose() const override { return false; }
  Common::CallbackHandlePtr addOnDrainCloseCb(DrainCloseCb) const override { return nullptr; }
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
Haven't forgotten about this PR, but will have to circle back around in Q1 due to internal reprioritization in order to provide a sufficient "bake" in Stripe's environment. I have done some initial testing, but would like to do some more sustained testing to provide better assurance of the safety of the change. |
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions! |
@murray-stripe any plans to continue this work? I'd be happy to test this in my environment too; we're dealing with the same problem.
…reamble (#17026)" (#37465) This reverts commit 96dd735. envoyproxy/envoy@96dd735 was a precursor to envoyproxy/envoy#16201 which never landed. Unwinding the change envoyproxy/envoy-mobile#176 Signed-off-by: Alyssa Wilk <[email protected]>
Signed-off-by: John Murray [email protected]
Commit Message:
Change the default behavior of draining from a pull-based to a push-based model, proactively notifying interested parties when a drain sequence has begun. For low-volume listeners/connections, this allows for a graceful termination by proactively sending GOAWAY signals. This change resolves #14350.
This change relies on the concept of "drain trees" (introduced in #17026), where the drain-manager exists as either a parent or child drain-manager. This allows drain actions to cascade downward and also allows for independent "sub-tree" draining (e.g. draining a specific Listener or specific FilterChain).
Additional Description:
Risk Level: moderate
Testing: Additional test coverage added for the HTTP ConnectionManagerImpl as well as the DrainManagerImpl. Many tests updated to reflect changes in behavior (push vs pull).
Docs Changes: Updated comments on DrainStrategy to convey some specifics of timing in how draining is initiated.
Release Notes:
Fixes #14350