
listener manager: handle listener creation failures on workers#1257

Merged

mattklein123 merged 5 commits into master from listen_failure on Jul 14, 2017

Conversation

@mattklein123
Member

This commit properly handles listener creation failures on worker threads,
specifically the race condition that can happen between bind() and
listen(). (To be perfectly honest I don't understand how this can happen
at the OS level, but we have seen it in practice.)

This turned into somewhat of a complicated change. It does the following
things:

  1. Add an add-listener completion callback which can tell the listener
    manager whether a creation failed. As part of this work, locking was
    replaced with posts inside the listener manager, which makes
    synchronization simpler.
  2. The failed listener is removed, in the hope that a future addition
    of it might succeed at that point.
  3. Listeners are now always added to workers via post.
  4. Because of (3), race conditions appeared in the integration tests.
    The integration startup code needed to be refactored so that we now
    wait for all listeners to be up before starting to run the tests.
  5. Because of (4), an issue was exposed in the TCP proxy integration
    test: it tried to handle a response from the client SSL auth filter
    before the listeners were actually up. To fix this, I removed the
    code in the tests that tried to handle client SSL auth filter calls.
    I also replaced all "random" upstream clusters with a host pointing
    at reserved port 4, which should always fail in any reasonable case.
    This still has value in testing startup with failure, but won't block
    startup. If we want to bring back testing of initial startup things
    like SDS, RDS, etc. we will need to revisit how this works, but IMO
    this is fine for now.

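Points (1) and (2) above can be illustrated with a minimal, single-threaded sketch. Names such as `AddListenerCompletion`, `Worker::addListener`, and `ListenerManager` are illustrative here, not Envoy's actual API; in the real server the completion is posted back across dispatcher threads rather than invoked inline:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// The worker reports success/failure of listener creation back to the
// manager through a completion callback instead of sharing locked state.
using AddListenerCompletion = std::function<void(bool success)>;

struct Worker {
  // Simulate listen() failing on this worker (e.g. the bind()/listen() race).
  bool listen_fails = false;

  void addListener(const std::string& /*name*/, AddListenerCompletion completion) {
    completion(!listen_fails);
  }
};

struct ListenerManager {
  std::vector<std::string> active_;

  void addListener(const std::string& name, std::vector<Worker>& workers) {
    active_.push_back(name);
    for (Worker& w : workers) {
      w.addListener(name, [this, name](bool success) {
        if (!success) {
          // Remove the failed listener so a future addition may succeed.
          active_.erase(std::remove(active_.begin(), active_.end(), name),
                        active_.end());
        }
      });
    }
  }
};
```

If any worker reports failure, the listener is dropped from the active set; a later add of the same listener starts fresh.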
@mattklein123
Member Author

@lyft/network-team @htuch @dnoe @alyssawilk

@alyssawilk this is going to conflict with your flow control change in the tests. Also LMK what you think about that part of the change. We can get your change in first and then I can merge master into this.

request->waitForEndStream(*dispatcher_);
request->encodeHeaders(Http::TestHeaderMapImpl{{":status", "200"}}, false);
request->encodeData(0, true);
},
Contributor

Is this the conflict you meant?
If so please feel free to check in ASAP, because it is way cleaner and I'm happy to do the merges.

Member Author

Sure either way. We can take this through review and merge first if you want. I agree that as long as you don't have a problem with the change it does make the tests substantially easier to grok.

Member

@htuch htuch left a comment

Will take a look in a bit, have you verified flake free with --runs_per_test=1000? Asking as you seem to have uncovered a bunch of concurrency issues in the integration tests, just wanted to check there aren't any more.

@alyssawilk
Contributor

alyssawilk commented Jul 13, 2017 via email

@mattklein123
Member Author

Asking as you seem to have uncovered a bunch of concurrency issues in the integration tests, just wanted to check there aren't any more.

Not exactly, it's really an ordering issue. The old code would add listeners on the startup thread. This no longer happens, so new sync is needed. This also broke the TCP proxy tests. However I will run a long test cycle locally to make sure it works with multiple iterations. @alyssawilk I have no doubt there are still some edge cases in the existing code.

Member

@htuch htuch left a comment

LGTM, but I haven't given the test much thought yet.

if (!success && !listener.onListenerCreateFailure()) {
// TODO(mattklein123): In addition to a critical log and a stat, we should consider adding
// a startup option here to cause the server to exit. I think we
// probably want this at Lyft but I will do it in a follow up.
Member

I think we would also want to fail a dry run canary here too.

Member Author

I don't think we would actually add a listener all the way to this point in a dry run situation? It's not completely clear to me what happens in a dry run (I saw your other comment). I think we should track this separately?


/**
* Called when a listener failed to be actually created on a worker.
* @return TRUE if this is not the first create failure (via multiple workers).
Member

@return TRUE if we have seen more than one worker failure

"type": "static",
"lb_type": "round_robin",
"hosts": [{"url": "tcp://{{ ip_loopback_address }}:12000"}]
"hosts": [{"url": "tcp://{{ ip_loopback_address }}:4"}]
Member

Why this change? Are we making meaningful use of these?

Member Author

There is no great way to put a comment about this, but it does do more than nothing. Basically it verifies real flow of connecting to various things and then immediately failing. So it verifies that init flow doesn't get blocked, no crashes, etc. I admit it's not perfect but I don't think it's worth deleting. LMK what you think and if you want me to document somewhere.

Member

OK, so just to clarify, you changed these values to 4 here to ensure that they really would fail and not conflict with something on 12000? Might be useful to have this documented in some JSON comments..

Member Author

Yes. There are no JSON comments. :( Any other place I can put this?

Member

Maybe don't worry then, I think anywhere else would just be confusing to the reader. I think this will be better in v2 - we won't be using manually crafted JSON configs for integration tests, but protobuf objects that we build dynamically for each different integration test. Then it can be in the code where we add these objects.

on_worker_listener_added_cb_();
}

std::unique_lock<std::mutex> gaurd(listeners_mutex_);
Member

Nit: s/gaurd/guard/ (or is it gourd? :) )

// avoid locking.
server_.dispatcher().post([this, success, &listener]() -> void {
// It is theoretically possible for a listener to get added on 1 worker but not the others.
// The following code will yield 1 critical log and 1 stat. It will also cause listener
Member

I would rephrase this as it kind of confused me on the first pass, something to the effect of "The below check with onListenerCreateFailure() is there to ensure we execute the removal/logging/stats at most once on failure."
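The at-most-once behavior being discussed can be sketched in isolation. This is a simplified stand-in, not Envoy's actual `ActiveListener` implementation; only the `onListenerCreateFailure()` name comes from the quoted diff, and in the real code the check runs on the main thread after a dispatcher post rather than relying on an atomic:

```cpp
#include <atomic>

struct ListenerFailureState {
  std::atomic<bool> saw_failure_{false};

  // Returns true if a failure was already recorded, i.e. this is NOT the
  // first worker to report one.
  bool onListenerCreateFailure() { return saw_failure_.exchange(true); }
};

// Mirrors the quoted check: act only on the first failure so that the
// removal/critical-log/stat path executes at most once, even when several
// workers fail to create the same listener.
void handleWorkerAddResult(ListenerFailureState& listener, bool success,
                           int& removal_count) {
  if (!success && !listener.onListenerCreateFailure()) {
    ++removal_count;
  }
}
```

However many workers report failure, the removal path runs exactly once.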

@mattklein123
Member Author

@htuch updated

public:
/**
* Set the conditional to ready, should only be called once.
* Set the conditional to ready.
Member

Maybe explain that waitReady() resets, but in between waitReady(), setReady() should only be called once.

htuch
htuch previously approved these changes Jul 14, 2017
@mattklein123
Member Author

@htuch comment updated

@mattklein123 mattklein123 merged commit ea3b7c6 into master Jul 14, 2017
@mattklein123 mattklein123 deleted the listen_failure branch July 14, 2017 18:27
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: introduces an integration test harness for the platform bridge filter that enables testing scenarios with multiple filters and their interactions with the filter manager. It also introduces a test case for currently crashing behavior; a subsequent PR will fix and uncomment the test.
Risk Level: low - introduces new testing infrastructure.
Testing: new test framework, but with a commented-out test case.

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: introduces an integration test harness for the platform bridge filter that enables testing scenarios with multiple filters and their interactions with the filter manager. It also introduces a test case for currently crashing behavior; a subsequent PR will fix and uncomment the test.
Risk Level: low - introduces new testing infrastructure.
Testing: new test framework, but with a commented-out test case.

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>
mathetake pushed a commit that referenced this pull request Mar 3, 2026
**Description**
Closes envoyproxy/ai-gateway#1149

---------

Signed-off-by: Ion Koutsouris <[email protected]>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

This fixes the broken table in #1257.

Signed-off-by: Takeshi Yoneda <[email protected]>
3 participants