listener manager: handle listener creation failures on workers#1257
mattklein123 merged 5 commits into master
Conversation
This commit properly handles listener creation failures on worker threads, specifically the race condition that can happen between bind() and listen(). (To be perfectly honest I don't understand how this can happen at the OS level, but we have seen it in practice.) This turned into somewhat of a complicated change. It does the following things:
1) Adds an add-listener completion callback which can tell the listener manager whether a creation failed. As part of this work, locking was replaced with posts inside the listener manager, which makes synchronization simpler.
2) The failed listener is removed, in the hope that a future addition might succeed.
3) Listeners are now always added to workers via post.
4) Because of (3), race conditions appeared in the integration tests. The integration startup code needed to be refactored so that we now wait for all listeners to be up before starting to run the tests.
5) Because of (4), an issue was exposed in the TCP proxy integration test: it tried to handle a response from the client SSL auth filter, which happens before the listeners are actually up. To fix this, I removed the code in the tests trying to handle client SSL auth filter calls. I also replaced all "random" upstream clusters with a host pointing at reserved port 4, which should always fail in any reasonable case. This still has value in testing startup with failure, but won't block startup. If we want to bring back testing of initial startup things like SDS, RDS, etc. we will need to revisit how this works, but IMO this is fine for now.
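The post-plus-completion flow described in (1) and (3) can be sketched roughly as below. This is a minimal single-threaded mock with hypothetical names (`Dispatcher`, `workerAddListener`, `ListenerState`), not Envoy's actual classes: the point is that workers never touch listener state directly, they post results back to the main dispatcher, so no mutex is needed around that state.

```cpp
// Hypothetical sketch (not Envoy's real API) of "replace locking with posts":
// all mutation of listener state happens on the main dispatcher, so workers
// post their creation results back instead of taking a lock.
#include <deque>
#include <functional>
#include <utility>

class Dispatcher {
public:
  void post(std::function<void()> cb) { queue_.push_back(std::move(cb)); }

  // Drain the queue; a real event loop would run continuously.
  void run() {
    while (!queue_.empty()) {
      auto cb = std::move(queue_.front());
      queue_.pop_front();
      cb();
    }
  }

private:
  std::deque<std::function<void()>> queue_;
};

// State owned exclusively by the main thread; only posted callbacks touch it.
struct ListenerState {
  int successes = 0;
  int failures = 0;
};

// A worker reports listener-creation success/failure by posting to the main
// dispatcher rather than mutating shared state under a lock.
void workerAddListener(Dispatcher& main_dispatcher, ListenerState& state,
                       bool created_ok) {
  main_dispatcher.post([&state, created_ok] {
    if (created_ok) {
      state.successes++;
    } else {
      state.failures++;
    }
  });
}
```

Note that nothing happens until the dispatcher drains its queue, which is also why the integration tests now have to wait for listeners to be up before running.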
@lyft/network-team @htuch @dnoe @alyssawilk this is going to conflict with your flow control change in the tests. Also LMK what you think about that part of the change. We can get your change in first and then I can merge master into this.
```cpp
request->waitForEndStream(*dispatcher_);
request->encodeHeaders(Http::TestHeaderMapImpl{{":status", "200"}}, false);
request->encodeData(0, true);
},
```
Is this the conflict you meant?
If so please feel free to check in ASAP, because it is way cleaner and I'm happy to do the merges.
Sure either way. We can take this through review and merge first if you want. I agree that as long as you don't have a problem with the change it does make the tests substantially easier to grok.
htuch
left a comment
Will take a look in a bit. Have you verified it's flake-free with --runs_per_test=1000? Asking as you seem to have uncovered a bunch of concurrency issues in the integration tests; just wanted to check there aren't any more.
FWIW 1k is sadly nowhere near enough - I've spent most of the day debugging a 1:20k failure which I bet this is the fix to. (I got as far as figuring out that the SSL auth code was getting a bad FD, which aligns well with the idea that not everything is set up right.)
Not exactly, it's really an ordering issue. The old code would add listeners on the startup thread. This no longer happens, so new synchronization is needed. This also broke the TCP proxy tests. However, I will run a long test cycle locally to make sure it works over multiple iterations. @alyssawilk I have no doubt there are still some edge cases in the existing code.
htuch
left a comment
LGTM, but I haven't given the test much thought yet.
```cpp
if (!success && !listener.onListenerCreateFailure()) {
  // TODO(mattklein123): In addition to a critical log and a stat, we should consider adding
  //                     a startup option here to cause the server to exit. I think we
  //                     probably want this at Lyft but I will do it in a follow up.
```
I think we would also want to fail a dry run canary here too.
I don't think we would actually add a listener all the way to this point in a dry run situation? It's not completely clear to me what happens in a dry run (I saw your other comment). I think we should track this separately?
```cpp
/**
 * Called when a listener failed to be actually created on a worker.
 * @return TRUE if this is not the first create failure (via multiple workers).
```
@return TRUE if we have seen more than one worker failure
```diff
 "type": "static",
 "lb_type": "round_robin",
-"hosts": [{"url": "tcp://{{ ip_loopback_address }}:12000"}]
+"hosts": [{"url": "tcp://{{ ip_loopback_address }}:4"}]
```
Why this change? Are we making meaningful use of these?
There is no great way to put a comment about this, but it does do more than nothing. Basically it verifies real flow of connecting to various things and then immediately failing. So it verifies that init flow doesn't get blocked, no crashes, etc. I admit it's not perfect but I don't think it's worth deleting. LMK what you think and if you want me to document somewhere.
OK, so just to clarify, you changed these values to 4 here to ensure that they really would fail and not conflict with something on 12000? Might be useful to have this documented in some JSON comments..
Yes. There are no JSON comments. :( Any other place I can put this?
Maybe don't worry then, I think anywhere else would just be confusing to the reader. I think this will be better in v2 - we won't be using manually crafted JSON configs for integration tests, but protobuf objects that we build dynamically for each different integration test. Then it can be in the code where we add these objects.
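The assumption in this thread (nothing listens on reserved port 4, so a loopback connect fails immediately rather than hanging) can be sanity-checked with plain POSIX sockets. `connectFails` is a hypothetical helper written for illustration; the exact errno (typically ECONNREFUSED on Linux) and timing are environment-dependent.

```cpp
// Sanity check for the port-4 assumption discussed above: a blocking connect
// to a loopback port with no listener should fail fast (usually ECONNREFUSED)
// instead of blocking startup. POSIX sockets, Linux-style; illustrative only.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <cstring>

// Returns true if the connect attempt failed (the expected outcome here).
bool connectFails(const char* ip, uint16_t port) {
  const int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return true;  // treat socket() failure as a failed connection attempt
  }
  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(port);
  inet_pton(AF_INET, ip, &addr.sin_addr);
  const bool failed =
      connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0;
  close(fd);
  return failed;
}
```

If something did listen on port 12000 in the test environment, the old configs could accidentally succeed; port 4 requires root to bind, which is why it "should always fail in any reasonable case".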
test/integration/server.cc
Outdated
```cpp
  on_worker_listener_added_cb_();
}

std::unique_lock<std::mutex> gaurd(listeners_mutex_);
```
Nit: s/gaurd/guard/ (or is it gourd? :) )
```cpp
// avoid locking.
server_.dispatcher().post([this, success, &listener]() -> void {
  // It is theoretically possible for a listener to get added on 1 worker but not the others.
  // The following code will yield 1 critical log and 1 stat. It will also cause listener
```
I would rephrase this as it kind of confused me on the first pass, something to the effect of "The below check with onListenerCreateFailure() is there to ensure we execute the removal/logging/stats at most once on failure."
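The at-most-once removal/logging semantics being discussed can be sketched as below. The names (`ListenerManager`, `onWorkerAddCompletion`) are hypothetical stand-ins, not the real Envoy code: the first failing completion removes the listener, and later completions for the same listener are no-ops.

```cpp
// Hypothetical sketch of at-most-once failure handling across workers:
// several workers may report failure for one listener, but removal (and the
// critical log/stat in the real code) must happen only once.
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

struct Listener {
  explicit Listener(std::string n) : name(std::move(n)) {}
  std::string name;
  bool failure_seen{false};

  // Returns true if a failure was already reported by another worker.
  bool onListenerCreateFailure() {
    const bool seen = failure_seen;
    failure_seen = true;
    return seen;
  }
};

class ListenerManager {
public:
  void addListener(const std::string& name) { listeners_.push_back(Listener(name)); }

  // Each worker's completion runs on the main thread (via post in the real
  // code). The onListenerCreateFailure() guard makes removal at-most-once.
  void onWorkerAddCompletion(const std::string& name, bool success) {
    auto it = std::find_if(listeners_.begin(), listeners_.end(),
                           [&](const Listener& l) { return l.name == name; });
    if (it == listeners_.end()) {
      return;  // already removed after an earlier failure
    }
    if (!success && !it->onListenerCreateFailure()) {
      listeners_.erase(it);  // critical log + stat would also happen here
    }
  }

  bool hasListener(const std::string& name) const {
    return std::any_of(listeners_.begin(), listeners_.end(),
                       [&](const Listener& l) { return l.name == name; });
  }

private:
  std::vector<Listener> listeners_;
};
```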
@htuch updated
```diff
 public:
   /**
-   * Set the conditional to ready, should only be called once.
+   * Set the conditional to ready.
```
Maybe explain that waitReady() resets, but in between waitReady(), setReady() should only be called once.
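A minimal sketch of the setReady()/waitReady() contract being discussed, assuming a standard condition-variable implementation (the `ready()` accessor is added here purely for illustration and is not part of the real class): waitReady() blocks until setReady() has been called and then resets the flag, so in between two waitReady() calls, setReady() should be called exactly once.

```cpp
// Illustrative sketch (assumed implementation, not Envoy's actual source) of
// a conditional initializer: waitReady() consumes a setReady() and resets, so
// the object can be reused across multiple ready/wait cycles.
#include <condition_variable>
#include <mutex>

class ConditionalInitializer {
public:
  // Signal readiness; should be called exactly once per waitReady().
  void setReady() {
    std::lock_guard<std::mutex> guard(mutex_);
    ready_ = true;
    cv_.notify_all();
  }

  // Block until setReady() has been called, then reset for reuse.
  void waitReady() {
    std::unique_lock<std::mutex> guard(mutex_);
    cv_.wait(guard, [this] { return ready_; });
    ready_ = false;  // reset so the next waitReady() needs a fresh setReady()
  }

  // Illustration-only accessor so the reset behavior can be observed.
  bool ready() {
    std::lock_guard<std::mutex> guard(mutex_);
    return ready_;
  }

private:
  std::condition_variable cv_;
  std::mutex mutex_;
  bool ready_{false};
};
```

The predicate form of `cv_.wait` guards against spurious wakeups, and the flag reset inside the lock is what makes calling setReady() twice before one waitReady() a silent bug — hence the comment suggestion above.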
@htuch comment updated