
listener manager: handle listener creation failures on workers#1257

Merged

mattklein123 merged 5 commits into master from listen_failure on Jul 14, 2017

Conversation

@mattklein123
Member

This commit properly handles listener creation failures on worker threads,
specifically the race condition that can happen between bind() and
listen(). (To be perfectly honest I don't understand how this can happen
at the OS level, but we have seen it in practice.)

This turned into somewhat of a complicated change. It does the following
things:

  1. Add an add-listener completion callback which can tell the listener
    manager whether a creation failed. As part of this work, locking was
    replaced with posts inside the listener manager, which makes
    synchronization simpler.
  2. The failed listener is removed, in the hope that a future addition
    of it might succeed at that point.
  3. Listeners are now always added to workers via post.
  4. Because of (3), race conditions appeared in the integration tests.
    The integration startup code needed to be refactored so that we now
    wait for all listeners to be up before starting to run the tests.
  5. Because of (4), an issue was exposed in the TCP proxy integration
    test: it tried to handle a response from the client SSL auth filter
    before the listeners were actually up. To fix this, I removed the
    code in the tests that tried to handle client SSL auth filter calls.
    I also replaced all "random" upstream clusters with a host pointing
    at reserved port 4, which should always fail in any reasonable case.
    This still has value in testing startup with failure, but won't block
    startup. If we want to bring back testing of initial startup things
    like SDS, RDS, etc. we will need to revisit how this works, but IMO
    this is fine for now.

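Points (1) and (2) above can be illustrated with a minimal, single-threaded sketch. Names such as `AddListenerCompletion`, `Worker::addListener`, and `ListenerManager` are illustrative here, not Envoy's actual API; in the real server the completion is posted back across dispatcher threads rather than invoked inline:

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// The worker reports success/failure of listener creation back to the
// manager through a completion callback instead of sharing locked state.
using AddListenerCompletion = std::function<void(bool success)>;

struct Worker {
  // Simulate listen() failing on this worker (e.g. the bind()/listen() race).
  bool listen_fails = false;

  void addListener(const std::string& /*name*/, AddListenerCompletion completion) {
    completion(!listen_fails);
  }
};

struct ListenerManager {
  std::vector<std::string> active_;

  void addListener(const std::string& name, std::vector<Worker>& workers) {
    active_.push_back(name);
    for (Worker& w : workers) {
      w.addListener(name, [this, name](bool success) {
        if (!success) {
          // Remove the failed listener so a future addition may succeed.
          active_.erase(std::remove(active_.begin(), active_.end(), name),
                        active_.end());
        }
      });
    }
  }
};
```

If any worker reports failure, the listener is dropped from the active set; a later add of the same listener starts fresh.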
@mattklein123
Member Author

@lyft/network-team @htuch @dnoe @alyssawilk

@alyssawilk this is going to conflict with your flow control change in the tests. Also LMK what you think about that part of the change. We can get your change in first and then I can merge master into this.

request->waitForEndStream(*dispatcher_);
request->encodeHeaders(Http::TestHeaderMapImpl{{":status", "200"}}, false);
request->encodeData(0, true);
},
Contributor

Is this the conflict you meant?
If so please feel free to check in ASAP, because it is way cleaner and I'm happy to do the merges.

Member Author

Sure either way. We can take this through review and merge first if you want. I agree that as long as you don't have a problem with the change it does make the tests substantially easier to grok.

Member

@htuch htuch left a comment

Will take a look in a bit, have you verified flake free with --runs_per_test=1000? Asking as you seem to have uncovered a bunch of concurrency issues in the integration tests, just wanted to check there aren't any more.

@alyssawilk
Contributor

alyssawilk commented Jul 13, 2017 via email

@mattklein123
Member Author

Asking as you seem to have uncovered a bunch of concurrency issues in the integration tests, just wanted to check there aren't any more.

Not exactly, it's really an ordering issue. The old code would add listeners on the startup thread. This no longer happens, so new sync is needed. This also broke the TCP proxy tests. However I will run a long test cycle locally to make sure it works with multiple iterations. @alyssawilk I have no doubt there are still some edge cases in the existing code.

Member

@htuch htuch left a comment

LGTM, but I haven't given the test much thought yet.

if (!success && !listener.onListenerCreateFailure()) {
// TODO(mattklein123): In addition to a critical log and a stat, we should consider adding
// a startup option here to cause the server to exit. I think we
// probably want this at Lyft but I will do it in a follow up.
Member

I think we would also want to fail a dry run canary here too.

Member Author

I don't think we would actually add a listener all the way to this point in a dry run situation? It's not completely clear to me what happens in a dry run (I saw your other comment). I think we should track this separately?


/**
* Called when a listener failed to be actually created on a worker.
* @return TRUE if this is not the first create failure (via multiple workers).
Member

@return TRUE if we have seen more than one worker failure

"type": "static",
"lb_type": "round_robin",
"hosts": [{"url": "tcp://{{ ip_loopback_address }}:12000"}]
"hosts": [{"url": "tcp://{{ ip_loopback_address }}:4"}]
Member

Why this change? Are we making meaningful use of these?

Member Author

There is no great way to put a comment about this, but it does do more than nothing. Basically it verifies real flow of connecting to various things and then immediately failing. So it verifies that init flow doesn't get blocked, no crashes, etc. I admit it's not perfect but I don't think it's worth deleting. LMK what you think and if you want me to document somewhere.

Member

OK, so just to clarify, you changed these values to 4 here to ensure that they really would fail and not conflict with something on 12000? Might be useful to have this documented in some JSON comments..

Member Author

Yes. There are no JSON comments. :( Any other place I can put this?

Member

Maybe don't worry then, I think anywhere else would just be confusing to the reader. I think this will be better in v2 - we won't be using manually crafted JSON configs for integration tests, but protobuf objects that we build dynamically for each different integration test. Then it can be in the code where we add these objects.

on_worker_listener_added_cb_();
}

std::unique_lock<std::mutex> gaurd(listeners_mutex_);
Member

Nit: s/gaurd/guard/ (or is it gourd? :) )

// avoid locking.
server_.dispatcher().post([this, success, &listener]() -> void {
// It is theoretically possible for a listener to get added on 1 worker but not the others.
// The following code will yield 1 critical log and 1 stat. It will also cause listener
Member

I would rephrase this as it kind of confused me on the first pass, something to the effect of "The below check with onListenerCreateFailure() is there to ensure we execute the removal/logging/stats at most once on failure."
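The at-most-once behavior being discussed can be sketched in isolation. This is a simplified stand-in, not Envoy's actual `ActiveListener` implementation; only the `onListenerCreateFailure()` name comes from the quoted diff, and in the real code the check runs on the main thread after a dispatcher post rather than relying on an atomic:

```cpp
#include <atomic>

struct ListenerFailureState {
  std::atomic<bool> saw_failure_{false};

  // Returns true if a failure was already recorded, i.e. this is NOT the
  // first worker to report one.
  bool onListenerCreateFailure() { return saw_failure_.exchange(true); }
};

// Mirrors the quoted check: act only on the first failure so that the
// removal/critical-log/stat path executes at most once, even when several
// workers fail to create the same listener.
void handleWorkerAddResult(ListenerFailureState& listener, bool success,
                           int& removal_count) {
  if (!success && !listener.onListenerCreateFailure()) {
    ++removal_count;
  }
}
```

However many workers report failure, the removal path runs exactly once.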

@mattklein123
Member Author

@htuch updated

public:
/**
* Set the conditional to ready, should only be called once.
* Set the conditional to ready.
Member

Maybe explain that waitReady() resets, but in between waitReady(), setReady() should only be called once.

htuch
htuch previously approved these changes Jul 14, 2017
@mattklein123
Member Author

@htuch comment updated

@mattklein123 mattklein123 merged commit ea3b7c6 into master Jul 14, 2017
@mattklein123 mattklein123 deleted the listen_failure branch July 14, 2017 18:27
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Description: introduces an integration test harness for the platform bridge filter that enables testing scenarios with multiple filters and their interactions with the filter manager. It also introduces a test case for currently crashing behavior; a subsequent PR will fix and uncomment the test.
Risk Level: low - introduces new testing infrastructure.
Testing: new test framework, but with a commented-out test case.

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Description: introduces an integration test harness for the platform bridge filter that enables testing scenarios with multiple filters and their interactions with the filter manager. It also introduces a test case for currently crashing behavior; a subsequent PR will fix and uncomment the test.
Risk Level: low - introduces new testing infrastructure.
Testing: new test framework, but with a commented-out test case.

Signed-off-by: Jose Nino <[email protected]>
Signed-off-by: JP Simard <[email protected]>
mathetake pushed a commit that referenced this pull request Mar 3, 2026
**Description**
Closes envoyproxy/ai-gateway#1149

---------

Signed-off-by: Ion Koutsouris <[email protected]>
mathetake added a commit that referenced this pull request Mar 3, 2026
**Description**

This fixes the broken table in #1257.

Signed-off-by: Takeshi Yoneda <[email protected]>
3 participants