
Add Route timeouts #165

Merged
seanmonstar merged 1 commit into master from route-timeouts on Jan 23, 2019
Conversation

@seanmonstar
Contributor

@seanmonstar seanmonstar commented Dec 19, 2018

If a service profile specifies a timeout for a given route, this applies
a timeout waiting for responses on that route.

Timeouts show up in proxy metrics like this:

route_response_total{direction="outbound",dst="localhost:80",status_code="504",classification="failure",error="timeout"} 1
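
As a rough illustration of how those labels come about, here is a minimal std-only model (names here are hypothetical, not the proxy's actual types): a response that arrives within the route's timeout keeps its own status, while an elapsed timeout is reported as a synthesized 504 classified as a failure with error="timeout".

```rust
use std::time::Duration;

// Hypothetical sketch of per-route response-timeout classification.
struct RouteTimeout {
    timeout: Duration,
}

struct Classified {
    status_code: u16,
    classification: &'static str,
    error: Option<&'static str>,
}

impl RouteTimeout {
    // `elapsed` is how long we waited; `response` is Some(status) if the
    // upstream answered in time, None if the timeout fired first.
    fn classify(&self, elapsed: Duration, response: Option<u16>) -> Classified {
        match response {
            Some(status) if elapsed <= self.timeout => Classified {
                status_code: status,
                // 5xx responses are classed as failures by default.
                classification: if status < 500 { "success" } else { "failure" },
                error: None,
            },
            // Timeout (or no response within the window): synthesize a 504
            // and attach the error="timeout" label seen in the metric above.
            _ => Classified {
                status_code: 504,
                classification: "failure",
                error: Some("timeout"),
            },
        }
    }
}
```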

@seanmonstar seanmonstar changed the base branch from master to retry December 19, 2018 19:36
@seanmonstar
Contributor Author

Rebased on master after merging retries support. I think the build may fail since the alex/timeouts branch is missing the recent tower-h2 patch.

Member

@olix0r olix0r left a comment


This looks really good to me! Do you have any thoughts about the way to highlight these errors in metrics?

@olix0r
Member

olix0r commented Dec 20, 2018

Also, it appears that the Cargo.lock is not up-to-date...

@seanmonstar
Contributor Author

> Do you have any thoughts about the way to highlight these errors in metrics?

Do we want a separate metric for them? Or just an additional label? Do we want to surface them by default, or just provide the ability to, such as by adding a custom response header that Route Response Classifiers could look for?

@olix0r
Member

olix0r commented Jan 15, 2019

Several tests appear to hang on this branch:

:; cargo test -- http1::
    Finished dev [unoptimized + debuginfo] target(s) in 1.04s                                                                               
     Running target/debug/deps/linkerd2_proxy-5ddb558e37dea604

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 60 filtered out

     Running target/debug/deps/linkerd2_proxy-1c44db4914b1100b

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/debug/deps/discovery-e9c29a5a89d924cc

running 20 tests
test http1::absolute_uris::outbound_reconnects_if_controller_stream_ends ... ok
test http1::outbound_asks_controller_without_orig_dst ... ok
test http1::outbound_asks_controller_api ... ok
test http1::absolute_uris::outbound_asks_controller_api ... ok
test http1::absolute_uris::outbound_asks_controller_without_orig_dst ... ok
test http1::absolute_uris::outbound_error_reconnects_after_backoff ... ok
test http1::outbound_error_reconnects_after_backoff ... ok
test http1::outbound_reconnects_if_controller_stream_ends ... ok
test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_add_none ... test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_add_none has been running for over 60 seconds
test http1::absolute_uris::outbound_dest_concurrency_limit ... test http1::absolute_uris::outbound_dest_concurrency_limit has been running for over 60 seconds
test http1::absolute_uris::outbound_dest_limit_does_not_limit_dns ... test http1::absolute_uris::outbound_dest_limit_does_not_limit_dns has been running for over 60 seconds
test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists ... test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists has been running for over 60 seconds
test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_remove_none ... test http1::absolute_uris::outbound_destinations_reset_on_reconnect_followed_by_remove_none has been running for over 60 seconds
test http1::absolute_uris::outbound_times_out ... test http1::absolute_uris::outbound_times_out has been running for over 60 seconds
test http1::outbound_dest_limit_does_not_limit_dns ... test http1::outbound_dest_limit_does_not_limit_dns has been running for over 60 seconds
test http1::outbound_dest_concurrency_limit ... test http1::outbound_dest_concurrency_limit has been running for over 60 seconds
test http1::outbound_destinations_reset_on_reconnect_followed_by_add_none ... test http1::outbound_destinations_reset_on_reconnect_followed_by_add_none has been running for over 60 seconds
test http1::outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists ... test http1::outbound_destinations_reset_on_reconnect_followed_by_no_endpoints_exists has been running for over 60 seconds
test http1::outbound_destinations_reset_on_reconnect_followed_by_remove_none ... test http1::outbound_destinations_reset_on_reconnect_followed_by_remove_none has been running for over 60 seconds
test http1::outbound_times_out ... test http1::outbound_times_out has been running for over 60 seconds

@seanmonstar
Contributor Author

Looks like gRPC errors. My guess is something about my reference to the linkerd2-proxy-api is out of sync with this repo...

@seanmonstar
Contributor Author

Oh I see, it's related to having removed a bind timeout for the discovery tests...

@seanmonstar
Contributor Author

Yea, so the tests are checking that if we don't get destination::Updates from the controller after a certain amount of time, the request should fail. By removing the top BIND_TIMEOUT, these requests just sit in the buffer waiting for balance to have a service ready...

We can either:

  • Decide to just wait, and allow the pod to determine how long it wants to wait, in which case I'd remove many of these tests.
  • Figure out some sort of "service factory timeout" that could be applied to balance or buffer or similar.

@olix0r
Member

olix0r commented Jan 15, 2019

@seanmonstar For now, I'm in favor of removing this timeout. We don't want to stall forever when we're doing service discovery, but we should be conditioned to fail open rather than erroring hard as we do today.

@seanmonstar
Contributor Author

@olix0r to clarify, you suggest just removing these tests?

I believe adding some sort of timeout::ready::layer would be pretty straightforward; it wouldn't put a timeout on the whole request, just on how long the underlying poll_ready has to become ready before failing.
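
A minimal std-only sketch of that idea (simplified and with hypothetical names; the actual layer would be a tower middleware wrapping poll_ready): the deadline bounds only how long the inner service may take to become ready, and never applies once the call is dispatched.

```rust
use std::time::{Duration, Instant};

// Toy stand-in for a tower Service; the real trait is async.
trait Service {
    fn is_ready(&mut self) -> bool;
    fn call(&mut self, req: &str) -> String;
}

struct ReadyTimeout<S> {
    inner: S,
    timeout: Duration,
    deadline: Option<Instant>,
}

impl<S: Service> ReadyTimeout<S> {
    fn new(inner: S, timeout: Duration) -> Self {
        ReadyTimeout { inner, timeout, deadline: None }
    }

    // Ok(true): ready; Ok(false): keep waiting; Err: the deadline elapsed
    // before the inner service became ready.
    fn poll_ready(&mut self, now: Instant) -> Result<bool, &'static str> {
        if self.inner.is_ready() {
            self.deadline = None; // reset for the next wait
            return Ok(true);
        }
        let deadline = *self.deadline.get_or_insert(now + self.timeout);
        if now >= deadline {
            self.deadline = None;
            Err("ready timeout")
        } else {
            Ok(false)
        }
    }

    fn call(&mut self, req: &str) -> String {
        // No timeout here: once dispatched, the request runs to completion.
        // (A route-level response timeout is a separate concern.)
        self.inner.call(req)
    }
}
```

One consequence of resetting the deadline on readiness is that each wait gets a fresh window, rather than sharing one global bind timeout across requests.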

@olix0r
Member

olix0r commented Jan 15, 2019

@seanmonstar I think we should remove tests that depend on the old timeout logic, though I think those tests cover cases we still care about. It's not clear to me why they rely on the timeout.

@seanmonstar
Contributor Author

@olix0r those tests are specifically checking that we don't hang if for various reasons we aren't able to determine where to route to.

For reference, I've already written a timeout::ready::layer, which fits between the balance and buffer layers and only times out if resolving takes too long. Looks like tests are passing again. Or I can just scrap it.

@seanmonstar
Contributor Author

@olix0r Newest commit re-introduces the bind timeout, but it only applies to the time a request will wait in the buffer until the balance layer is ready. Once it's dispatched through (call), the timeout no longer applies. Yay or nay?

@olix0r
Member

olix0r commented Jan 16, 2019

I'm still seeing test failures here

failures:
    http1::absolute_uris::outbound_dest_concurrency_limit
    http1::outbound_dest_concurrency_limit
    http2::outbound_dest_concurrency_limit

@seanmonstar
Contributor Author

Hrm, it's passing locally for me. Fun!

@olix0r
Member

olix0r commented Jan 16, 2019

Tested on both OSX and Linux (at 4bbd0c3) and I reliably get:

:; cargo test --test discovery outbound_dest_concurrency_limit
    Finished dev [unoptimized + debuginfo] target(s) in 2.71s                                                                                                                 
     Running target/debug/deps/discovery-bca328795f37f3fc

running 3 tests
test http2::outbound_dest_concurrency_limit ... FAILED
test http1::outbound_dest_concurrency_limit ... FAILED
test http1::absolute_uris::outbound_dest_concurrency_limit ... FAILED

failures:

---- http2::outbound_dest_concurrency_limit stdout ----
Http2 server running; addr=127.0.0.1:63302
proxy running; control=127.0.0.1:63308, inbound=127.0.0.1:63310, outbound=127.0.0.1:63313 (SO_ORIGINAL_DST=127.0.0.1:63302), metrics=127.0.0.1:63316
trying 1th destination...
trying 2th destination...
trying 3th destination...
trying 4th destination...
trying 5th destination...
trying 6th destination...
trying 7th destination...
trying 8th destination...
trying 9th destination...
trying 10th destination...
trying 11th destination...
trying 127.0.0.1:63302 directly...
trying 11th destination again...
thread 'http2::outbound_dest_concurrency_limit' panicked at 'client.get("/") expects 2xx, got "500 Internal Server Error"', tests/support/client.rs:75:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.
server Listening dropped; addr=127.0.0.1:63302

---- http1::outbound_dest_concurrency_limit stdout ----
Http1 server running; addr=127.0.0.1:63300
proxy running; control=127.0.0.1:63306, inbound=127.0.0.1:63309, outbound=127.0.0.1:63312 (SO_ORIGINAL_DST=127.0.0.1:63300), metrics=127.0.0.1:63315
trying 1th destination...
trying 2th destination...
trying 3th destination...
trying 4th destination...
trying 5th destination...
trying 6th destination...
trying 7th destination...
trying 8th destination...
trying 9th destination...
trying 10th destination...
trying 11th destination...
trying 127.0.0.1:63300 directly...
trying 11th destination again...
thread 'http1::outbound_dest_concurrency_limit' panicked at 'client.get("/") expects 2xx, got "500 Internal Server Error"', tests/support/client.rs:75:9
server Listening dropped; addr=127.0.0.1:63300

---- http1::absolute_uris::outbound_dest_concurrency_limit stdout ----
Http1 server running; addr=127.0.0.1:63301
proxy running; control=127.0.0.1:63307, inbound=127.0.0.1:63311, outbound=127.0.0.1:63314 (SO_ORIGINAL_DST=127.0.0.1:63301), metrics=127.0.0.1:63317
trying 1th destination...
trying 2th destination...
trying 3th destination...
trying 4th destination...
trying 5th destination...
trying 6th destination...
trying 7th destination...
trying 8th destination...
trying 9th destination...
trying 10th destination...
trying 11th destination...
trying 127.0.0.1:63301 directly...
trying 11th destination again...
thread 'http1::absolute_uris::outbound_dest_concurrency_limit' panicked at 'client.get("/") expects 2xx, got "500 Internal Server Error"', tests/support/client.rs:75:9
server Listening dropped; addr=127.0.0.1:63301


failures:
    http1::absolute_uris::outbound_dest_concurrency_limit
    http1::outbound_dest_concurrency_limit
    http2::outbound_dest_concurrency_limit

test result: FAILED. 0 passed; 3 failed; 0 ignored; 0 measured; 30 filtered out

error: test failed, to rerun pass '--test discovery'

@seanmonstar
Contributor Author

I'm still shocked it passes locally, as I've found why it definitely shouldn't. The ready timeout was causing the buffer to be closed for all later requests. So I'm going to pull that back out, and just see what I can do for the discovery tests.

@seanmonstar
Contributor Author

Many of the discovery tests were exercising things like not exceeding our gRPC Destination Service stream count, or that reconnects make the proxy "forget" old addresses, and in both cases the request will wait for capacity to free up, or for an update from the controller. Thus, they were hitting our BIND_TIMEOUT.

I've changed those tests to instead expect a client-side timeout, asserting that the request doesn't proceed past our balance step, and timing out the request directly in the unit test.

There was one test that was specifically testing the BIND_TIMEOUT, that a destination with no updates would eventually time out. I've slapped #[ignore] on that, since I figure we will eventually want to add in some sort of destination-acquisition timeout.

Cargo.toml Outdated
  linkerd2-timeout = { path = "lib/timeout" }

- linkerd2-proxy-api = { git = "https://github.com/linkerd/linkerd2-proxy-api", rev = "1ad7018", version = "0.1.4" }
+ linkerd2-proxy-api = { git = "https://github.com/linkerd/linkerd2-proxy-api", branch = "sean/timeouts" } #tag = "v0.1.4", version = "0.1.4" }
Contributor Author


Should point back at master (or a tag) before merging.

fn start<B>(self, rsp: &http::Response<B>) -> Eos {
    if rsp.extensions().get::<timeout::ProxyTimedOut>().is_some() {
        return Eos::Error("timeout");
    }
Member


This seems fine, though you might also consider doing this slightly differently: instead of making the classify module enumerate all of the types of errors it may observe, we could instead have modules add a classify::ProxyError(&'static str) marker onto responses -- so this would be able to work with arbitrary error types in the stack, not just timeouts.

This then becomes something to the effect of:

        if let Some(ProxyError(msg)) = rsp.extensions().get::<ProxyError>() {
            return Eos::Error(msg);
        }
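
For illustration, a std-only model of that suggestion (Extensions here is a toy stand-in for http::Extensions, and the names are hypothetical): any layer that fails a request inserts a ProxyError marker, and the classifier reads it back without knowing the concrete error type.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;

// Marker an erroring layer attaches to a response.
#[derive(Debug, PartialEq)]
struct ProxyError(&'static str);

// Toy typed-extensions map, modeling http::Extensions.
#[derive(Default)]
struct Extensions(HashMap<TypeId, Box<dyn Any>>);

impl Extensions {
    fn insert<T: Any>(&mut self, val: T) {
        self.0.insert(TypeId::of::<T>(), Box::new(val));
    }
    fn get<T: Any>(&self) -> Option<&T> {
        self.0.get(&TypeId::of::<T>()).and_then(|b| b.downcast_ref())
    }
}

// What the classify module's `start` would reduce to: one generic check,
// with no per-error-type matching.
fn classify(ext: &Extensions, status: u16) -> &'static str {
    if let Some(&ProxyError(msg)) = ext.get::<ProxyError>() {
        return msg;
    }
    if status < 500 { "success" } else { "failure" }
}
```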

Contributor Author


That'd seem to have app types present in proxy modules... Unless you mean just a general proxy::http::classify::ProxyError thingy?

Member


ah, yeah... good point.

Member

@olix0r olix0r left a comment


Modulo the proxy-api dependency and the Error specialization issue I flagged (not critical, though something we'll likely want to change at some point), I think this is basically good to merge. 👍

@seanmonstar
Contributor Author

Oh noes, seems I was bitten by #170 not yet being merged, and thus quickcheck versions mismatched...

@seanmonstar seanmonstar force-pushed the route-timeouts branch 2 times, most recently from e973750 to fd07179 on January 23, 2019 22:15
If a service profile specifies a timeout for a given route, this applies
a timeout waiting for responses on that route.

Signed-off-by: Sean McArthur <[email protected]>
@seanmonstar
Contributor Author

Updated, green, just needs final r+.

Member

@olix0r olix0r left a comment


thanks!

@seanmonstar seanmonstar merged commit 28fd5e7 into master Jan 23, 2019
@seanmonstar seanmonstar deleted the route-timeouts branch January 23, 2019 23:10
hawkw added a commit to linkerd/linkerd2 that referenced this pull request Jan 24, 2019
proxy: update pinned version to 5b507a9

This picks up the following proxy commits:

* eaabc48 Update tower-grpc
* e9561de Update h2 to 0.1.16
* 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165)
* 5637372 Re-flag tcp_duration tests as flaky
* 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy#177)
* ae16978 Remove flakiness from 'profiles' tests
* 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174)
* 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173)
* 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175)
* 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1

Some of these changes will probably need changelog entries:

* Improve logging when rejecting malformed HTTP/2 pseudo-headers
  (hyperium/h2#347)
* Improve logging for gRPC errors (tower-rs/tower-grpc#111)
* Add Route timeouts (linkerd/linkerd2-proxy#165)
* Downgrade several of the noisiest log messages to TRACE
  (linkerd/linkerd2-proxy#177)
* Add an environment variable for configuring the DNS canonicalization
  timeout (linkerd/linkerd2-proxy#175)
* Make outbound router honor `l5d-dst-override` header
  (linkerd/linkerd2-proxy#173)

Perhaps all the logging related changes can be grouped into one
changelog entry, though...

Signed-off-by: Eliza Weisman <[email protected]>
sprt pushed a commit to sprt/linkerd2-proxy that referenced this pull request Aug 30, 2019
simulate-proxy sends HTTP example data.

Modify this test script to also send TCP example data.

Part of linkerd#132

Signed-off-by: Andrew Seigner <[email protected]>