Rewrite the destination client and remove DNS fallback#259

Merged
hawkw merged 46 commits into master from liza/remove-cache
May 23, 2019

Conversation

@hawkw hawkw commented May 20, 2019

This branch removes the fall-back-to-DNS behaviour from the proxy's
destination module. Instead, when the Destination service returns "no
endpoints", the proxy will route the request to its original
destination, using the fallback behaviour introduced in #248.

In addition, the proxy now no longer reconnects to the destination
service for queries which receive a response with the Invalid Argument
status code. After linkerd/linkerd2#2664, this status is used by the
control plane to indicate names which should not query the destination
service. I've added a new test for this.

Furthermore (and unlike #245), this branch also removes the caching of
service discovery lookups that allowed them to be shared between
resolvers for authorities with the same canonical form. Reusing queries
was much more important when we limited the number of concurrently
in-flight destination service queries to 100; now, the cache adds a lot
of unnecessary complexity.

The rewritten control::destination module is now much simpler and
(hopefully) easier to understand. The Resolver holds a destination
service Client, which is cloned every time a new Resolution is
requested. Each active Resolution has a corresponding background task
that drives the destination service query and sends updates through a
channel to the Resolver.

The background task is necessary so that updates are processed as they
are received, rather than only when the Resolution is polled (i.e.
when the load balancer is routing a request). In addition, the channel
provides the queueing necessary to transform the control plane's update
types (which contain multiple endpoint addresses) into multiple
tower-balance updates for single endpoints, and to invalidate stale
endpoints when the control plane connection is reestablished. When a
control plane query is not necessary because the requested authority
does not match the proxy's destination suffixes, the background task is
not spawned and the Resolution remains permanently in the "no
endpoints" state; and when an InvalidArgument error code is received,
the background task terminates.
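The flow described above can be sketched roughly as follows. This is a
minimal stand-in using std threads and channels rather than the proxy's
actual futures, gRPC streams, and tower-balance machinery; all of the
type and function names here (`Update`, `PbUpdate`, `flatten`) are
hypothetical, not the proxy's real API.

```rust
use std::sync::mpsc;
use std::thread;

/// A single-endpoint update, as consumed by the load balancer.
#[derive(Debug, PartialEq)]
enum Update {
    Add(String),
    Remove(String),
}

/// A control-plane update, which may carry multiple endpoint addresses.
enum PbUpdate {
    Add(Vec<String>),
    Remove(Vec<String>),
}

/// Flatten one multi-endpoint control-plane update into per-endpoint updates.
fn flatten(update: PbUpdate) -> Vec<Update> {
    match update {
        PbUpdate::Add(addrs) => addrs.into_iter().map(Update::Add).collect(),
        PbUpdate::Remove(addrs) => addrs.into_iter().map(Update::Remove).collect(),
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // The background task: drives the destination-service query and sends
    // flattened, single-endpoint updates through the channel as they arrive.
    thread::spawn(move || {
        let stream = vec![
            PbUpdate::Add(vec!["10.0.0.1:80".into(), "10.0.0.2:80".into()]),
            PbUpdate::Remove(vec!["10.0.0.1:80".into()]),
        ];
        for pb in stream {
            for update in flatten(pb) {
                tx.send(update).unwrap();
            }
        }
    });

    // The `Resolution` side: the load balancer polls single-endpoint
    // updates off the channel whenever it is driven.
    let received: Vec<Update> = rx.iter().collect();
    assert_eq!(received.len(), 3);
}
```

The channel is what decouples receiving updates from polling: the
background task can keep draining the control-plane stream even while
the balancer is idle.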

A majority of the service discovery tests pass unchanged after this
change. The following modifications were necessary:

  • The test outbound_updates_newer_services was removed, as it tests
    that the same destination service query is shared between multiple
    services with the same authority (which is no longer the case).
  • The mock destination service was changed to send the Unavailable
    status code when it has run out of expected destinations, rather than
    InvalidArgument, as that code is used to indicate that a destination
    should never be reconnected.
  • New tests were added for falling back to ORIG_DST when no endpoints
    exist, and for not reconnecting after an InvalidArgument error.

Closes #245.
Closes linkerd/linkerd2#2661
Closes linkerd/linkerd2#2728

Signed-off-by: Eliza Weisman [email protected]

hawkw added 30 commits May 15, 2019 13:15
The existing discovery tests hit some of the fallback behaviour, but
these new tests explicitly exercise the fallback behaviour around "no
endpoints" responses.

This test covers behaviour that is no longer implemented. In
particular, it asserts that when a new service is built that shares a
destination authority with an existing resolution, any updates
previously discovered for the existing resolution are also used for the
new service without another Destination service query. Removing the
caching and sharing of queries between multiple resolvers (the primary
goal of this branch) makes it impossible for this test to pass.

hawkw added 10 commits May 20, 2019 12:51
This way, the service will not be called if it is not ready when the
resolution is constructed.

@hawkw hawkw requested review from olix0r and seanmonstar May 20, 2019 22:37
@hawkw hawkw self-assigned this May 20, 2019
hawkw added 3 commits May 20, 2019 16:08

@olix0r olix0r left a comment

This approach looks really, really, really good to me! 🙌

What testing have you done with this branch? (I.e. have you run integration tests? anything else?)


@olix0r olix0r left a comment

OK; I've run a test for the last ~12 hours -- this change does not resolve the memory leak issue we've been chasing; but otherwise, the change seems stable and good. Ship it!

It would be good to get another pair of eyes on this before merging, though...


hawkw commented May 23, 2019

It would be good to get another pair of eyes on this before merging, though...

@seanmonstar, if you have a chance today, would you mind taking a look at this branch? Thanks!


hawkw commented May 23, 2019

I've done some additional manual testing on this branch, injecting the proxy into an nginx ingress controller. As of b0ade34, the proxy will now route outbound requests from the ingress controller to the correct backend, even when the ingress controller is not rewriting the host header or adding a l5d-dst-override header.

A quick demonstration

Note that the ingress is not configured to add or rewrite any headers:

➜  kubectl -n emojivoto get ingress -o yaml
apiVersion: v1
items:
- apiVersion: extensions/v1beta1
  kind: Ingress
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"extensions/v1beta1","kind":"Ingress","metadata":{"annotations":{"kubernetes.io/ingress.class":"nginx"},"name":"web-ingress","namespace":"emojivoto"},"spec":{"rules":[{"host":"example.com","http":{"paths":[{"backend":{"serviceName":"web-svc","servicePort":80}}]}}]}}
      kubernetes.io/ingress.class: nginx
    creationTimestamp: "2019-05-23T17:57:39Z"
    generation: 1
    name: web-ingress
    namespace: emojivoto
    resourceVersion: "268932"
    selfLink: /apis/extensions/v1beta1/namespaces/emojivoto/ingresses/web-ingress
    uid: 41534b84-7d84-11e9-aca6-42010a8a0176
  spec:
    rules:
    - host: example.com
      http:
        paths:
        - backend:
            serviceName: web-svc
            servicePort: 80
  status:
    loadBalancer:
      ingress:
      - ip: 34.83.77.151
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

However, curling the ingress IP routes correctly to the web service:

➜  curl -H "Host: example.com" http://34.83.12.9 -vvv
* Rebuilt URL to: http://34.83.12.9/
*   Trying 34.83.12.9...
* TCP_NODELAY set
* Connected to 34.83.12.9 (34.83.12.9) port 80 (#0)
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 200 OK
< server: nginx/1.15.5
< date: Thu, 23 May 2019 21:14:27 GMT
< content-type: text/html
< content-length: 560
< vary: Accept-Encoding
<

	<!DOCTYPE html>
	<html>
		<head>
			<meta charset="UTF-8">
			<title>Emoji Vote</title>
			<link rel="icon" href="/img/favicon.ico">

			<script async src="https://www.googletagmanager.com/gtag/js?id=UA-60040560-4"></script>
			<script>
			  window.dataLayer = window.dataLayer || [];
			  function gtag(){dataLayer.push(arguments);}
			  gtag('js', new Date());
			  gtag('config', 'UA-60040560-4');
			</script>
		</head>
		<body>
			<div id="main" class="main"></div>
		</body>

			<script type="text/javascript" src="/js" async></script>

* Connection #0 to host 34.83.12.9 left intact
	</html>%

The proxy logs confirm that the right thing happened:

TRCE [   123.959793s] example.com:80 linkerd2_proxy::control::destination resolve; authority=NameAddr { name: "example.com", port: 80 }
TRCE [   123.959797s] example.com:80 linkerd2_proxy::control::destination -> authority example.com:80 not in search suffixes
TRCE [   123.959815s] example.com:80 linkerd2_proxy::control::destination::resolution resolution daemon has terminated
TRCE [   123.959821s] example.com:80 linkerd2_proxy::proxy::http::balance no endpoints for /
TRCE [   123.959833s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.44:43272} linkerd2_proxy::proxy::http::fallback load balancer has no endpoints; trying to fall back
TRCE [   123.959853s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.44:43272} linkerd2_proxy::proxy::http::router routing...
DBUG [   123.959859s] proxy={server=out listen=127.0.0.1:4140 remote=10.52.0.44:43272} linkerd2_proxy::app::main outbound ep=Some(Endpoint { dst_name: None, addr: V4(10.52.0.4:80), identity: None(NoPeerName(NotProvidedByServiceDiscovery)), metadata: Metadata { weight: 10000, labels: {}, protocol_hint: Unknown, identity: None }, http_settings: Http1 { keep_alive: true, wants_h1_upgrade: false, was_absolute_form: false } })

We may now want to update the ingress section in the documentation to reflect that rewriting host headers or adding an l5d-dst-override header is no longer strictly necessary for injected ingress controllers to route correctly. However, it is important to note that it may still be preferable to do so, as the fallback original destination routing will not be encrypted or include most of the telemetry labels.


@seanmonstar seanmonstar left a comment

Wayyyy cleaner! Such love! 😍

@hawkw hawkw merged commit 1c91a39 into master May 23, 2019
hawkw added a commit that referenced this pull request Jun 2, 2019
When the Destination service returns a `NoEndpoints` response, a field
`exists` is set if the destination _does_ exist in service discovery
but has no endpoints. The API spec states that:

> `no_endpoints{exists: false}` indicates that the service does not exist
> and the client MAY try an alternate service discovery method (e.g. DNS).
>
> `no_endpoints(exists: true)` indicates that the service does exist and
> the client MUST NOT fall back to an alternate service discovery
> method.

When the DNS fallback behaviour was removed from the proxy in #259, this
field was overlooked, and the proxy was changed to fall back to original
destination routing _any_ time the control plane indicates that no
endpoints exist for a destination. This is incorrect, as the proxy
should always treat the destination service as authoritative.
Additionally, this means that requests to endpoints which are known to
not exist will construct an unnecessary client service, and eventually
fail with a 502 error when the upstream client connection to the
non-existent endpoint ultimately fails.

This branch changes the `control::destination::resolution::Daemon`
future to only send `Update::NoEndpoints` (which indicates that the load
balancer should fall back) to the load balancer when the Destination
service response has `exists: false`, or on `InvalidArgument` errors,
and _not_ when a `no_endpoints{exists: true}` response is received.
These requests will now fail fast with a 503 error rather than
attempting an ultimately futile fallback request.
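The distinction between the two `NoEndpoints` cases can be sketched as
a small decision function. This is a hypothetical simplification, not
the Daemon's actual code; `Action`, `on_no_endpoints`, and the variant
names are stand-ins for illustration.

```rust
/// What the resolution should do when the Destination service reports
/// that a destination has no endpoints.
#[derive(Debug, PartialEq)]
enum Action {
    /// Tell the load balancer it has no endpoints, allowing the
    /// fallback to original-destination routing.
    FallBack,
    /// Leave the balancer empty so requests fail fast with a 503,
    /// since the Destination service is authoritative.
    FailFast,
}

fn on_no_endpoints(exists: bool) -> Action {
    if exists {
        // `no_endpoints{exists: true}`: the service exists in service
        // discovery but currently has no endpoints. Per the API spec,
        // the client MUST NOT fall back to another discovery method.
        Action::FailFast
    } else {
        // `no_endpoints{exists: false}`: the service is unknown to
        // service discovery, so the client MAY try an alternative.
        Action::FallBack
    }
}

fn main() {
    assert_eq!(on_no_endpoints(true), Action::FailFast);
    assert_eq!(on_no_endpoints(false), Action::FallBack);
}
```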

Previously, one of the discovery tests expected the incorrect behaviour.
I've changed this test to now expect that destinations known to not
exist do _not_ fall back, and renamed it from
`outbound_falls_back_to_orig_dst_when_destination_has_no_endpoints` to
`outbound_fails_fast_when_destination_has_no_endpoints`. I've also
confirmed that the updated test fails against master and passes after
this change.

Fixes linkerd/linkerd2#2880

Signed-off-by: Eliza Weisman <[email protected]>
hawkw added a commit that referenced this pull request Jun 12, 2019
When the DNS fallback behavior was removed from the proxy in #259, the
proxy was changed to fall back to original destination routing _any_
time the control plane indicates that no endpoints exist for a
destination. This is incorrect, as the proxy should always treat the
destination service as authoritative. Additionally, this means that
requests to endpoints which are known to not exist will construct an
unnecessary client service, and eventually fail with a 502 error when
the upstream client connection to the non-existent endpoint ultimately
fails. Instead, the proxy should only fall back when the destination is
outside the search suffixes or the Destination service returns an
`InvalidArgument` response.

This is a much larger change than #263. In particular, the fallback
behavior has been moved from when the actual client service is
called to when the client service is _constructed_, as we do not expect
to receive an `InvalidArgument` response on a Destination query that
has previously received updates.

This was implemented by changing the `Resolve` trait to return a
`Future` whose `Item` type is a `Resolution`, rather than returning a
`Resolution`. When the Destination query returns `InvalidArgument`, the
future will fail. The `proxy::http::fallback` middleware has been
changed so that rather than making both the primary and fallback service
and falling back on a per-request basis, it tries to make the primary
service, and falls back if the primary `MakeService` future fails. 
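The construction-time fallback described above can be sketched like
this. It is a synchronous simplification under assumed names
(`make_primary`, `make_with_fallback`, the `.svc.cluster.local` check):
the real proxy does this with a `Future` of a `Resolution` and tower
`MakeService`s, not plain functions.

```rust
/// Try to construct the primary, service-discovery-backed client.
/// Fails when the Destination query is rejected, e.g. with
/// `InvalidArgument` for a name outside the search suffixes.
fn make_primary(authority: &str) -> Result<String, String> {
    if authority.ends_with(".svc.cluster.local") {
        Ok(format!("balancer({})", authority))
    } else {
        Err("InvalidArgument".into())
    }
}

/// Fall back at *construction* time: if building the primary client
/// fails, build a client that forwards to the original destination
/// instead. No per-request fallback decision is needed afterwards.
fn make_with_fallback(authority: &str, orig_dst: &str) -> String {
    make_primary(authority).unwrap_or_else(|_| format!("forward({})", orig_dst))
}

fn main() {
    // A cluster-local name resolves through the balancer...
    assert_eq!(
        make_with_fallback("web.emojivoto.svc.cluster.local", "10.0.0.1:80"),
        "balancer(web.emojivoto.svc.cluster.local)"
    );
    // ...while an external name falls back to the original destination.
    assert_eq!(
        make_with_fallback("example.com", "34.83.12.9:80"),
        "forward(34.83.12.9:80)"
    );
}
```

Deciding once, when the service is made, means every request through a
successfully constructed balancer takes the same path.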

Although this is a large change, I think the resulting code is simpler
and more elegant than the previous approach.

Previously, one of the discovery tests expected the incorrect behavior.
I've changed this test to now expect that destinations known to not
exist do _not_ fall back, and renamed it from
`outbound_falls_back_to_orig_dst_when_destination_has_no_endpoints` to
`outbound_fails_fast_when_destination_has_no_endpoints`. I've also
confirmed that the updated test fails against master and passes after
this change.

Fixes linkerd/linkerd2#2880
Closes #263 

Signed-off-by: Eliza Weisman <[email protected]>
@olix0r olix0r deleted the liza/remove-cache branch August 17, 2019 01:36

Development

Successfully merging this pull request may close these issues.

  • proxy: Don't retry discovery requests that return InvalidArgument
  • proxy: Simplify destination discovery fallback
