Conversation
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
linkerd/app/src/identity.rs
Outdated
.push_timeout(control.connect.timeout)
.push(control::client::layer())
.push(control::resolve::layer(dns))
.push(discover)
Can we abstract all of that for identity, dst, and the oc collector? If so, what would a method that produces this stack return? `impl MakeService`? I had some trouble with the type constraints.
linkerd/app/src/lib.rs
Outdated
const EWMA_DEFAULT_RTT: Duration = Duration::from_millis(30);
const EWMA_DECAY: Duration = Duration::from_secs(10);
What are sensible values for all these constants and magic numbers when it comes to communicating with the control plane?
hawkw
left a comment
The overall approach looks good. I had some notes on the implementation.
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
hawkw
left a comment
Overall, this definitely feels like the right approach. I had a few more suggestions.
Signed-off-by: Zahari Dichev <[email protected]>
Force-pushed from 458eeb9 to 0084048
Signed-off-by: Zahari Dichev <[email protected]>
Force-pushed from 0084048 to 29e410a
Signed-off-by: Zahari Dichev <[email protected]>
kleimkuhler
left a comment
Overall this is looking really good! Walking through the tests a little more, but just had some smaller comments to add.
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
…d2-proxy into zd/control-plane-discover
Are any control plane changes needed to test this? I seem to hit errors like: Are we going to have to support falling back to kube-proxy?
@olix0r yes, you need to make identity and dst headless. Here is the branch: linkerd/linkerd2@17e6490. Do we want to fall back on kube-proxy? If so, how would we do that? Just do an IP lookup in case we error out?
* dns-resolve: Always use resets

  There's no reason that we have to maintain the resolution state now that we have a Reset type. Furthermore, we can use a unit endpoint type, since it is ignored.
* undo errant change
* Use map_endpoint to build control client targets
* undo unnecessary change
* lookup_service_ips => lookup_service_addrs
* dns: Run DNS resolutions on the main runtime

  DNS resolutions are run on the admin runtime. This requires an unnecessary layer of indirection around the resolver, including an MPSC. Now that we allow the main runtime to use more than one thread, it's preferable to do this discovery on the main runtime, and we can simplify the implementation.
* Remove needless recursion_limit settings
* Set span on background task
* Fallback from SRV records to A records when SRV records are invalid
* Fallback on each call, not only the first
* -async_stream; not pin, etc
* touchup
* instrument dns resolver
* trace log on close
* -boilerplate
* -recursion_limit
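The "Always use resets" commit above drops incremental state tracking in favor of full-snapshot updates. A minimal sketch of the idea, with hypothetical names (not the proxy's actual update type): a DNS-backed resolver re-resolves periodically and only ever knows the current full endpoint set, so it can emit a `Reset` each time instead of computing `Add`/`Remove` deltas.

```rust
// Hypothetical sketch: an update stream that can carry a full snapshot
// (`Reset`) instead of only incremental `Add`/`Remove` deltas, so the
// consumer never has to reconstruct resolver state. Names are illustrative.
use std::collections::HashSet;
use std::net::SocketAddr;

#[derive(Debug)]
enum Update {
    Add(Vec<SocketAddr>),
    Remove(Vec<SocketAddr>),
    // Replace the whole endpoint set at once.
    Reset(Vec<SocketAddr>),
}

fn apply(endpoints: &mut HashSet<SocketAddr>, update: Update) {
    match update {
        Update::Add(addrs) => endpoints.extend(addrs),
        Update::Remove(addrs) => {
            for a in &addrs {
                endpoints.remove(a);
            }
        }
        // A periodic DNS resolution yields the current full set, so the
        // resolver never needs to remember what it reported last time.
        Update::Reset(addrs) => *endpoints = addrs.into_iter().collect(),
    }
}

fn main() {
    let mut eps = HashSet::new();
    apply(&mut eps, Update::Add(vec!["10.0.0.1:8086".parse().unwrap()]));
    apply(&mut eps, Update::Reset(vec!["10.0.0.2:8086".parse().unwrap()]));
    let expect: SocketAddr = "10.0.0.2:8086".parse().unwrap();
    assert_eq!(eps.len(), 1);
    assert!(eps.contains(&expect));
    println!("reset ok");
}
```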
hawkw
left a comment
much simpler now! this lgtm!
Service = discover::MakeEndpoint<
    discover::FromResolve<map_endpoint::Resolve<IntoTarget, DnsResolve>, Target>,
    M,
not a blocker but could we maybe make this thing a type alias or something?
it's only used the once, isn't it? so, yes, but it wouldn't really reduce anything
This release includes internal changes to the service discovery system, especially when discovering control plane components (like the destination and identity controllers). Now, the proxy attempts to balance requests across all pods in each control plane service. This requires control plane changes to use "headless" services so that SRV records are exposed. When the control plane services have a `clusterIP` set, the proxy falls back to using normal A-record lookups.

---

* tracing: add richer verbose spans to http clients (linkerd/linkerd2-proxy#622)
* trace: update tracing dependencies (linkerd/linkerd2-proxy#623)
* Remove `Resolution` trait (linkerd/linkerd2-proxy#606)
* Update proxy-identity to edge-20.8.2 (linkerd/linkerd2-proxy#627)
* Add build arg for skipping identity wrapper (linkerd/linkerd2-proxy#624)
* Wait for proxy thread to terminate in integration tests (linkerd/linkerd2-proxy#625)
* Remove scrubbing for unused headers (linkerd/linkerd2-proxy#628)
* Split orig-proto tests out of discovery tests (linkerd/linkerd2-proxy#629)
* Re-enable outbound timeout test (linkerd/linkerd2-proxy#630)
* profiles: perform profile resolution for IP addresses (linkerd/linkerd2-proxy#626)
* Move resolve api to async-stream (linkerd/linkerd2-proxy#599)
* Decouple discovery buffering from endpoint conversion (linkerd/linkerd2-proxy#631)
* resolve: Add a Reset state (linkerd/linkerd2-proxy#633)
* resolve: Eagerly fail resolutions (linkerd/linkerd2-proxy#634)
* test: replace `net2` dependency with `socket2` (linkerd/linkerd2-proxy#635)
* dns: Run DNS resolutions on the main runtime (linkerd/linkerd2-proxy#637)
* Load balance requests to the control plane (linkerd/linkerd2-proxy#594)
* Unify control plane client construction (linkerd/linkerd2-proxy#638)
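The SRV-to-A fallback described in the release notes can be sketched roughly as below. `resolve_srv` and `resolve_a` are hypothetical stand-ins for the proxy's real resolver calls; the point is only the control flow: try SRV first, and fall back to a plain A-record lookup with a fixed port when no SRV records exist (e.g. the Service has a `clusterIP`).

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

// Hypothetical SRV lookup: a headless Service exposes one SRV record per
// pod; a clusterIP Service exposes none, modeled here as an error.
fn resolve_srv(name: &str) -> Result<Vec<SocketAddr>, &'static str> {
    if name == "linkerd-dst-headless" {
        Ok(vec![
            SocketAddr::new(IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)), 8086),
            SocketAddr::new(IpAddr::V4(Ipv4Addr::new(10, 0, 0, 2)), 8086),
        ])
    } else {
        Err("no SRV records")
    }
}

// Hypothetical A-record lookup: returns the clusterIP; the port must be
// supplied by the caller since A records carry no port information.
fn resolve_a(_name: &str, port: u16) -> Vec<SocketAddr> {
    vec![SocketAddr::new(IpAddr::V4(Ipv4Addr::new(10, 96, 0, 10)), port)]
}

// Try SRV first; fall back to A records when SRV resolution fails.
fn resolve(name: &str, port: u16) -> Vec<SocketAddr> {
    resolve_srv(name).unwrap_or_else(|_| resolve_a(name, port))
}

fn main() {
    // Headless service: SRV records yield one address per pod.
    assert_eq!(resolve("linkerd-dst-headless", 8086).len(), 2);
    // clusterIP service: SRV fails, A-record fallback supplies the port.
    assert_eq!(resolve("linkerd-dst", 8086)[0].port(), 8086);
    println!("fallback ok");
}
```

Note the per-pod balancing only works in the SRV case; the A-record path yields a single virtual address, which is why the headless-service change in the control plane is required.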
Any chance that this is not fixed in non-HA mode? We had this behaviour recently after an AKS node crash: we had 2 Linkerd Destination pods, one of which was on that node 🤔
This PR enables the discovery of control plane components through DNS. This is done by implementing a resolution stream backed by a DNS resolver. The stream will yield `Update`s that feed the load balancer. The stream is written with `async-stream` to avoid manual state machines. In addition to that, a `dns::Resolver` trait is introduced to allow for easily mocking a DNS resolver and testing the stream implementation. I have tested this and can confirm that it resolves the problems described in linkerd/linkerd2#4674 and linkerd/linkerd2#4624.
Signed-off-by: Zahari Dichev <[email protected]>
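The `dns::Resolver` trait mentioned in the PR description is the testing seam: tests substitute a mock for real DNS and drive the resolution stream from fixed answers. A minimal sketch under assumed names (the proxy's actual trait is async and considerably richer; this synchronous simplification only illustrates the shape):

```rust
use std::net::SocketAddr;

// Hypothetical resolver trait: the resolution stream is generic over this,
// so tests can inject canned answers instead of hitting real DNS.
trait Resolver {
    fn resolve(&self, name: &str) -> Vec<SocketAddr>;
}

// A fixed-answer mock for tests.
struct MockResolver {
    addrs: Vec<SocketAddr>,
}

impl Resolver for MockResolver {
    fn resolve(&self, _name: &str) -> Vec<SocketAddr> {
        self.addrs.clone()
    }
}

// Code under test depends only on the trait, never on a concrete resolver.
fn discover<R: Resolver>(resolver: &R, name: &str) -> usize {
    resolver.resolve(name).len()
}

fn main() {
    let mock = MockResolver {
        addrs: vec![
            "10.0.0.1:8086".parse().unwrap(),
            "10.0.0.2:8086".parse().unwrap(),
        ],
    };
    assert_eq!(discover(&mock, "linkerd-dst-headless.linkerd.svc"), 2);
    println!("mock resolver ok");
}
```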