[RFC] cluster: introduce futureThreadLocalCluster interface to cluster manager by lambdai · Pull Request #15053 · envoyproxy/envoy

lambdai · 2021-02-15T09:34:15Z

Partially address #15026

This is a strawman that explains how ClusterManager embraces on-demand cluster by adding futureThreadLocalCluster() method . The ClusterManager user(e.g. the downstream terminated filter) need to switch from getThreadLocalCluster(). I rewrite the tcp proxy filter and add the basic test.

Note that tcp proxy does not need the details of why FutureCluster is not ready.
In the TcpProxyFutureClusterTest I introduced a DelayedFutureCluster to emulate the ondemand cluster config update.
In the traditional TcpProxyTest I introduced a ReadyFutureCluster to explain how to migrate to futureThreadLocalCluster and from getThreadLocalCluster()

Notes for reviewers:
I refactor the tcp proxy test to reduce the time on compile and run test. I can create a separate PR to land it in master if you like.
The newly added test live in test/common/tcp_proxy/tcp_proxy_cluster_test.cc.

Commit Message:

Additional Description:
Risk Level:
Testing:
Docs Changes:
Release Notes:
Platform Specific Features:
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

Signed-off-by: Yuchen Dai <[email protected]> DelayedFutureCluster and refactor tcp proxy tests Signed-off-by: Yuchen Dai <[email protected]> fix LocalClosedBeforeClusterIsReady Signed-off-by: Yuchen Dai <[email protected]> fix TcpProxyFutureClusterTest Signed-off-by: Yuchen Dai <[email protected]>

bryce-anderson · 2021-02-15T16:52:31Z

This looks like what we'd need for our use case. Since some details are missing, I presume the final implementation would have a failure pathway of sorts, but otherwise it appears that its goal is to provide a somewhat simple interface for establishing a callback for when the cluster resolves.

mattklein123 · 2021-02-16T01:24:59Z

At a very high level this looks right to me, however, I think like VHDS the actual lazy loading should be in its own HTTP or network filter that runs before the other filters. I think this would be a cleaner (and less risky design). My advice would be to first build the future API into CM, and then we can make the filters that use it?

htuch · 2021-02-16T02:13:53Z

+1 on separating the lazy load to a filter. By default, CM should not have to make decisions and we should have non-lazy EDS.

lambdai · 2021-02-16T04:30:24Z

@bryce-anderson Thank you for the feedback! The goal of this PR is to explain the necessary changes to the data path including CM beyond http request. I absolutely agree tthat the eventual on-demand CDS/EDS needs a fallback strategy.

lambdai · 2021-02-16T05:07:42Z

@htuch @mattklein123
There are 3 arguments

Should CM provide a ClusterManager::futureThreadLocalCluster() API?
My answer is Yes. Also I want to add that this API is independent of whether the cluster is lazy-loadable. See also question 3. futureThreadLocalCluster() along with FutureCluster could work when the cluster is not lazy-loadable or the cluster is lazy-loaded already. The FutureCluster::isReady() returns the state of whether the cluster is available and save the effort to await.

if (future->isReady()) {
  onClusterReady(*future);
} else {
  future_cluster_handle_ =
      future->await(read_callbacks_->connection().dispatcher(),
                    [this](Upstream::FutureCluster& f) { onClusterReady(f); });
}

What component should initiate the lazy load attempt if the cluster is lazy-loadable?
In Http conn manager we could use a preload http filter as in the VHDS flow because VHDS filter know the cluster name prior to create the connection in the http router filter.
In this PR which address the tcp_proxy. TcpProxy is the trigger because only TcpProxy filter knows the target cluster name. See TcpProxy::pickRoute Not all the network filters have the luxury to fire the lazy load attempt prior to request the upstream connection.
futureThreadLocalCluster() provides the minimum API to enable a lazyload filter.
Is non lazy-loadable EDS cluster allowed? Which API should decide if a known cluster is lazy-loadable?
This PR doesn't address this issue but I have two cents.
Let's exclude some answers. Filter should not decide whether a cluster is lazy-loadable. The filter could use FutureCluster::isReady() and not call FutureCluster::await() to avoid blindly wait.
I prefer to add a field in message Cluster to indicate whether CdsApi should sending EDS request put the cluster in a warming state.
For the on-demand CDS, because there is no cluster field before lazy-load the cluster (chicken-egg problem). We probably need to provide a matcher for CM to decide whether we should add a cluster name to watcher in on-demand CDS stream. Luckily lazy-loadable EDS cluster doesn't have this problem because we assume CDS must provide the cluster info that include is lazy-loadable.

htuch · 2021-02-17T03:15:27Z

@lambdai I think one thing that we would want to make sure is possible is receiving a CDS update, storing the cluster config locally in proto but not instantiating a full Cluster (with stats) and then only on-demand doing the instantiation. This would potentially save a ton of memory and CPU on updates for rarely uses clusters. We might even fetch the EDS resources, but leave them uninstantiated.

mattklein123 · 2021-02-17T22:55:20Z

re: (1), yes, agreed, let's add this functionality and test it as an independent PR. It's not very complicated as all the parts are already there. I would note that we don't explicitly need this to be part of the cluster manager API, since with the callback mechanisms in place I think this could be implemented on top. I don't feel strongly about it but you might consider a wrapper class.

re: (2) for HTTP I agree this should be a filter. For TCP, I'm not crazy about making the tcp_proxy filter even more complicated, so it might be worth it to see if we can refactor somehow to allow TCP route selection to be done by a previous filter? I think maybe this is possible? It's worth investigating.

re: (3) I agree with what @htuch said, but this part is complicated enough that I would recommend doing an independent design doc on this portion to make sure we are all on the same page.

lambdai · 2021-02-23T18:38:43Z

Thank you for the feedback! I am drafting and I will post in the linked issue

lambdai · 2021-02-24T00:32:44Z

Also design doc here https://docs.google.com/document/d/1-QgiqOCq8EcQO1cO22dRFKpuUglQp_pKZVXFCPmMSJQ/edit?usp=sharing

htuch · 2021-02-25T02:50:07Z

I think we can move lazy local expansion to a separate issue, but the API (and ideally implementation) should be forward thinking with this use case in mind, as it's something likely to come up in MT scenarios where on-demand doesn't work for reliability reasons.

github-actions · 2021-03-27T04:02:02Z

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions · 2021-04-03T04:02:13Z

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

lambdai mentioned this pull request Feb 15, 2021

Generic Dynamic Forward Proxy with On Demand CDS #15026

Closed

mattklein123 self-assigned this Feb 16, 2021

mattklein123 added the waiting label Feb 16, 2021

lambdai mentioned this pull request Feb 16, 2021

test: split tcp_proxy test into config test and data test #15067

Merged

lambdai mentioned this pull request Mar 11, 2021

add no cluster found error detail #15448

Closed

github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 27, 2021

github-actions bot closed this Apr 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] cluster: introduce futureThreadLocalCluster interface to cluster manager#15053

[RFC] cluster: introduce futureThreadLocalCluster interface to cluster manager#15053
lambdai wants to merge 1 commit intoenvoyproxy:mainfrom
lambdai:futureclusterproto

lambdai commented Feb 15, 2021 •

edited

Loading

Uh oh!

bryce-anderson commented Feb 15, 2021

Uh oh!

mattklein123 commented Feb 16, 2021

Uh oh!

htuch commented Feb 16, 2021

Uh oh!

lambdai commented Feb 16, 2021

Uh oh!

lambdai commented Feb 16, 2021

Uh oh!

htuch commented Feb 17, 2021

Uh oh!

mattklein123 commented Feb 17, 2021

Uh oh!

lambdai commented Feb 23, 2021

Uh oh!

lambdai commented Feb 24, 2021 •

edited

Loading

Uh oh!

htuch commented Feb 25, 2021

Uh oh!

github-actions bot commented Mar 27, 2021

Uh oh!

github-actions bot commented Apr 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

lambdai commented Feb 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bryce-anderson commented Feb 15, 2021

Uh oh!

mattklein123 commented Feb 16, 2021

Uh oh!

htuch commented Feb 16, 2021

Uh oh!

lambdai commented Feb 16, 2021

Uh oh!

lambdai commented Feb 16, 2021

Uh oh!

htuch commented Feb 17, 2021

Uh oh!

mattklein123 commented Feb 17, 2021

Uh oh!

lambdai commented Feb 23, 2021

Uh oh!

lambdai commented Feb 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

htuch commented Feb 25, 2021

Uh oh!

github-actions bot commented Mar 27, 2021

Uh oh!

github-actions bot commented Apr 3, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lambdai commented Feb 15, 2021 •

edited

Loading

lambdai commented Feb 24, 2021 •

edited

Loading