Optimize EDS stream allocation#22419
Conversation
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
cc @htuch we have chatted about this change long time ago prior to my parental leave. |
|
Thanks so much @nezdolik, this is a definitely long asked for feature from the community and will be a welcome contribution. Assigning @adisuissa for review, since Adi is now leading work around xDS architecture in Envoy. |
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
/wait on CI fixes |
|
Apologies for the late reply. I'll start by giving some background on where I think we should be heading, and then answer your questions. In the future, as we move towards the new xDS-TP, the subscriptions to the same authority will essentially be multiplexed using this feature. Your optimized EDS stream allocation will work for each ADS subscription to a specific authority. IIUC the current design is: The cluster manager has a single For this PR I've got a few high-level comments to start with:
Addressing your questions:
LMK if there's something unclear, and if I can help with anything. |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
@adisuissa thanks for a very detailed feedback and apologies for late reply, I have been stuck in figuring out mocks fix for the remaining Eds tests. I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good? I will start addressing your feedback in a series of commits and will ping you when all parts are done. May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response. |
Sounds good to me, thanks for all the efforts!
I wonder if the upstream statistics can be used to count the number of connections. |
|
/wait-any |
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
Applied portion of review comments:
I will now work on integration test, add tests for new feature and make it parametrised for xds type (sotw, delta) and grpc flavour (envoy, google). |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
| EXPECT_EQ(0, test_server_->gauge("cluster.cluster_1.membership_total")->value()); | ||
| EXPECT_EQ(0, test_server_->gauge("cluster.cluster_2.membership_total")->value()); | ||
| switch (clientType()) { | ||
| case Grpc::ClientType::EnvoyGrpc: |
There was a problem hiding this comment.
@adisuissa figured out the approach to verify the number of streams for both Grpc client types. For Envoy Grpc number of total streams towards specific cluster equals to total number of H2 requests. Google Grpc actually exposes stream metrics per client instance, which is nice. Please let me know wdyt.
| @@ -1,3 +1,5 @@ | |||
| #include <iostream> | |||
Looks like there is merge conflict that needs to be fixed. Will work now on adding api feature switch in EDS config and extending tests for the config property. |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
The patch has been successfully tested in Spotify prod environment, @lizan |
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
|
@lizan @adisuissa @mattklein123 any chance this PR could get a final pass? I may not be able to keep it up to date in the near future due to other priorities. |
adisuissa
left a comment
There was a problem hiding this comment.
LGTM, thanks!
/assign-from @envoyproxy/senior-maintainers
|
@envoyproxy/senior-maintainers assignee is @lizan |
changelogs/current.yaml
Outdated
| - area: eds | ||
| change: | | ||
| added ``envoy.reloadable_features.multiplex_eds`` to enable eds multiplexing by default. |
There was a problem hiding this comment.
Can you say more about what this does? Also, can you flip the text about turning it off via runtime flag because I think this is on by default now?
This will need another rebase as we will land this after the release cut today due to risk. Thank you!
/wait
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
This reverts commit 97a7f00. There was a crash reported by go-control-plane attributed to this change: envoyproxy/go-control-plane#625 Signed-off-by: Greg Greenway <[email protected]>
This reverts commit 97a7f00. There was a crash reported by go-control-plane attributed to this change: envoyproxy/go-control-plane#625 Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
…y#25157) This reverts commit 97a7f00. There was a crash reported by go-control-plane attributed to this change: envoyproxy/go-control-plane#625 Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Kateryna Nezdolii [email protected]
Commit Message: Optimize streams allocation for Eds Grpc
Additional Description: With this change Eds Grpc subscriptions towards the same management server reuse a shared grpc connection, which greatly reduces the number of concurrent H2 streams towards a single management server. The work is built on top of 5026, with bugs fixed and an addition of integration test for Eds over Grpc. I could not port 3 existing Eds integration tests for warming clusters, as they were blocked by non graceful shutdown of GoogleAsyncClient.
The change has also been tested in production Spotify environment, we use Eds Grpc. Envoy successfully resolved cluster load assignments via optimized Eds.
Risk Level: High
Testing: Unit tests + integration tests
Docs Changes: TBD
Release Notes: TBD
Platform Specific Features:
Fixes #2943