Skip to content

Optimize EDS stream allocation#22419

Merged
mattklein123 merged 41 commits intoenvoyproxy:mainfrom
nezdolik:eds-optimization
Jan 18, 2023
Merged

Optimize EDS stream allocation#22419
mattklein123 merged 41 commits intoenvoyproxy:mainfrom
nezdolik:eds-optimization

Conversation

@nezdolik
Copy link
Copy Markdown
Member

@nezdolik nezdolik commented Jul 26, 2022

Signed-off-by: Kateryna Nezdolii [email protected]

Commit Message: Optimize streams allocation for Eds Grpc
Additional Description: With this change Eds Grpc subscriptions towards the same management server reuse a shared grpc connection, which greatly reduces the number of concurrent H2 streams towards a single management server. The work is built on top of 5026, with bugs fixed and an addition of integration test for Eds over Grpc. I could not port 3 existing Eds integration tests for warming clusters, as they were blocked by non graceful shutdown of GoogleAsyncClient.

The change has also been tested in production Spotify environment, we use Eds Grpc. Envoy successfully resolved cluster load assignments via optimized Eds.

Risk Level: High
Testing: Unit tests + integration tests
Docs Changes: TBD
Release Notes: TBD
Platform Specific Features:
Fixes #2943

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Copy link
Copy Markdown
Member Author

cc @htuch we have chatted about this change long time ago prior to my parental leave.

@htuch
Copy link
Copy Markdown
Member

htuch commented Jul 27, 2022

Thanks so much @nezdolik, this is a definitely long asked for feature from the community and will be a welcome contribution. Assigning @adisuissa for review, since Adi is now leading work around xDS architecture in Envoy.

Signed-off-by: Kateryna Nezdolii <[email protected]>
@alyssawilk
Copy link
Copy Markdown
Contributor

/wait on CI fixes

@adisuissa
Copy link
Copy Markdown
Contributor

Apologies for the late reply.
Thanks for working on this, and thanks for trying this in prod and giving us the feedback!
Overall I think we would like to have this support for any xDS service, not only for EDS.

I'll start by giving some background on where I think we should be heading, and then answer your questions.

In the future, as we move towards the new xDS-TP, the subscriptions to the same authority will essentially be multiplexed using this feature. Your optimized EDS stream allocation will work for each ADS subscription to a specific authority.
For now, we have a single gRPC stream when ADS is used (which doesn't need this multiplexing), and for non-ADS we have per-subscription stream.

IIUC the current design is: The cluster manager has a single EdsSubscriptionFactory which maps between config and a gRPC stream.
In the future, we will probably have a ConfigSubscriptionsManager in the server, that will own all the subscriptions, and a map between a config-source (and authority) to its actual GrpcMux. The GrpcMux can be configured to either SotW or Delta, and will abstract over it (just as is being done today).

For this PR I've got a few high-level comments to start with:

  1. Should the EdsSubscriptionFactory extend SubscriptionFactoryImpl, and just have a method to return the correct GrpcMux (overriding the current create methods, and calling them if the subscription isn't there already)?
  2. Can we use a different name for this factory, maybe MultiplexedSubscriptionFactory? (other ideas to make it more generic are welcome :) )
  3. Can we add some integration tests?
  4. For now I do agree that this factory should co-exist with the current (non-multiplexed) subscription factory, but we should gradually move away from using the original one. I think that introducing (and guarding) it for each xDS-type is the right way forward.

Addressing your questions:

Do we need support for Delta Grpc APi type?
Yes, we should support both SotW and Delta, but I don't think it will add too much to this implementation.

Do we need support for unified mux?
The unified-mux code essentially uses the same code-base for both SotW and Delta implementations, so yes we would like to support it. In the near future we will move to that code-base and abandon the old one. The unified mux code should implement the GrpcMux interface, so I don't think it will require much work.

Implement collectionSubscriptionFromUrl method? (part of udpa/resource federation)
Yes, it should be implemented, but I think it is very straight-forward.

Hide change behind feature flag in api? (as it is high risk)
I think this is highly desired, at least temporarily, as we want to allow the use of the current subscription code if we break anyone.

LMK if there's something unclear, and if I can help with anything.

Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Copy link
Copy Markdown
Member Author

nezdolik commented Aug 13, 2022

@adisuissa thanks for a very detailed feedback and apologies for late reply, I have been stuck in figuring out mocks fix for the remaining Eds tests.

I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?

I will start addressing your feedback in a series of commits and will ping you when all parts are done.

May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.

@adisuissa
Copy link
Copy Markdown
Contributor

I agree that MultiplexedSubscriptionFactory should be a general feature that could be enabled for any XDS type (since RDS and SDS currently expose the same perf problem for a large number of resources).We could start by implementing a generic MultiplexedSubscriptionFactory and exposing it as a config option in EDS api as a start, does that sound good?

Sounds good to me, thanks for all the efforts!

May need some help later with the integration test, I was not completely sure on which signal/stats to rely to verify that grpc channel is being multiplexed in the test(may figure it out later in the process). The original idea was to look at Envoy stats for successfully completed requests for an H2 Eds cluster, since in H2 each stream maps to a single logical request/response.

I wonder if the upstream statistics can be used to count the number of connections.
If you point me to the specific test, I can try to take a look.

@adisuissa
Copy link
Copy Markdown
Contributor

/wait-any

Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Copy link
Copy Markdown
Member Author

nezdolik commented Aug 19, 2022

Applied portion of review comments:

  • EdsSubscriptionFactory renamed to MultiplexedSubscriptionFactory
  • MultiplexedSubscriptionFactory extends SubscriptionFactoryImpl and overrides method for creating mux
  • MultiplexedSubscriptionFactory shares mux for subscriptions per management server per xds type url.
  • Updated MultiplexedSubscriptionFactory test

I will now work on integration test, add tests for new feature and make it parametrised for xds type (sotw, delta) and grpc flavour (envoy, google).
@adisuissa I am not sure if upstream stats for number of connections is good enough signal to rely on. Since mux manages a single grpc stream (according to docs) and grpc channel may have many streams.

@lizan lizan added the waiting label Aug 25, 2022
Kateryna Nezdolii added 2 commits September 4, 2022 21:56
EXPECT_EQ(0, test_server_->gauge("cluster.cluster_1.membership_total")->value());
EXPECT_EQ(0, test_server_->gauge("cluster.cluster_2.membership_total")->value());
switch (clientType()) {
case Grpc::ClientType::EnvoyGrpc:
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adisuissa figured out the approach to verify the number of streams for both Grpc client types. For Envoy Grpc number of total streams towards specific cluster equals to total number of H2 requests. Google Grpc actually exposes stream metrics per client instance, which is nice. Please let me know wdyt.

@@ -1,3 +1,5 @@
#include <iostream>
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be cleaned up

@nezdolik
Copy link
Copy Markdown
Member Author

nezdolik commented Sep 4, 2022

  • Added integration test to verify that discovery requests for multiple EDS clusters reuse one stream/mux towards same management server.

Looks like there is merge conflict that needs to be fixed. Will work now on adding api feature switch in EDS config and extending tests for the config property.

Kateryna Nezdolii added 4 commits September 5, 2022 19:01
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Copy link
Copy Markdown
Member Author

The patch has been successfully tested in Spotify prod environment, @lizan

@KBaichoo
Copy link
Copy Markdown
Contributor

friendly ping @lizan

I think this is waiting for your input.

@nezdolik this needs a main merge

/wait

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #22419 was synchronize by nezdolik.

see: more, trace.

Kateryna Nezdolii added 2 commits January 13, 2023 12:37
Signed-off-by: Kateryna Nezdolii <[email protected]>
@nezdolik
Copy link
Copy Markdown
Member Author

@lizan @adisuissa @mattklein123 any chance this PR could get a final pass? I may not be able to keep it up to date in the near future due to other priorities.

adisuissa
adisuissa previously approved these changes Jan 17, 2023
Copy link
Copy Markdown
Contributor

@adisuissa adisuissa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

/assign-from @envoyproxy/senior-maintainers

@repokitteh-read-only
Copy link
Copy Markdown

@envoyproxy/senior-maintainers assignee is @lizan

🐱

Caused by: a #22419 (review) was submitted by @adisuissa.

see: more, trace.

@mattklein123 mattklein123 self-assigned this Jan 17, 2023
Comment on lines +125 to +127
- area: eds
change: |
added ``envoy.reloadable_features.multiplex_eds`` to enable eds multiplexing by default.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say more about what this does? Also, can you flip the text about turning it off via runtime flag because I think this is on by default now?

This will need another rebase as we will land this after the release cut today due to risk. Thank you!

/wait

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Kateryna Nezdolii added 2 commits January 18, 2023 18:34
Signed-off-by: Kateryna Nezdolii <[email protected]>
Signed-off-by: Kateryna Nezdolii <[email protected]>
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@mattklein123 mattklein123 enabled auto-merge (squash) January 18, 2023 18:59
@mattklein123 mattklein123 merged commit 97a7f00 into envoyproxy:main Jan 18, 2023
ggreenway added a commit to ggreenway/envoy that referenced this pull request Jan 25, 2023
This reverts commit 97a7f00.

There was a crash reported by go-control-plane attributed to this change:
envoyproxy/go-control-plane#625

Signed-off-by: Greg Greenway <[email protected]>
ggreenway added a commit that referenced this pull request Jan 26, 2023
This reverts commit 97a7f00.

There was a crash reported by go-control-plane attributed to this change:
envoyproxy/go-control-plane#625

Signed-off-by: Greg Greenway <[email protected]>
VishalDamgude pushed a commit to freshworks-oss/envoy that referenced this pull request Feb 2, 2023
VishalDamgude pushed a commit to freshworks-oss/envoy that referenced this pull request Feb 2, 2023
…y#25157)

This reverts commit 97a7f00.

There was a crash reported by go-control-plane attributed to this change:
envoyproxy/go-control-plane#625

Signed-off-by: Greg Greenway <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize EDS stream allocation