xds: XdsNR should be subscribing to clusters with XdsDepManager #12154

Merged
ejona86 merged 1 commit into grpc:master from ejona86:xdsnr-needs-subscription
Jun 17, 2025

Member

@ejona86 ejona86 commented Jun 13, 2025

This is missing behavior defined in gRFC A74:

> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.

Without this logic, RPCs can race and see the error:

> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster

Fixes #12152. This fixes the regression introduced in 297ab05.

@ejona86 ejona86 requested a review from kannanjgithub June 13, 2025 22:38
if (clusterNameMap.containsKey(cluster)) {
  assert cluster.startsWith("cluster:");
  XdsConfig.Subscription subscription =
      xdsDependencyManager.subscribeToCluster(cluster.substring("cluster:".length()));
Contributor

We are not retrieving the subscription in CdsLoadBalancer2. The clusterSubscription field is only ever assigned in the case of a dynamic cluster, and not otherwise. So for a non-dynamic cluster, will it still be null and cause the "Unable to find non-dynamic root cluster" error?
What is the race condition?

Member Author

RPCs use the route configuration that was current when they were created. But it takes time for them to progress through the filters, so the route configuration could be different by the time they reach the terminating filter router and do a pick. For that reason, XdsNameResolver already has reference counting to keep alive clusters that are only pointed to by old route configurations still in use by RPCs.

When a new route configuration is used that points to different clusters, the old clusters will be removed from the XdsConfig, but XdsNR will keep the old CdsLB2 instances alive as long as RPCs still need them. Before A74, CdsLB2 would still have had its own xdsClient watch for that cluster. After A74 but before this change, it instead receives the XdsConfig and sees that the cluster is missing. So as long as XdsNR is keeping the CdsLB2 instance alive, it also needs to keep the subscription to that cluster for XdsConfig.

This case was tested in FakeControlPlaneXdsIntegrationTest.changeClusterForRoute, which is the test that was flaky.
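
As a rough illustration of the idea above (not the actual grpc-java code), the sketch below pairs a per-cluster reference count with a subscription that is opened on the first ref and closed on the last release. Every name in it (`ClusterRefCountingSketch`, `FakeDependencyManager`, `ClusterRefs`, `retain`, `release`, `isWatching`) is hypothetical and chosen for illustration. Only the `subscribeToCluster` method name mirrors the diff, and its behavior here is an assumption. The sketch is single-threaded; real code would need to serialize access.

```java
import java.util.HashMap;
import java.util.Map;

final class ClusterRefCountingSketch {
  /** Hypothetical handle: closing it releases one subscription. */
  interface Subscription {
    void close();
  }

  /** Hypothetical stand-in for a dependency manager that watches cluster resources. */
  static final class FakeDependencyManager {
    private final Map<String, Integer> subscribers = new HashMap<>();

    Subscription subscribeToCluster(String name) {
      subscribers.merge(name, 1, Integer::sum);
      boolean[] closed = {false};
      return () -> {
        if (closed[0]) {
          return; // closing twice has no further effect
        }
        closed[0] = true;
        // Drop the entry entirely once the last subscriber goes away.
        subscribers.computeIfPresent(name, (k, v) -> v == 1 ? null : v - 1);
      };
    }

    boolean isWatching(String name) {
      return subscribers.containsKey(name);
    }
  }

  /** A ref count bundled with the subscription that keeps the cluster watched. */
  private static final class ClusterRef {
    final Subscription subscription;
    int refCount;

    ClusterRef(Subscription subscription) {
      this.subscription = subscription;
    }
  }

  /** Hypothetical ref-count table, as a name resolver might keep per cluster. */
  static final class ClusterRefs {
    private final FakeDependencyManager depManager;
    private final Map<String, ClusterRef> refs = new HashMap<>();

    ClusterRefs(FakeDependencyManager depManager) {
      this.depManager = depManager;
    }

    /** Called when a route configuration or an in-flight RPC starts using the cluster. */
    void retain(String cluster) {
      ClusterRef ref = refs.get(cluster);
      if (ref == null) {
        // First ref: subscribe so the cluster stays in the generated config.
        ref = new ClusterRef(depManager.subscribeToCluster(cluster));
        refs.put(cluster, ref);
      }
      ref.refCount++;
    }

    /** Called when the route configuration changes or the RPC finishes its picks. */
    void release(String cluster) {
      ClusterRef ref = refs.get(cluster);
      if (ref == null) {
        throw new IllegalStateException("release without retain: " + cluster);
      }
      if (--ref.refCount == 0) {
        refs.remove(cluster);
        ref.subscription.close(); // last ref gone: stop watching the cluster
      }
    }
  }
}
```

In this sketch, if the route configuration and one RPC both retain `cluster0` and the route configuration then releases it, the cluster is still watched until the RPC also releases its ref.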

Contributor

Understood, thanks.

@ejona86 ejona86 merged commit 2604ce8 into grpc:master Jun 17, 2025
15 of 16 checks passed
@ejona86 ejona86 deleted the xdsnr-needs-subscription branch June 17, 2025 13:36
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Sep 16, 2025

Development

Successfully merging this pull request may close these issues.

XdsDepManager generated invalid configuration