xds: XdsNR should be subscribing to clusters with XdsDepManager#12154
ejona86 merged 1 commit into grpc:master
This is missing behavior defined in gRFC A74:

> As per gRFC A31, the ConfigSelector gives each RPC a ref to the
> cluster that was selected for it to ensure that the cluster is not
> removed from the xds_cluster_manager LB policy config before the RPC
> is done with its LB picks. These cluster refs will also hold a
> subscription for the cluster from the XdsDependencyManager, so that
> the XdsDependencyManager will not stop watching the cluster resource
> until the cluster is removed from the xds_cluster_manager LB policy
> config.

Without the logic, RPCs can race and see the error:

> INTERNAL: CdsLb for cluster0: Unable to find non-dynamic root cluster

Fixes grpc#12152. This fixes the regression introduced in 297ab05
    if (clusterNameMap.containsKey(cluster)) {
      assert cluster.startsWith("cluster:");
      XdsConfig.Subscription subscription =
          xdsDependencyManager.subscribeToCluster(cluster.substring("cluster:".length()));
We are not retrieving the subscription in CdsLoadBalancer2. The clusterSubscription field is only ever assigned in the case of a dynamic cluster, and not otherwise. So for a non-dynamic cluster it will still be null and cause the "Unable to find non-dynamic root cluster" error?
What is the race condition?
RPCs use the route configuration that was current when they were created. But it takes time for them to progress through the filters, so the route configuration could be different by the time they reach the terminating router filter and do a pick. For that reason, XdsNameResolver already has reference counting to keep alive clusters that are only pointed to by old route configurations still in use by RPCs.
When a new route configuration is used that points to different clusters, the old clusters are removed from the XdsConfig, but XdsNR keeps the old CdsLB2 instances alive as long as RPCs still need them. Before A74, CdsLB2 would still have had an xdsClient watch for that cluster; with A74 but before this change, it instead receives the XdsConfig and sees that the cluster is missing. So as long as XdsNR is keeping the CdsLB2 instance alive, it also needs to keep the subscription to that cluster for XdsConfig.
This case was tested in FakeControlPlaneXdsIntegrationTest.changeClusterForRoute, which is the test that was flaky.
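For readers following along, the coupling described above (one cluster ref count that also owns the cluster subscription) can be sketched roughly as below. This is an illustrative sketch, not the code in this PR: `ClusterRefCounter`, `retain`, `release`, and `isSubscribed` are hypothetical names, and the nested `Subscription` and `DependencyManager` interfaces are simplified stand-ins for `XdsConfig.Subscription` and `XdsDependencyManager.subscribeToCluster`, whose real signatures may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a ref count per cluster that also holds the cluster subscription. */
final class ClusterRefCounter {
  /** Stand-in for XdsConfig.Subscription (assumed shape): closing it drops the watch. */
  interface Subscription extends AutoCloseable {
    @Override
    void close();
  }

  /** Stand-in for the subscribeToCluster call on XdsDependencyManager (assumed shape). */
  interface DependencyManager {
    Subscription subscribeToCluster(String clusterName);
  }

  private static final class Entry {
    final Subscription subscription;
    int refs;

    Entry(Subscription subscription) {
      this.subscription = subscription;
    }
  }

  private final DependencyManager manager;
  private final Map<String, Entry> entries = new HashMap<>();

  ClusterRefCounter(DependencyManager manager) {
    this.manager = manager;
  }

  /** Called when a route configuration or an in-flight RPC starts using the cluster. */
  synchronized void retain(String cluster) {
    Entry entry = entries.get(cluster);
    if (entry == null) {
      // First reference: subscribe so the cluster stays present in the config.
      entry = new Entry(manager.subscribeToCluster(cluster));
      entries.put(cluster, entry);
    }
    entry.refs++;
  }

  /** Called when the route configuration is replaced or the RPC finishes its picks. */
  synchronized void release(String cluster) {
    Entry entry = entries.get(cluster);
    if (entry == null) {
      throw new IllegalStateException("cluster not retained: " + cluster);
    }
    if (--entry.refs == 0) {
      // Last reference gone: only now is it safe to stop watching the cluster.
      entries.remove(cluster);
      entry.subscription.close();
    }
  }

  synchronized boolean isSubscribed(String cluster) {
    return entries.containsKey(cluster);
  }
}
```

In this sketch, a route update that drops `cluster0` calls `release` once, but the subscription stays open until every RPC created under the old route configuration has also released its ref, which is the window in which the flaky test saw the missing cluster.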