Change RR to consider a subchannel in TF if it has failed since it was READY.#20245
Change RR to consider a subchannel in TF if it has failed since it was READY.#20245markdroth merged 1 commit intogrpc:masterfrom
Conversation
AspirinSJL
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @apolcyn)
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 388 at r1 (raw file):
do not change the aggregation state
Nit: might be clearer to say something like "do not report state changes" so that the "not changing state" behavior sounds more like a subchannel limitation other than aggregation state limitation for every subchannel.
test/cpp/end2end/client_lb_end2end_test.cc, line 1294 at r1 (raw file):
EXPECT_TRUE(WaitForChannelReady(channel.get())); // Now kill the servers. The channel should transition to TRANSIENT_FAILURE. // TODO(roth): This test should ideally check that even when the
This is unfortunate but I can't think of any easy way to tune the connection time either.
15493f3 to
6cbd610
Compare
markdroth
left a comment
There was a problem hiding this comment.
The impetus for this PR was a suggestion from Eric and Doug. I don't think they've made this change in Java or Go yet, but we'll probably do this as part of the xDS work, since we're thinking that we'll want to use this same aggregation algorithm across priorities in the xds policy.
Reviewable status:
complete! all files reviewed, all discussions resolved (waiting on @apolcyn)
src/core/ext/filters/client_channel/lb_policy/round_robin/round_robin.cc, line 388 at r1 (raw file):
Previously, AspirinSJL (Juanli Shen) wrote…
do not change the aggregation stateNit: might be clearer to say something like "do not report state changes" so that the "not changing state" behavior sounds more like a subchannel limitation other than aggregation state limitation for every subchannel.
I've updated the comment to hopefully make this clearer.
AspirinSJL
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r2.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @apolcyn)
6cbd610 to
01eeec7
Compare
AspirinSJL
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r3.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @apolcyn)
|
Known failures: #18840 |
This ensures that RR will report TRANSIENT_FAILURE (and thus fail pending RPCs instead of queuing them indefinitely) when all subchannels are failing, even if the subchannels are bouncing back and forth between CONNECTING and TRANSIENT_FAILURE.
Unfortunately, the tests I've added here don't actually prove that this change works correctly, because they pass even without the change. As noted in the TODO comments there, we currently don't actually seem to have a good way to test the case that this PR was meant to address (although if you can think of a way to do this, I'd love to hear it).
This change is