Fixing TryNextPrioirtyLocked to ensure we properly try higher priorities when lower ones fail#25139
Conversation
| WaitForBackend(3, false); | ||
| ShutdownBackend(3); | ||
| ShutdownBackend(0); | ||
| WaitForBackend(1, false); |
There was a problem hiding this comment.
We fail here, backend 1 never gets rpcs
| if (child_priority == UINT32_MAX) return; | ||
| // Ignore lower-than-current priorities. | ||
| if (child_priority > current_priority_) return; | ||
| // If a child reports TRANSIENT_FAILURE, start trying the next priority. |
There was a problem hiding this comment.
Moving this down as it prevents us from calling TryNextPrioirityLocked
|
|
||
| void PriorityLb::TryNextPriorityLocked(bool report_connecting) { | ||
| current_priority_ = UINT32_MAX; | ||
| current_child_from_before_update_ = nullptr; |
There was a problem hiding this comment.
Moving these up because otherwises the current_prioirity_ could be retained as this method returns, preventing us from doing a SelectPrioirtyLocked in HandleChildConnectivityStateChangeLocked
| if (child_priority > current_priority_) return; | ||
| // The update is for a higher-than-current priority (or for any | ||
| // priority if we don't have any current priority). | ||
| if (child_priority < current_priority_) { |
There was a problem hiding this comment.
This is not entered if current_prioirty_ is not unset right away in TryNextPrioirtyLocked()
markdroth
left a comment
There was a problem hiding this comment.
Thanks for finding and fixing this!
I think the second part of this change is exactly right, but I don't think the first part is. It's not clear to me that it actually solves any problem, and it looks like it's not actually necessary.
Please let me know if you have any questions. Thanks!
Reviewed 2 of 2 files at r2.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @donnadionne)
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 368 at r2 (raw file):
gpr_log(GPR_INFO, "[priority_lb %p] state update for priority %d, child %s, current " "priority %d",
This should be %u instead of %d, since the value is unsigned.
Looks like this same bug already exists on the previous line, and probably elsewhere in the file. Can you do a quick audit to fix these as long as you're in here? Thanks!
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 373 at r2 (raw file):
Previously, donnadionne wrote…
Moving this down as it prevents us from calling TryNextPrioirityLocked
This change looks wrong to me. I don't think we should call TryNextPriorityLocked() if this state change is for a priority lower than the current priority. If the current priority is READY, then the state change for a lower-priority child is irrelevant. And if the current priority is not READY, then we will already have called TryNextPriorityLocked() when we got the state change from the current priority. In either case, calling TryNextPriorityLocked() for a lower-priority child doesn't make sense.
Is this change actually necessary to fix the bug you were seeing? If I understand the problem correctly, I think the change you made in TryNextPriorityLocked() should be enough to fix the problem by itself.
If the other change is not enough to fix the bug by itself, then I'd like to understand the exact sequence of events that's triggering the bug.
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 418 at r2 (raw file):
Previously, donnadionne wrote…
Moving these up because otherwises the current_prioirity_ could be retained as this method returns, preventing us from doing a SelectPrioirtyLocked in HandleChildConnectivityStateChangeLocked
Resetting current_priority_ at the top of this function makes sense (and I think that's actually the real bug here). However, I think resetting current_child_from_before_update_ should be left where it was, because until we actually give up on connecting to any of the children in the new update, we want HandleChildConnectivityStateChangeLocked() to continue to be able to pass picker updates for that child up to our parent.
test/cpp/end2end/xds_end2end_test.cc, line 6344 at r2 (raw file):
balancers_[0]->ads_service()->SetEdsResource( BuildEdsResource(args, DefaultEdsServiceName())); WaitForBackend(3, true);
No need for the second parameter here, since that's the default value.
…ies when lower ones fail
cd6176f to
4715bbc
Compare
donnadionne
left a comment
There was a problem hiding this comment.
fixed! you are right the first change was not needed. After the second change, current_prioirty_ gets updated properly in SelectPriorityLocked.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @donnadionne and @markdroth)
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 368 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This should be
%uinstead of%d, since the value is unsigned.Looks like this same bug already exists on the previous line, and probably elsewhere in the file. Can you do a quick audit to fix these as long as you're in here? Thanks!
fixed all in the file
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 373 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
This change looks wrong to me. I don't think we should call
TryNextPriorityLocked()if this state change is for a priority lower than the current priority. If the current priority is READY, then the state change for a lower-priority child is irrelevant. And if the current priority is not READY, then we will already have calledTryNextPriorityLocked()when we got the state change from the current priority. In either case, callingTryNextPriorityLocked()for a lower-priority child doesn't make sense.Is this change actually necessary to fix the bug you were seeing? If I understand the problem correctly, I think the change you made in
TryNextPriorityLocked()should be enough to fix the problem by itself.If the other change is not enough to fix the bug by itself, then I'd like to understand the exact sequence of events that's triggering the bug.
yes, just the one line chagne in TryNextPrioirtyLocked to unset the current_prioirity_ is enough
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 387 at r2 (raw file):
Previously, donnadionne wrote…
This is not entered if current_prioirty_ is not unset right away in TryNextPrioirtyLocked()
correct, it will not be entered after the unset
src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 418 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
Resetting
current_priority_at the top of this function makes sense (and I think that's actually the real bug here). However, I think resettingcurrent_child_from_before_update_should be left where it was, because until we actually give up on connecting to any of the children in the new update, we wantHandleChildConnectivityStateChangeLocked()to continue to be able to pass picker updates for that child up to our parent.
ok, just one line change, the other line restored
test/cpp/end2end/xds_end2end_test.cc, line 6344 at r2 (raw file):
Previously, markdroth (Mark D. Roth) wrote…
No need for the second parameter here, since that's the default value.
Done.
markdroth
left a comment
There was a problem hiding this comment.
Looks great! Thanks for fixing this!
Reviewed 2 of 2 files at r3.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @donnadionne)
|
Known issues: #25097 |
|
Known issues: #25100 |
This change is