Skip to content

Fixing TryNextPrioirtyLocked to ensure we properly try higher priorities when lower ones fail#25139

Merged
donnadionne merged 1 commit intogrpc:masterfrom
donnadionne:switch_bug
Jan 16, 2021
Merged

Fixing TryNextPrioirtyLocked to ensure we properly try higher priorities when lower ones fail#25139
donnadionne merged 1 commit intogrpc:masterfrom
donnadionne:switch_bug

Conversation

@donnadionne
Copy link
Copy Markdown
Contributor

@donnadionne donnadionne commented Jan 12, 2021


This change is Reviewable

WaitForBackend(3, false);
ShutdownBackend(3);
ShutdownBackend(0);
WaitForBackend(1, false);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We fail here, backend 1 never gets rpcs

if (child_priority == UINT32_MAX) return;
// Ignore lower-than-current priorities.
if (child_priority > current_priority_) return;
// If a child reports TRANSIENT_FAILURE, start trying the next priority.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this down as it prevents us from calling TryNextPrioirityLocked


void PriorityLb::TryNextPriorityLocked(bool report_connecting) {
current_priority_ = UINT32_MAX;
current_child_from_before_update_ = nullptr;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving these up because otherwises the current_prioirity_ could be retained as this method returns, preventing us from doing a SelectPrioirtyLocked in HandleChildConnectivityStateChangeLocked

if (child_priority > current_priority_) return;
// The update is for a higher-than-current priority (or for any
// priority if we don't have any current priority).
if (child_priority < current_priority_) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not entered if current_prioirty_ is not unset right away in TryNextPrioirtyLocked()

@donnadionne donnadionne changed the title Strange failure Fixing TryNextPrioirtyLocked to ensure we properly try higher priorities when lower ones fail Jan 13, 2021
@donnadionne donnadionne added the release notes: no Indicates if PR should not be in release notes label Jan 13, 2021
Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding and fixing this!

I think the second part of this change is exactly right, but I don't think the first part is. It's not clear to me that it actually solves any problem, and it looks like it's not actually necessary.

Please let me know if you have any questions. Thanks!

Reviewed 2 of 2 files at r2.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @donnadionne)


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 368 at r2 (raw file):

    gpr_log(GPR_INFO,
            "[priority_lb %p] state update for priority %d, child %s, current "
            "priority %d",

This should be %u instead of %d, since the value is unsigned.

Looks like this same bug already exists on the previous line, and probably elsewhere in the file. Can you do a quick audit to fix these as long as you're in here? Thanks!


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 373 at r2 (raw file):

Previously, donnadionne wrote…

Moving this down as it prevents us from calling TryNextPrioirityLocked

This change looks wrong to me. I don't think we should call TryNextPriorityLocked() if this state change is for a priority lower than the current priority. If the current priority is READY, then the state change for a lower-priority child is irrelevant. And if the current priority is not READY, then we will already have called TryNextPriorityLocked() when we got the state change from the current priority. In either case, calling TryNextPriorityLocked() for a lower-priority child doesn't make sense.

Is this change actually necessary to fix the bug you were seeing? If I understand the problem correctly, I think the change you made in TryNextPriorityLocked() should be enough to fix the problem by itself.

If the other change is not enough to fix the bug by itself, then I'd like to understand the exact sequence of events that's triggering the bug.


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 418 at r2 (raw file):

Previously, donnadionne wrote…

Moving these up because otherwises the current_prioirity_ could be retained as this method returns, preventing us from doing a SelectPrioirtyLocked in HandleChildConnectivityStateChangeLocked

Resetting current_priority_ at the top of this function makes sense (and I think that's actually the real bug here). However, I think resetting current_child_from_before_update_ should be left where it was, because until we actually give up on connecting to any of the children in the new update, we want HandleChildConnectivityStateChangeLocked() to continue to be able to pass picker updates for that child up to our parent.


test/cpp/end2end/xds_end2end_test.cc, line 6344 at r2 (raw file):

  balancers_[0]->ads_service()->SetEdsResource(
      BuildEdsResource(args, DefaultEdsServiceName()));
  WaitForBackend(3, true);

No need for the second parameter here, since that's the default value.

Copy link
Copy Markdown
Contributor Author

@donnadionne donnadionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed! you are right the first change was not needed. After the second change, current_prioirty_ gets updated properly in SelectPriorityLocked.

Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @donnadionne and @markdroth)


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 368 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should be %u instead of %d, since the value is unsigned.

Looks like this same bug already exists on the previous line, and probably elsewhere in the file. Can you do a quick audit to fix these as long as you're in here? Thanks!

fixed all in the file


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 373 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This change looks wrong to me. I don't think we should call TryNextPriorityLocked() if this state change is for a priority lower than the current priority. If the current priority is READY, then the state change for a lower-priority child is irrelevant. And if the current priority is not READY, then we will already have called TryNextPriorityLocked() when we got the state change from the current priority. In either case, calling TryNextPriorityLocked() for a lower-priority child doesn't make sense.

Is this change actually necessary to fix the bug you were seeing? If I understand the problem correctly, I think the change you made in TryNextPriorityLocked() should be enough to fix the problem by itself.

If the other change is not enough to fix the bug by itself, then I'd like to understand the exact sequence of events that's triggering the bug.

yes, just the one line chagne in TryNextPrioirtyLocked to unset the current_prioirity_ is enough


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 387 at r2 (raw file):

Previously, donnadionne wrote…

This is not entered if current_prioirty_ is not unset right away in TryNextPrioirtyLocked()

correct, it will not be entered after the unset


src/core/ext/filters/client_channel/lb_policy/priority/priority.cc, line 418 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Resetting current_priority_ at the top of this function makes sense (and I think that's actually the real bug here). However, I think resetting current_child_from_before_update_ should be left where it was, because until we actually give up on connecting to any of the children in the new update, we want HandleChildConnectivityStateChangeLocked() to continue to be able to pass picker updates for that child up to our parent.

ok, just one line change, the other line restored


test/cpp/end2end/xds_end2end_test.cc, line 6344 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for the second parameter here, since that's the default value.

Done.

Copy link
Copy Markdown
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for fixing this!

Reviewed 2 of 2 files at r3.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @donnadionne)

@donnadionne
Copy link
Copy Markdown
Contributor Author

Known issues: #25097

@donnadionne
Copy link
Copy Markdown
Contributor Author

Known issues: #25100

@donnadionne donnadionne merged commit d17ef6b into grpc:master Jan 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants