Simplify service config processing and fix config selector handling. by markdroth · Pull Request #24114 · grpc/grpc

markdroth · 2020-09-10T18:59:06Z

This PR accomplishes the following:

It fixes a problem that @donnadionne ran into in Moving XDS Routing functionalities from LB policy into XdsConfigSelector #23659 where, when the resolver returned a service config error, we retained the old ServiceConfig but failed to retain the old ConfigSelector. I fixed that by more tightly integrating the ServiceConfig and ConfigSelector handling so that we always handle them in the same places and in the same ways. (This is a better alternative to If no new ServiceConfig and no new ConfigSelector, stick with old ConfigSelector. #24091, which I will revert in favor of this.)
Simplifies the service config handling logic, eliminating some edge cases that no longer exist since Add ConfigSelector API. #23051, due to the changes described in A21 update: Do not use default service config on transient resolver failure proposal#188. This brings us closer to where we ultimately want to be with the client channel principles revamp design.

markdroth · 2020-09-10T23:22:34Z

@gnossen Could you help me figure out what's causing the python failures here? I'm sure it's a bug in my code here, but it only seems to be tripping tests in wrapped languages, so I can't just attack it with gdb the way that I normally would. I'd appreciate any help you can provide in diagnosing this. Thanks!

gnossen · 2020-09-10T23:36:59Z

I would help in more depth, but I'm OOO today and possibly tomorrow as well. My recommendation would actually still be to use Bazel+gdb. It's just a bit more tricky with Python. Try using --run_under="gdb /usr/bin/python3.7" (or something like that; I can't seem to find this in my history). The critical ingredient is reconstructing Python stacks. Take a look at this for details on that. Once you have all that in place, you can debug pretty much the way you're used to for C++.

I understand that's a lot to get set up, so I'm willing to help in more depth once I'm back in the (metaphorical) office.

donnadionne

I tested this with my changes and they work

markdroth · 2020-09-11T15:38:38Z

@gnossen I can't seem to reproduce the failures when I run locally, and I can't figure out how to build on RBE. I'd appreciate your help when you get back to the office.

yashykt · 2020-09-11T16:16:35Z

What is the problem that we are trying to solve here?

markdroth · 2020-09-11T16:31:18Z

@yashykt See the wrapped language test failures.

yashykt · 2020-09-11T16:34:33Z

@markdroth Sorry for the confusion. I was asking about the aim for the PR. Is there a bug that is being fixed or is it just simplification?

markdroth · 2020-09-11T16:37:54Z

@yashykt The impetus for this PR was a problem that @donnadionne ran into in #23659 where we would fail to retain the ConfigSelector when we retained the previous ServiceConfig. I fixed that by more tightly integrating the ServiceConfig and ConfigSelector error handling. But I also removed handling for some edge cases that no longer exist since #23051, due to the changes described in grpc/proposal#188.

I'll update the PR description to note all of this.

yashykt · 2020-09-11T16:58:59Z

Thanks for the explanation!

markdroth · 2020-09-11T22:20:12Z

It looks like the latest run triggered a use-after-free bug in the ASAN test. I'll look into that -- fixing that might also fix the wrapped language problem.

markdroth · 2020-09-11T23:56:00Z

Doesn't look like the use-after-free bug was related to the wrapped language problem.

@gnossen, I'd appreciate any help you can give me with this when you get back. Thanks!

gnossen · 2020-09-15T01:44:38Z

@markdroth

$ virualenv venv -p python3
$ source venv/bin/activate
$ bazel run -c dbg  --run_under="gdb -ex run --args $(which python3) " //src/python/grpcio_tests/tests/unit:_channel_close_test.python3
...
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff68c6f90 in grpc_core::ServiceConfig::GetMethodParsedConfigVector (this=0xbc10e0, path=...) at src/core/ext/filters/client_channel/service_config.cc:213
213     src/core/ext/filters/client_channel/service_config.cc: No such file or directory.
(gdb) dir /usr/local/google/home/rbellevi/Dev/grpc
Source directories searched: /usr/local/google/home/rbellevi/Dev/grpc:$cdir:$cwd
(gdb) list
208       // If we didn't find a match for the path, try looking for a wildcard
209       // entry (i.e., change "/service/method" to "/service/").
210       UniquePtr<char> path_str(grpc_slice_to_c_string(path));
211       char* sep = strrchr(path_str.get(), '/') + 1;
212       if (sep == nullptr) return nullptr;  // Shouldn't ever happen.
213       *sep = '\0';
214       grpc_slice wildcard_path = grpc_slice_from_static_string(path_str.get());
215       it = parsed_method_configs_map_.find(wildcard_path);
216       if (it != parsed_method_configs_map_.end()) return it->second;
217       // Try default method config, if set.
(gdb) p sep
$1 = 0x1 <error: Cannot access memory at address 0x1>
(gdb) p path_str.get()
$2 = 0xbb03b0 "Meffod"

This nullptr check was never valid because if strrchr returns nullptr, then sep will be 0x1 by the time the check happens.

I tried to instrument ServiceConfig::GetMethodParsedConfigVector on master to see what the value of path was since Meffod doesn't appear to be a full valid method, but it seems that the _channel_close_test isn't calling this function at all on master. Something in this PR must be calling this method in the wrapped languages for the first time.

Hope this helps!

markdroth · 2020-09-15T15:39:10Z

Thanks, Richard, that was exactly the information I needed!

Yes, it is expected that this method is being called now when it was not before, because we now set the default service config to {} instead of leaving it null. And there's an obvious pre-existing (but still my fault :)) bug there in handling method names that don't have slashes, which I've now fixed.

This discussion also made me realize that we should short-circuit this query completely if the map is empty (which happens commonly when we use the default service config), so I've done that as well. And I've cleaned up several other places where we were still checking for a null service config.

Let's see if everything passes now...

markdroth · 2020-09-15T22:04:09Z

Yup, looks like everything is happy now!

@yashykt, I'd like your review before I merge this. Thanks!

yashykt

approved with clarifying questions and minor doc nits?

yashykt · 2020-09-15T23:02:42Z

src/core/ext/filters/client_channel/client_channel.cc

-      default_service_config_.reset();
-      return;
-    }
+  if (service_config_json == nullptr) service_config_json = "{}";


what happens if initially the resolver does not provide a service config, but later provides an incorrect service config?
In that case, are we to continue using the previously used "no service config" or are we to go into transient failure?

Also, please add a comment around the semantics of default_service_config_ that no default service config provided as an arg means an empty service config being used as the default

If the resolver returns a service config error on its first result, we go into TRANSIENT_FAILURE, and we stay there until it returns a non-error service config (i.e., either no service config or a valid service config).

If the resolver does not provide a service config, the default service config will be used, but otherwise this is considered the same as if it returned a valid config.

Once we have a valid config, if the resolver subsequently returns a service config error, we keep using the previous valid config.

I've added a comment about default_service_config_.

src/core/ext/filters/client_channel/client_channel.cc

yashykt · 2020-09-15T23:16:38Z

src/core/ext/filters/client_channel/config_selector.h

-      call_config.method_configs =
-          service_config_->GetMethodParsedConfigVector(*args.path);
-    }
+    call_config.method_configs =


Can we add
GPR_DEBUG_ASSERT(service_config_ != nullptr) here with a comment as to why it will never happen?

yashykt · 2020-09-15T23:23:28Z

src/core/ext/filters/client_channel/resolving_lb_policy.cc

-  ChannelConfigHelper::ApplyServiceConfigResult service_config_result;
+  MaybeAddTraceMessagesForAddressChangesLocked(!result.addresses.empty(),
+                                               &trace_strings);
+  // Hold a ref to the error so that the entry in trace_strings lives


I don't understand this comment but since Resolver::Result::~Result() unrefs service_config_error, I'm not worried about the error being unreffed atleast.

The result of grpc_error_string(), which we're storing in trace_strings, is a pointer to a string that is owned by the error itself. However, we transfer ownership of the Resolver::Result object containing the error to the LB policy on line 336, and the LB policy will destroy it before that call returns. So we need to hold a ref to the error until line 349, where we're actually using trace_strings.

I've tried to clarify the comment here.

markdroth · 2020-09-16T22:04:54Z

The android failure is an infrastructure timeout.

markdroth added the release notes: no Indicates if PR should not be in release notes label Sep 10, 2020

markdroth requested review from donnadionne and yashykt September 10, 2020 18:59

donnadionne approved these changes Sep 11, 2020

View reviewed changes

markdroth mentioned this pull request Sep 11, 2020

If no new ServiceConfig and no new ConfigSelector, stick with old ConfigSelector. #24091

Closed

yashykt approved these changes Sep 15, 2020

View reviewed changes

Simplify service config processing and fix config selector handling.

84c8f70

markdroth force-pushed the config_selector_error_handling2 branch from 426e653 to 84c8f70 Compare September 15, 2020 23:50

markdroth merged commit a11e4df into grpc:master Sep 16, 2020

markdroth deleted the config_selector_error_handling2 branch September 16, 2020 22:05

markdroth mentioned this pull request Sep 18, 2020

Fix channel to apply new ConfigSelector even if ServiceConfig hasn't changed. #24191

Merged

Conversation

markdroth commented Sep 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markdroth commented Sep 10, 2020

Uh oh!

gnossen commented Sep 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

donnadionne left a comment

Choose a reason for hiding this comment

Uh oh!

markdroth commented Sep 11, 2020

Uh oh!

yashykt commented Sep 11, 2020

Uh oh!

markdroth commented Sep 11, 2020

Uh oh!

yashykt commented Sep 11, 2020

Uh oh!

markdroth commented Sep 11, 2020

Uh oh!

yashykt commented Sep 11, 2020

Uh oh!

markdroth commented Sep 11, 2020

Uh oh!

markdroth commented Sep 11, 2020

Uh oh!

gnossen commented Sep 15, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

markdroth commented Sep 15, 2020

Uh oh!

markdroth commented Sep 15, 2020

Uh oh!

yashykt left a comment

Choose a reason for hiding this comment

Uh oh!

yashykt Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

markdroth Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yashykt Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

markdroth Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

yashykt Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

markdroth Sep 15, 2020

Choose a reason for hiding this comment

Uh oh!

markdroth commented Sep 16, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

markdroth commented Sep 10, 2020 •

edited

Loading

gnossen commented Sep 10, 2020 •

edited

Loading

gnossen commented Sep 15, 2020 •

edited

Loading