Skip to content

Simplify service config processing and fix config selector handling.#24114

Merged
markdroth merged 1 commit intogrpc:masterfrom
markdroth:config_selector_error_handling2
Sep 16, 2020
Merged

Simplify service config processing and fix config selector handling.#24114
markdroth merged 1 commit intogrpc:masterfrom
markdroth:config_selector_error_handling2

Conversation

@markdroth
Copy link
Copy Markdown
Member

@markdroth markdroth commented Sep 10, 2020

This PR accomplishes the following:

@markdroth markdroth added the release notes: no Indicates if PR should not be in release notes label Sep 10, 2020
@markdroth
Copy link
Copy Markdown
Member Author

@gnossen Could you help me figure out what's causing the python failures here? I'm sure it's a bug in my code here, but it only seems to be tripping tests in wrapped languages, so I can't just attack it with gdb the way that I normally would. I'd appreciate any help you can provide in diagnosing this. Thanks!

@gnossen
Copy link
Copy Markdown
Contributor

gnossen commented Sep 10, 2020

I would help in more depth, but I'm OOO today and possibly tomorrow as well. My recommendation would actually still be to use Bazel+gdb. It's just a bit more tricky with Python. Try using --run_under="gdb /usr/bin/python3.7" (or something like that; I can't seem to find this in my history). The critical ingredient is reconstructing Python stacks. Take a look at this for details on that. Once you have all that in place, you can debug pretty much the way you're used to for C++.

I understand that's a lot to get set up, so I'm willing to help in more depth once I'm back in the (metaphorical) office.

Copy link
Copy Markdown
Contributor

@donnadionne donnadionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this with my changes and they work

@markdroth
Copy link
Copy Markdown
Member Author

@gnossen I can't seem to reproduce the failures when I run locally, and I can't figure out how to build on RBE. I'd appreciate your help when you get back to the office.

@yashykt
Copy link
Copy Markdown
Member

yashykt commented Sep 11, 2020

What is the problem that we are trying to solve here?

@markdroth
Copy link
Copy Markdown
Member Author

@yashykt See the wrapped language test failures.

@yashykt
Copy link
Copy Markdown
Member

yashykt commented Sep 11, 2020

@markdroth Sorry for the confusion. I was asking about the aim for the PR. Is there a bug that is being fixed or is it just simplification?

@markdroth
Copy link
Copy Markdown
Member Author

@yashykt The impetus for this PR was a problem that @donnadionne ran into in #23659 where we would fail to retain the ConfigSelector when we retained the previous ServiceConfig. I fixed that by more tightly integrating the ServiceConfig and ConfigSelector error handling. But I also removed handling for some edge cases that no longer exist since #23051, due to the changes described in grpc/proposal#188.

I'll update the PR description to note all of this.

@yashykt
Copy link
Copy Markdown
Member

yashykt commented Sep 11, 2020

Thanks for the explanation!

@markdroth
Copy link
Copy Markdown
Member Author

It looks like the latest run triggered a use-after-free bug in the ASAN test. I'll look into that -- fixing that might also fix the wrapped language problem.

@markdroth
Copy link
Copy Markdown
Member Author

Doesn't look like the use-after-free bug was related to the wrapped language problem.

@gnossen, I'd appreciate any help you can give me with this when you get back. Thanks!

@gnossen
Copy link
Copy Markdown
Contributor

gnossen commented Sep 15, 2020

@markdroth

$ virualenv venv -p python3
$ source venv/bin/activate
$ bazel run -c dbg  --run_under="gdb -ex run --args $(which python3) " //src/python/grpcio_tests/tests/unit:_channel_close_test.python3
...
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.
0x00007ffff68c6f90 in grpc_core::ServiceConfig::GetMethodParsedConfigVector (this=0xbc10e0, path=...) at src/core/ext/filters/client_channel/service_config.cc:213
213     src/core/ext/filters/client_channel/service_config.cc: No such file or directory.
(gdb) dir /usr/local/google/home/rbellevi/Dev/grpc
Source directories searched: /usr/local/google/home/rbellevi/Dev/grpc:$cdir:$cwd
(gdb) list
208       // If we didn't find a match for the path, try looking for a wildcard
209       // entry (i.e., change "/service/method" to "/service/").
210       UniquePtr<char> path_str(grpc_slice_to_c_string(path));
211       char* sep = strrchr(path_str.get(), '/') + 1;
212       if (sep == nullptr) return nullptr;  // Shouldn't ever happen.
213       *sep = '\0';
214       grpc_slice wildcard_path = grpc_slice_from_static_string(path_str.get());
215       it = parsed_method_configs_map_.find(wildcard_path);
216       if (it != parsed_method_configs_map_.end()) return it->second;
217       // Try default method config, if set.
(gdb) p sep
$1 = 0x1 <error: Cannot access memory at address 0x1>
(gdb) p path_str.get()
$2 = 0xbb03b0 "Meffod"

This nullptr check was never valid because if strrchr returns nullptr, then sep will be 0x1 by the time the check happens.

I tried to instrument ServiceConfig::GetMethodParsedConfigVector on master to see what the value of path was since Meffod doesn't appear to be a full valid method, but it seems that the _channel_close_test isn't calling this function at all on master. Something in this PR must be calling this method in the wrapped languages for the first time.

Hope this helps!

@markdroth
Copy link
Copy Markdown
Member Author

Thanks, Richard, that was exactly the information I needed!

Yes, it is expected that this method is being called now when it was not before, because we now set the default service config to {} instead of leaving it null. And there's an obvious pre-existing (but still my fault :)) bug there in handling method names that don't have slashes, which I've now fixed.

This discussion also made me realize that we should short-circuit this query completely if the map is empty (which happens commonly when we use the default service config), so I've done that as well. And I've cleaned up several other places where we were still checking for a null service config.

Let's see if everything passes now...

@markdroth
Copy link
Copy Markdown
Member Author

Yup, looks like everything is happy now!

@yashykt, I'd like your review before I merge this. Thanks!

Copy link
Copy Markdown
Member

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with clarifying questions and minor doc nits?

default_service_config_.reset();
return;
}
if (service_config_json == nullptr) service_config_json = "{}";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if initially the resolver does not provide a service config, but later provides an incorrect service config?
In that case, are we to continue using the previously used "no service config" or are we to go into transient failure?

Also, please add a comment around the semantics of default_service_config_ that no default service config provided as an arg means an empty service config being used as the default

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the resolver returns a service config error on its first result, we go into TRANSIENT_FAILURE, and we stay there until it returns a non-error service config (i.e., either no service config or a valid service config).

If the resolver does not provide a service config, the default service config will be used, but otherwise this is considered the same as if it returned a valid config.

Once we have a valid config, if the resolver subsequently returns a service config error, we keep using the previous valid config.

I've added a comment about default_service_config_.

call_config.method_configs =
service_config_->GetMethodParsedConfigVector(*args.path);
}
call_config.method_configs =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add
GPR_DEBUG_ASSERT(service_config_ != nullptr) here with a comment as to why it will never happen?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ChannelConfigHelper::ApplyServiceConfigResult service_config_result;
MaybeAddTraceMessagesForAddressChangesLocked(!result.addresses.empty(),
&trace_strings);
// Hold a ref to the error so that the entry in trace_strings lives
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment but since Resolver::Result::~Result() unrefs service_config_error, I'm not worried about the error being unreffed atleast.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result of grpc_error_string(), which we're storing in trace_strings, is a pointer to a string that is owned by the error itself. However, we transfer ownership of the Resolver::Result object containing the error to the LB policy on line 336, and the LB policy will destroy it before that call returns. So we need to hold a ref to the error until line 349, where we're actually using trace_strings.

I've tried to clarify the comment here.

@markdroth markdroth force-pushed the config_selector_error_handling2 branch from 426e653 to 84c8f70 Compare September 15, 2020 23:50
@markdroth
Copy link
Copy Markdown
Member Author

The android failure is an infrastructure timeout.

@markdroth markdroth merged commit a11e4df into grpc:master Sep 16, 2020
@markdroth markdroth deleted the config_selector_error_handling2 branch September 16, 2020 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release notes: no Indicates if PR should not be in release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants