Simplify service config processing and fix config selector handling.#24114
Conversation
|
@gnossen Could you help me figure out what's causing the python failures here? I'm sure it's a bug in my code here, but it only seems to be tripping tests in wrapped languages, so I can't just attack it with gdb the way that I normally would. I'd appreciate any help you can provide in diagnosing this. Thanks! |
|
I would help in more depth, but I'm OOO today and possibly tomorrow as well. My recommendation would actually still be to use Bazel+gdb. It's just a bit more tricky with Python. Try using I understand that's a lot to get set up, so I'm willing to help in more depth once I'm back in the (metaphorical) office. |
donnadionne
left a comment
There was a problem hiding this comment.
I tested this with my changes and they work
|
@gnossen I can't seem to reproduce the failures when I run locally, and I can't figure out how to build on RBE. I'd appreciate your help when you get back to the office. |
|
What is the problem that we are trying to solve here? |
|
@yashykt See the wrapped language test failures. |
|
@markdroth Sorry for the confusion. I was asking about the aim for the PR. Is there a bug that is being fixed or is it just simplification? |
|
@yashykt The impetus for this PR was a problem that @donnadionne ran into in #23659 where we would fail to retain the ConfigSelector when we retained the previous ServiceConfig. I fixed that by more tightly integrating the ServiceConfig and ConfigSelector error handling. But I also removed handling for some edge cases that no longer exist since #23051, due to the changes described in grpc/proposal#188. I'll update the PR description to note all of this. |
|
Thanks for the explanation! |
|
It looks like the latest run triggered a use-after-free bug in the ASAN test. I'll look into that -- fixing that might also fix the wrapped language problem. |
|
Doesn't look like the use-after-free bug was related to the wrapped language problem. @gnossen, I'd appreciate any help you can give me with this when you get back. Thanks! |
This I tried to instrument Hope this helps! |
|
Thanks, Richard, that was exactly the information I needed! Yes, it is expected that this method is being called now when it was not before, because we now set the default service config to This discussion also made me realize that we should short-circuit this query completely if the map is empty (which happens commonly when we use the default service config), so I've done that as well. And I've cleaned up several other places where we were still checking for a null service config. Let's see if everything passes now... |
|
Yup, looks like everything is happy now! @yashykt, I'd like your review before I merge this. Thanks! |
yashykt
left a comment
There was a problem hiding this comment.
approved with clarifying questions and minor doc nits?
| default_service_config_.reset(); | ||
| return; | ||
| } | ||
| if (service_config_json == nullptr) service_config_json = "{}"; |
There was a problem hiding this comment.
what happens if initially the resolver does not provide a service config, but later provides an incorrect service config?
In that case, are we to continue using the previously used "no service config" or are we to go into transient failure?
Also, please add a comment around the semantics of default_service_config_ that no default service config provided as an arg means an empty service config being used as the default
There was a problem hiding this comment.
If the resolver returns a service config error on its first result, we go into TRANSIENT_FAILURE, and we stay there until it returns a non-error service config (i.e., either no service config or a valid service config).
If the resolver does not provide a service config, the default service config will be used, but otherwise this is considered the same as if it returned a valid config.
Once we have a valid config, if the resolver subsequently returns a service config error, we keep using the previous valid config.
I've added a comment about default_service_config_.
| call_config.method_configs = | ||
| service_config_->GetMethodParsedConfigVector(*args.path); | ||
| } | ||
| call_config.method_configs = |
There was a problem hiding this comment.
Can we add
GPR_DEBUG_ASSERT(service_config_ != nullptr) here with a comment as to why it will never happen?
| ChannelConfigHelper::ApplyServiceConfigResult service_config_result; | ||
| MaybeAddTraceMessagesForAddressChangesLocked(!result.addresses.empty(), | ||
| &trace_strings); | ||
| // Hold a ref to the error so that the entry in trace_strings lives |
There was a problem hiding this comment.
I don't understand this comment but since Resolver::Result::~Result() unrefs service_config_error, I'm not worried about the error being unreffed atleast.
There was a problem hiding this comment.
The result of grpc_error_string(), which we're storing in trace_strings, is a pointer to a string that is owned by the error itself. However, we transfer ownership of the Resolver::Result object containing the error to the LB policy on line 336, and the LB policy will destroy it before that call returns. So we need to hold a ref to the error until line 349, where we're actually using trace_strings.
I've tried to clarify the comment here.
426e653 to
84c8f70
Compare
|
The android failure is an infrastructure timeout. |
This PR accomplishes the following: