Run a local DNS server in run_tests and add c-ares tests with it#12210
Run a local DNS server in run_tests and add c-ares tests with it#12210apolcyn merged 1 commit intogrpc:masterfrom
Conversation
markdroth
left a comment
There was a problem hiding this comment.
Thanks for doing this! Most of my comments are minor, but there are a few substantive ones. Please let me know if you have any questions about any of this.
There was a problem hiding this comment.
DNS doesn't allow individual TXT records larger than 255 characters, but it does allow you to split up a TXT record into multiple strings. For example:
foo.com. IN TXT "grpc_config=[{\"serviceConfig\":{" "\"loadBalancingPolicy\":\"round_robin\"}}]"
I don't know the protocol details, but I do know that DNS can somehow represent the difference between multiple independent TXT records and a single split-up TXT record. For example, if I add a second TXT record to the zone file in addition to the above, dig shows the following:
roth@segfault:~/grpc> dig @127.0.0.1 -p 1234 foo.com. txt
; <<>> DiG 9.9.5-3ubuntu0.15-Ubuntu <<>> @127.0.0.1 -p 1234 foo.com. txt
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60627
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 1, ADDITIONAL: 2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;foo.com. IN TXT
;; ANSWER SECTION:
foo.com. 3600 IN TXT "a=12345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
01234567890123456789012345678901234567890123456789012345678901234567890123456789
0123456789012345678901234567890123456789012345678901234567890"
foo.com. 3600 IN TXT "grpc_config=[{\"serviceConfig\"
:{" "\"loadBalancingPolicy\":\"round_robin\"}}]"
;; AUTHORITY SECTION:
foo.com. 3600 IN NS ns.foo.com.
;; ADDITIONAL SECTION:
ns.foo.com. 3600 IN A 127.0.0.1
;; Query time: 0 msec
;; SERVER: 127.0.0.1#1234(127.0.0.1)
;; WHEN: Thu Aug 17 08:09:58 PDT 2017
;; MSG SIZE rcvd: 417
Note that the two parts of the service config string are shown as a single split-up entry, but the other TXT record is shown as an independent record.
Note that because the split-up record appears differently at the protocol level, the c-ares code has special handling for this case. So we do need to find a way to test this.
From a quick glance at the dnslib code, I'm not sure it can handle this right now. If not, perhaps we can fix it to do that?
There was a problem hiding this comment.
We should also make sure to test the case where the sum total of the TXT records is larger than 512 bytes, because that will cause DNS to fall back from UDP to TCP. (I was having some problems testing this with the service config code, so this may not work right yet.)
There was a problem hiding this comment.
marked the UDP -> TCP test case as a TODO.
from offline chat, this needs some change to the ares wrapper code to work.
The problem that's currently happening is:
- c-ares receives a truncated DNS response and re-sends the query over TCP
- c-ares now signals interest in the TCP socket and the previous UDP socket (sometimes needed), and we still register
notify_on_readon the UDP socket fd_orphanfails because the callback is still pending
I think we need to "maybe cancel" the callback on all sockets to cleanup - cc @zyc
There was a problem hiding this comment.
I think you meant to tag @y-zeng here.
Yuchen, do you have time to look at this? We don't need to block this PR on it, but I think we will need to fix this before we can make c-ares the default DNS resolver.
There was a problem hiding this comment.
In C++, instead of using static, it's cleaner to put things in an anonymous namespace.
There was a problem hiding this comment.
done, moved everything into an anonymous namespace
There was a problem hiding this comment.
This is closing the testing namespace, not the anonymous namespace.
There was a problem hiding this comment.
got rid of the whole matches_any function
There was a problem hiding this comment.
Checking that the value is present in expected_addrs isn't quite precise enough. If the expected addresses are (A, B) and the actual addresses are (A, A), the test will not fail as it should.
Since we're using C++ here, can we use some of the GMock constructs to make this more readable? For example, if we can get both the expected and actual addresses into vectors of grpc_lb_address objects, perhaps we can use something like UnorderedElementsAreArray to concisely check the result?
There was a problem hiding this comment.
Had some trouble getting the gmock construct to work - I'm not certain why but I think think that it's because this test isn't following a needed framework (running as an ordinary binary, and not with RUN_ALL_TESTS and TEST_ test cases).
Can look into this further if we want to use that though
There was a problem hiding this comment.
I don't see any reason why we can't structure this like any other C++ test, where the individual tests are defined using the TEST() macros and invoked via RUN_ALL_TESTS().
There was a problem hiding this comment.
If the service config specifies the LB policy, then we should also check that the resolver has set GRPC_ARG_LB_POLICY_NAME.
There was a problem hiding this comment.
done. I'd like to check offline though that the expected lb policy names that I have set (which are the ones appearing), are correct. E.g., my understanding is that if an SRV address is resolved, then grpclb should be used - but it appears that the policy name channel arg is actually null in that case
There was a problem hiding this comment.
That's correct. The resolver's behavior is that it will only set GRPC_ARG_LB_POLICY_NAME if the service config explicitly specifies an LB policy, otherwise it is left unset. When the resolver returns its results to the client_channel code, that's where we check whether any of the addresses are balancer addresses and if so, force the use of grpclb regardless of what policy (if any) was present in the service config.
There was a problem hiding this comment.
I don't think we want case-insensitive comparison here -- the JSON data is compared in a case-sensitive way.
There was a problem hiding this comment.
This is not actually an end-to-end test. Can we rename it to something like resolver_test? (Please update function names, file names, etc).
There was a problem hiding this comment.
Renamed this to resolver_component_test...
I wanted make it more unique than resolver_test, for namespacing/regex matching purposes, but can change if we want to
4254aa5 to
0538da0
Compare
|
tried to address all comments, PTAL |
markdroth
left a comment
There was a problem hiding this comment.
This is looking pretty good. Just a few issues remaining.
test/cpp/naming/gen_build_yaml.py
Outdated
There was a problem hiding this comment.
Could just say return test_case['expected_lb_policy'] or ''.
There was a problem hiding this comment.
No need for the ? true : false part.
There was a problem hiding this comment.
We should probably use C++ style names in this test -- e.g., ConvertStringToBool() instead of convert_string_to_bool().
There was a problem hiding this comment.
Converted some but not all, I was a little unsure here. Kept types used by C funcs (e.g., args_struct), and functions called back from C, as C-style, and converted the top-level to C++
There was a problem hiding this comment.
This test is written in C++, so I think it's fine to use C++ style even for things that are used as C callbacks.
There was a problem hiding this comment.
Please declare variables right before they are used, rather than declaring them all at once at the top of a function.
There was a problem hiding this comment.
Suggest calling this is_balancer.
There was a problem hiding this comment.
That's correct. The resolver's behavior is that it will only set GRPC_ARG_LB_POLICY_NAME if the service config explicitly specifies an LB policy, otherwise it is left unset. When the resolver returns its results to the client_channel code, that's where we check whether any of the addresses are balancer addresses and if so, force the use of grpclb regardless of what policy (if any) was present in the service config.
test/cpp/naming/gen_build_yaml.py
Outdated
There was a problem hiding this comment.
Instead of having an expected_config_index attribute, I suggest having a expected_service_config attribute that contains the expected string. This is because when the TXT record contains multiple choices, the resolver is responsible for picking one of those choices. The service config it returns should include only the selected choice -- i.e., it's a subset of the TXT record.
For example, one of the tests you've defined has the following TXT record:
grpc_config=[
{"percentage":0,
"serviceConfig":{"loadBalancingPolicy":"round_robin","methodConfig":[{"name":[{"method":"Foo","service":"NeverPickedService","waitForReady":true}]}]}},
{"percentage":100,
"serviceConfig":{"loadBalancingPolicy":"round_robin","methodConfig":[{"name":[{"method":"Foo","service":"AlwaysPickedService","waitForReady":true}]}]}}]
In this case, the resolver should return only the second service config:
{"loadBalancingPolicy":"round_robin","methodConfig":[{"name":[{"method":"Foo","service":"AlwaysPickedService","waitForReady":true}]}]}}
There was a problem hiding this comment.
done - previously string found in python, that change simplified the code
There was a problem hiding this comment.
I don't see any reason why we can't structure this like any other C++ test, where the individual tests are defined using the TEST() macros and invoked via RUN_ALL_TESTS().
There was a problem hiding this comment.
Looks good. The format might be a bit less ambiguous if we used a different delimiter character for the (address,is_balancer) pairs than between the pairs. But this is fine as-is too.
There was a problem hiding this comment.
I think you meant to tag @y-zeng here.
Yuchen, do you have time to look at this? We don't need to block this PR on it, but I think we will need to fix this before we can make c-ares the default DNS resolver.
|
tried to address all comments again, also renamed records PTAL |
|
cc @jtattermusch for how this is tying in with |
There was a problem hiding this comment.
I think port_server.py also starting a DNS server as a side effect is a pretty dirty solution (no one would expect such behavior from something called port_server). Also, port server is started in other circumstances - like in performance benchmarks, where you don't need the DNS server at all. I think overall this solution will lead to problems in the future.
There was a problem hiding this comment.
@nicolasnoble recommended this solution, since we need a hermetic DNS server in the tests. I suppose we could have the tests run yet another binary, but that seems unnecessarily complicated.
There was a problem hiding this comment.
References should only be used for const parameters.
There was a problem hiding this comment.
This doesn't say how the is_balancer bit is encoded.
Also, the parens seem a little cumbersome. I was thinking of something more like "address0,is_balancer0;address1,is_balancer1;..." -- i.e., use comma between the two elements of each pair and semicolon between the pairs. Or perhaps "address0=is_balancer0,address1=is_balancer1,...".
There was a problem hiding this comment.
Please eliminate unnecessary blank lines within functions.
In general, I find that if you really need to separate code into logical "chunks", it's better to use a comment line documenting the purpose of the next section than a blank line.
There was a problem hiding this comment.
If you save this as a C++ string instead of a C string, then you can change check_service_config_result_locked() to just use EXPECT_EQ().
(Actually, even if you keep it as a C string, you can use EXPECT_STREQ().)
There was a problem hiding this comment.
@nicolasnoble recommended this solution, since we need a hermetic DNS server in the tests. I suppose we could have the tests run yet another binary, but that seems unnecessarily complicated.
|
BTW, I meant to say, this is looking really good. I think the remaining issues are mostly cosmetic. |
|
After some chat, I am thinking that I will probably restructure how the port server is started: instead of starting one global DNS server along side the port server, a subprocess DNS server, with a dynamically assigned port, can be spawned per-test. Doing that would isolate the DNS server from tests that don't need it, prevent need of changes to the port server, and also help in making changes to the test cases here (backwards compatibility of the test records is no longer a big deal since the DNS server is never shared across tests). |
|
Addressed comments in most recent commit, and changed the DNS server to launch per-test run, when ran with One thing that is significant here though is the race between the sub-process standing up it's server and the c-ares DNS query beginning - I found that a 1-second sleep between starting the server and beginning the query was needed to prevent failures. With that sleep:
I'm not sure right now if the race there is something that can (or should) be solved with c-ares retry config, or if there is some unexpected behavior around this with the different pollers. PTAL |
|
|
markdroth
left a comment
There was a problem hiding this comment.
I am a little concerned that having to start up a separate server for each test will make things more brittle, since there are a lot more moving parts here that need to be handled right. I'm not sure that all of the additional complexity is worth not having the DNS server in the port server binary.
There was a problem hiding this comment.
Suggest moving the flag checking down into main().
There was a problem hiding this comment.
Maybe also log the pid, if there's an easy way to get at it?
There was a problem hiding this comment.
I couldn't see an easy way to do this. Even with gpr_subprocess directly, the pid info appears to be platform-specific and not exposed. Using fork/exec directly I could, but I'd lean towards gpr_subprocess and its SubProcess (I think this wrapper is more portable too)
There was a problem hiding this comment.
Still need to remove the == true here.
There was a problem hiding this comment.
Suggest moving this down into the main logic below.
There was a problem hiding this comment.
Let's put the rest of this in a main() function, and invoke it only if __name__ == '__main__'.
|
IMO I agree with the earlier push back against the DNS server running with the port server. While the DNS server per-test subprocess adds more complexity to the test, I'm worried that the DNS server running with the port server adds too much long-lived and global state to the overall test environment, with higher risk of infrastructure bugs and manual work. The biggest problems I see with the DNS server in the port server are:
|
|
Is the |
|
Yes, that's fixed. The thing that I said wasn't fixed was that there are still a bunch of C-style names that should be converted to C++-style. |
markdroth
left a comment
There was a problem hiding this comment.
I understand the reasoning for having the DNS server be separate from the portserver. I am willing to give it a shot, but we're going to have to monitor it closely to make sure it doesn't fail.
What do you think about writing a wrapper shell script to get the port for the DNS server, start the server, run dig, and then run the test program with --local_dns_server_address=localhost:${port}? The wrapper script could trap SIGTERM and kill both the test and the DNS server when it terminates. That would (a) make it a bit more robust for handling termination and (b) more cleanly separate the test from the infrastructure that we need to run it automatically.
There was a problem hiding this comment.
Still need to remove the == true here.
|
a59e37a to
3a31596
Compare
|
1 similar comment
|
|
|
|
||
| } // namespace grpc | ||
|
|
||
| DEFINE_bool( |
There was a problem hiding this comment.
Please put these flag definitions up at the top of the file, right under the includes.
| gpr_cv_init(&test_driver_cv); | ||
| int test_driver_done = 0; | ||
| register_sighandler(); | ||
| std::thread sig_handling_thread(RunSigHandlingThread, test_driver, |
There was a problem hiding this comment.
Ah, I didn't realize that there wasn't a form of SubProcess::Join() that takes a timeout. In that case, I guess this is fine as-is. It's a shame that we need so much machinery to do something so simple, but I don't see an obviously better approach.
|
|
||
| void RunSigHandlingThread(SubProcess *test_driver, gpr_mu *test_driver_mu, | ||
| gpr_cv *test_driver_cv, int *test_driver_done) { | ||
| for (size_t i = 0; i < kTestTimeoutSeconds && !abort_wait_for_child; i++) { |
There was a problem hiding this comment.
I don't think this would cause the response to SIGINIT / SIGTERM to be delayed. The overall mechanics of the loop would not be different -- we would still be blocking for no more than 1 second at a time.
What I'm thinking is something like this:
gpr_timespec deadline = gpr_time_add(
gpr_now(GPR_CLOCK_MONOTONIC),
gpr_time_from_seconds(kTestTimeoutSeconds, GPR_TIMESPAN));
while (true) {
gpr_timespec now = gpr_now(GPR_CLOCK_MONOTONIC);
if (gpr_time_cmp(now, deadline) <= 0) break;
gpr_mu_lock(test_driver_mu);
if (*test_driver_done) {
gpr_mu_unlock(test_driver_mu);
return;
}
gpr_timespan wait_deadline = gpr_time_min(
deadline,
gpr_time_add(now, gpr_time_from_seconds(1, GPR_TIMESPAN)));
gpr_cv_wait(test_driver_cv, test_driver_mu, wait_deadline);
gpr_mu_unlock(test_driver_mu);
}
|
|
|
da75708 to
43a72c2
Compare
|
thanks for review! FYI squashed the reviewed commits into the single commit with the message: "local dns server". Made a couple of changes, starting from "fix native asan failure in driver", to fix remaining found issues in tests. |
|
Please squash all commits into one before merging. Thanks! |
|
|
43a72c2 to
a37decd
Compare
|
|
|
test failures all look unrelated (will squash again before merging) |
|
I think you still need approval from @nicolasnoble before you can merge. |
|
Ah, right, OWNERS and everything. |
a37decd to
27bf05d
Compare
Starts running a local DNS server (using
dnslib) in the same process as the port server. It also adds a revised version of the c-ares test in #12010 to run against it, and fixes a couple of memory leaks within thegrpc_cares_wrapper.c.This keeps
resolver_test_record_groups.yamlas the source of truth for all test records.To run the tests added here:
tools/run_tests/run_tests.py -l c++ -r resolver_component_testPlanning on creating a follow-up PR that uses what's added here for running GCE DNS integration tests (as a revised version of #12010)
-------------- update: ----------------
/quitquitquithandler would still leave the port server process running. On shared resource workers including the jenkins mac and jenkins windows workers, this bug brought up a larger issue with this PR, in that changes to the set of DNS test records must always be backwards compatible as long as tests are sharing resources (IIC while there are still mac/windows tests on jenkins)(note all workers right now are ok; non-docker workers failed to start the faulty port server in this PR because
dnslibwasn't present)