Skip to content

Conversation

@jtc3161
Copy link
Contributor

@jtc3161 jtc3161 commented Jan 5, 2018

Description

This pull request adds a startup flag to tell rethinkdb to skip resolving hostnames via the dns server. This is being submitted as a possible fix to Issue 4600 where a misconfigured DNS server causes a startup delay for the database.

@srh srh added this to the 2.4 milestone Jan 5, 2018
@srh
Copy link
Contributor

srh commented Jan 5, 2018

I can confirm it works! Thank you.

I'll look at the code, figure out what this is all about and what should be the default behavior, and merge this into next and v2.4.x this weekend.

@srh
Copy link
Contributor

srh commented Jan 6, 2018

It looks like the practice of calling getaddrinfo on the result of gethostname was introduced with commit 9fa17c5.

Also, I'll mention issue 4600 here -> #4600 <- so that it has a backlink to this PR.

I'm probably going to add a tweak to make the testing system pass this flag, because I've been annoyed in the past by invocations of ./test/run just resulting in a bunch of timed out tests.

@srh
Copy link
Contributor

srh commented Jan 7, 2018

It looks like we do a few things in get_local_ips:

  1. We call getaddrinfo on the result of gethostname to get some IP addresses to bind to. And we filter out the loopback IP's we see from that.
  2. We use getifaddrs to to get some IP addresses to bind to, which we add to the set we're building. This provides us with the loopback IP's.
  3. Then we filter that set of addresses with a filter provided to the --bind argument.

After that returns we:
4. "// Make sure that all specified addresses were found". We then go through the filter, and check that an interface was found for every address we can bind to in that filter.
5. Then, finally, if the user passed a --bind all, we ignore the results of that work entirely and return an empty set of IP addresses, which gets treated specially by linux_nonthrowing_tcp_listener_t's constructor (see "// If no addresses were supplied, default to 'any'").


Here's what I think we should do instead:

  • If the user passes --bind all, we will just bind on '0.0.0.0' like we end up doing.
  • If the user passes bind options with certain IP addresses, we'll just try binding on those IP addresses and see if the bind() call succeeds. We'll do this instead of item 3 above where we read the set of interfaces and filter through them.
  • If we want to bind to loopback but nothing else (e.g. if the user passes no --bind options), then we'll use getifaddrs to get the loopback IP's. Looking at commits f1bade7 and 0116bb7 in the history of src/arch/address.cc, I'm going to conclude that getting loopback IP's is at least conservative, non-breaking behavior.

So, what would be different? Only the case when the user passes certain IP addresses for us to bind to. Instead of using getaddrinfo and getifaddrs to get interface IP's, then filtering them, and then binding to the IP addresses the user supplied anyway (and checking the error anyway, as we do), we'll just bind to the addresses straight away.

@srh
Copy link
Contributor

srh commented Jan 7, 2018

Just noticed this from commit 3d2e9ae:

    // Ignore loopback addresses - those will be returned by getifaddrs, and
    // getaddrinfo is not so trustworthy.
    // See https://github.com/rethinkdb/rethinkdb/issues/2405
    for (auto it = all_ips.begin(); it != all_ips.end();) {
        auto curr = it++;
        if (!curr->is_loopback()) {
            all_ips.erase(curr);
        }
    }

The curr->is_loopback() test is mistakenly negated.

@srh
Copy link
Contributor

srh commented Jan 7, 2018

It turns out my plan above has one flaw: We still use get_local_ips to compute our own canonical addresses in the rpc/ directory. I haven't looked at that code too closely but it does mean one use of get_local_ips will stick around. I wonder why we compute our own canonical addresses, instead of just having the node receiving a connection use the IP address that it was told connected to it. Well, maybe because then it might start sharing that peer info with other peers.

I don't see any intelligent design behind using DNS resolution to find our network addresses, so I'm going to remove it and solely use getifaddrs instead. So my plan is now:


To remove the --no-dns-resolution flag and just drop DNS resolution entirely (in other words, make the code behave as if that flag was turned on). The exception is Windows, where DNS resolution of our own hostname will remain turned on, for now.

@srh
Copy link
Contributor

srh commented Jan 7, 2018

Oh, I should add: Because curr->is_loopback() was negated, we actually discarded all the non-loopback addresses returned by DNS resolution, and kept the loopback addresses. So our DNS resolution was all for nothing. I'm confident this won't break things.

@srh
Copy link
Contributor

srh commented Jan 8, 2018

These changes have been merged with #6588. Note that I dropped the --no-dns-resolution flag entirely, and made the code act as if the flag were always turned on.

@srh srh closed this Jan 8, 2018
@srh srh modified the milestones: 2.4, 2.3.x Jan 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants