KAFKA-12193: Re-resolve IPs after a client disconnects#9902
KAFKA-12193: Re-resolve IPs after a client disconnects#9902mimaison merged 4 commits intoapache:trunkfrom
Conversation
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address.
mimaison
left a comment
There was a problem hiding this comment.
Thanks for the PR. It looks good overall, I left a few comments for minor issues
| ApiVersions apiVersions, | ||
| Sensor throttleTimeSensor, | ||
| LogContext logContext) { | ||
| public NetworkClient(MetadataUpdater metadataUpdater, |
There was a problem hiding this comment.
Do we really need this constructor? As far as I can tell, it's only called by the other 3 above. These could directly call the real one below instead of going through this new one. WDYT?
There was a problem hiding this comment.
Good call, fixed. These are also weirdly inconsistent (selector is before metadata and metadataUpdater sometimes, after them other times), which I guess we could clean up, but I didn't want to remove any existing public constructors.
| resetConnectionSetupTimeout(nodeState); | ||
| if (nodeState.state.isConnected()) { | ||
| // If a connection had previously been established, re-resolve DNS because the IPs may have changed | ||
| nodeState.addresses = Collections.emptyList(); |
There was a problem hiding this comment.
Slightly weird that we're updating a "private" field here.
Also the comment is a bit misleading. We're not re-resolving DNS here but instead clearing state so if we reconnect later, the client will be forced to re-resolve then.
There was a problem hiding this comment.
@mimaison Java allows outer classes to access inner member's private members. This is described in detail at JLS-6.6.1
A member (class, interface, field, or method) of a reference (class, interface, or array) type or a constructor of a class type is accessible only if the type is accessible and the member or constructor is declared to permit access:
- if the member or constructor is declared private, then access is permitted if and only if it occurs within the body of the top level class (§7.6) that encloses the declaration of the member or constructor.
There was a problem hiding this comment.
Agree with @mimaison on the comment, good to make it clear by mentioning that the addresses are cleared to re-resolve later when it reconnects.
There was a problem hiding this comment.
Accessing the field is allowed, but I think @mimaison meant that it makes the code a little confusing and inconsistent. NodeConnectionState has some private fields and some package-private ones, and addresses previous was only modified by methods on NodeConnectionState, not accessed directly by the top-level class. I've moved the empty list assignment into a NodeConnectionState.clearAddresses method
Agree about the comment, I've updated it
satishd
left a comment
There was a problem hiding this comment.
thanks @bob-barrett for addressing the review comments. LGTM
|
@bob-barrett Thanks for the PR. For my understanding, have we considered changing the way backoff is applied instead of re-resolving when an established connection is closed. For instance, we could try all the resolved IPs before applying the backoff. However, we would still have to wait until the connection timeouts for each IP. Also, I suppose that we assume that closing connection is rather infrequent in normal circumstances so the cost of the DNS resolution is low. On the other hand, we relies on the DNS cache if the resolution happens really frequently. Am I getting this right? That seems reasonable. I am trying to convince myself that this is the correct approach to address the issue. |
@dajac That's right, and that's my concern with exhausting the currently-resolved IPs. The default connection setup timeout is 10 seconds. Even without applying the backoff, 10 seconds of delay per additional IP address is pretty noticeable.
I think this is the right way to think about it, yes. There's already a DNS cache, which can be configured as needed. We're effectively adding our own additional cache (with a very long TTL) on top of that by preserving the resolved IPs. If a client has been connected to a server for hours or days and then disconnects, it makes complete sense to me to re-resolve DNS. |
|
@bob-barrett Thanks for your answers. That makes sense and I do agree with the approach. |
dajac
left a comment
There was a problem hiding this comment.
@bob-barrett Thanks for the PR. I have left a meta-comment. Please, let me know what you think. Besides this, the rest looks good.
| static List<InetAddress> resolve(String host, ClientDnsLookup clientDnsLookup, | ||
| HostResolver hostResolver) throws UnknownHostException { |
There was a problem hiding this comment.
This method and its usages look a bit weird to me now. It resolves an host based on clientDnsLookup and a hostResolver which resolves an host at its turn. We also have to pass the HostResolver wherever it is called now. I wonder if we should avoid this level of indirection here and directly encapsulate the entire logic in the HostResolver.
HostResolver would look like this:
public interface HostResolver {
List<InetAddress> resolve(String host, ClientDnsLookup clientDnsLookup) throws UnknownHostException;
}
In NodeConnectionState#currentAddress, we could replace addresses = ClientUtils.resolve(host, clientDnsLookup, hostResolver) by addresses = hostResolver.resolve(host, clientDnsLookup).
The DefaultHostResolver would do what ClientUtils.resolve does today and we can mock it in tests like you did.
Is this something that you have considered?
There was a problem hiding this comment.
@dajac I started making this change, but while working on it, I realized that I don't think I like requiring the HostResolver to understand the client.dns.lookup config. In the same way that MockTime is only used to mock Java built-in methods, I think the HostResolver should only be responsible for the DNS resolution (either real or mocked), and that ClientUtils should be responsible for understanding the Kafka application behavior. That way, tests that mock DNS still test the actual ClientDnsLookup behavior, rather than relying on both the default resolver and any mocked resolvers handle it correctly. It also means that if we ever add an option for ClientDnsLookup or change its behavior, we don't need to remember to update all HostResolver implementations.
There was a problem hiding this comment.
Thanks @bob-barrett. I do agree with your point. Let's keep it as you suggested.
dajac
left a comment
There was a problem hiding this comment.
@bob-barrett Thanks for your reply. I left few minor comments.
| static List<InetAddress> resolve(String host, ClientDnsLookup clientDnsLookup, | ||
| HostResolver hostResolver) throws UnknownHostException { |
There was a problem hiding this comment.
Thanks @bob-barrett. I do agree with your point. Let's keep it as you suggested.
| private final NetworkClient clientWithNoExponentialBackoff = createNetworkClient(reconnectBackoffMsTest); | ||
| private final NetworkClient clientWithStaticNodes = createNetworkClientWithStaticNodes(); | ||
| private final NetworkClient clientWithNoVersionDiscovery = createNetworkClientWithNoVersionDiscovery(); | ||
| private ArrayList<InetAddress> initialAddresses = new ArrayList<>(Arrays.asList( |
There was a problem hiding this comment.
nit: Should we make these two static?
There was a problem hiding this comment.
Catching the UnknownHostException makes it a bit annoying, I threw it in a static block to make the compiler happy. Let me know if you think there's something cleaner.
|
@dajac @mimaison @satishd Thanks for your reviews! In addition to PR feedback, I pushed a new change to |
mimaison
left a comment
There was a problem hiding this comment.
Thanks for the updates. LGTM
|
Tests passed on JDK 8. |
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]> Conflicts (related to junit upgrade): clients/src/test/java/org/apache/kafka/clients/ClientUtilsTest.java clients/src/test/java/org/apache/kafka/clients/ClusterConnectionStatesTest.java clients/src/test/java/org/apache/kafka/clients/NetworkClientTest.java
|
I have picked the commit into 2.7 but could not pick it into 2.6 cleanly. @bob-barrett Could you open a PR for 2.6? |
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
…ltipleIPsWithUseAll TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902
…ltipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
…pache#10067) This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
…pache#10067) This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
… (#10109) This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
… (#10108) This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
|
您好,我已收到您的来信,会尽快回复。谢谢!
|
…ltipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902
* [LI-HOTFIX] Do not resolve bootstrap server during client bootstrap/startup (#315) * do not resolve bootstrap server during client bootstrap/startup Co-authored-by: Ke Hu <[email protected]> * [LI-HOTFIX] use_all_dns_ips as default value for client.dns.lookup (#348) * [LI-HOTFIX] log level change to avoid flooding the logs * [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902 * [LI-HOTFIX] Ignoring the failed tests (#188) TICKET = N/A LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org that used to resolve to more than 1 IPv4 addresses are not only resolving to 1 IPv4 address. The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase. EXIT_CRITERIA = This hotfix can be removed in the next major version rebase Co-authored-by: Ke Hu <[email protected]> Co-authored-by: Ke Hu <[email protected]> Co-authored-by: Lucas Wang <[email protected]>
…ltipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902
* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987) Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode. Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Mickael Maison <[email protected]> Reviewers: Rajini Sivaram <[email protected]> * [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902 * Ignoring the failed tests (#188) [LI-HOTFIX] Ignoring the failed tests (#188) TICKET = N/A LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org that used to resolve to more than 1 IPv4 addresses are not only resolving to 1 IPv4 address. The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase. EXIT_CRITERIA = This hotfix can be removed in the next major version rebase * Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting() * KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644) This applies to the producer, consumer, admin client, connect worker and inter broker communication. `ClientDnsLookup.DEFAULT` has been deprecated and a warning will be logged if it's explicitly set in a client config. Reviewers: Mickael Maison <[email protected]>, Ismael Juma <[email protected]> * Update NetworkClient usage in SSLNetworkClient * [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329) The PR #228 attempted to resolve provided boostrap servers when the metadata is exceeding a staleness threshold. The config is coverred both on producer and consumer, and default behavior without configured value is setting timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses of NetworkClient. The code would fail if empty bootstrap server is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make default value as -1, and omit the code path referencing bootstrap server when we see -1. EXIT_CRITERIA = When #228 is ejected Co-authored-by: Xiongqi Wu <[email protected]> Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Lucas Wang <[email protected]> Co-authored-by: Nicholas Parker <[email protected]> Co-authored-by: Badai Aqrandista <[email protected]> Co-authored-by: Joseph (Ting-Chou) Lin <[email protected]>
If multiple addresses are available. This change is a follow-up to #9902. When re-resolving DNS after disconnecting, it is possible (likely, even) that we will resolve the same set of addresses in the same order, meaning we could attempt to reconnect to the same IP that we just disconnected from. In the case where we disconnected from the IP address because it has become unhealthy (due to a load balancer going down or a network routing layer restarting, for example), the client will need to time out before connecting to the next IP. This essentially unifies the behavior before and after KAFKA-12193: re-resolve after disconnecting while still progressing through the IP list. Reviewers: Lucas Brutschy <[email protected]>, Andrew Schofield <[email protected]>
If multiple addresses are available. This change is a follow-up to apache#9902. When re-resolving DNS after disconnecting, it is possible (likely, even) that we will resolve the same set of addresses in the same order, meaning we could attempt to reconnect to the same IP that we just disconnected from. In the case where we disconnected from the IP address because it has become unhealthy (due to a load balancer going down or a network routing layer restarting, for example), the client will need to time out before connecting to the next IP. This essentially unifies the behavior before and after KAFKA-12193: re-resolve after disconnecting while still progressing through the IP list. Reviewers: Lucas Brutschy <[email protected]>, Andrew Schofield <[email protected]>
If multiple addresses are available. This change is a follow-up to #9902. When re-resolving DNS after disconnecting, it is possible (likely, even) that we will resolve the same set of addresses in the same order, meaning we could attempt to reconnect to the same IP that we just disconnected from. In the case where we disconnected from the IP address because it has become unhealthy (due to a load balancer going down or a network routing layer restarting, for example), the client will need to time out before connecting to the next IP. This essentially unifies the behavior before and after KAFKA-12193: re-resolve after disconnecting while still progressing through the IP list. Cherry-pick of 47e3777 Reviewers: Lucas Brutschy <[email protected]>, Andrew Schofield <[email protected]> Co-authored-by: Bob Barrett <[email protected]>
…pache#10061) This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address. Reviewers: Mickael Maison <[email protected]>, Satish Duggana <[email protected]>, David Jacot <[email protected]>
If multiple addresses are available. This change is a follow-up to apache#9902. When re-resolving DNS after disconnecting, it is possible (likely, even) that we will resolve the same set of addresses in the same order, meaning we could attempt to reconnect to the same IP that we just disconnected from. In the case where we disconnected from the IP address because it has become unhealthy (due to a load balancer going down or a network routing layer restarting, for example), the client will need to time out before connecting to the next IP. This essentially unifies the behavior before and after KAFKA-12193: re-resolve after disconnecting while still progressing through the IP list. Reviewers: Lucas Brutschy <[email protected]>, Andrew Schofield <[email protected]>
If multiple addresses are available. This change is a follow-up to apache#9902. When re-resolving DNS after disconnecting, it is possible (likely, even) that we will resolve the same set of addresses in the same order, meaning we could attempt to reconnect to the same IP that we just disconnected from. In the case where we disconnected from the IP address because it has become unhealthy (due to a load balancer going down or a network routing layer restarting, for example), the client will need to time out before connecting to the next IP. This essentially unifies the behavior before and after KAFKA-12193: re-resolve after disconnecting while still progressing through the IP list. Reviewers: Lucas Brutschy <[email protected]>, Andrew Schofield <[email protected]>
This patch changes the NetworkClient behavior to resolve the target node's hostname after disconnecting from an established connection, rather than waiting until the previously-resolved addresses are exhausted. This is to handle the scenario when the node's IP addresses have changed during the lifetime of the connection, and means that the client does not have to try to connect to invalid IP addresses until it has tried each address.
Committer Checklist (excluded from commit message)