Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable#6305
Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable#6305cmccabe merged 3 commits intoapache:trunkfrom nickbp:KAFKA-7974-dns-failure-kills-adminclient-thread
Conversation
|
Cc @cmccabe |
|
Good find, @nickbp . The fix seems a bit incomplete in the sense that there are more exceptions that we could get, besides Could catch the possible exceptions and then create a Or perhaps there is another way to make this more robust-- that was just the first that came to mind. |
Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.
|
Good point. I was able to change the strategy of the fix to instead clean up when the host resolution actually occurs within Previously the resolve was happening automatically with the initial construction of the |
| * @param now the current time | ||
| * @throws UnknownHostException | ||
| * @param now the current time in ms | ||
| */ |
There was a problem hiding this comment.
Can we add JavaDoc for host and clientDnsLookup, while we're changing this?
|
LGTM |
|
retest this please |
nickbp
left a comment
There was a problem hiding this comment.
FWIW I locally ran the failing CI tests against the branch and they were successful:
$ gradle :core:test --tests=*ConsumerBounce*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.ConsumerBounceTest > testCloseDuringRebalance PASSED
kafka.api.ConsumerBounceTest > testClose PASSED
kafka.api.ConsumerBounceTest > testSeekAndCommitWithBrokerFailures PASSED
kafka.api.ConsumerBounceTest > testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize PASSED
kafka.api.ConsumerBounceTest > testSubscribeWhenTopicUnavailable PASSED
kafka.api.ConsumerBounceTest > testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup PASSED
kafka.api.ConsumerBounceTest > testConsumptionWithBrokerFailures SKIPPED
BUILD SUCCESSFUL in 4m 41s
14 actionable tasks: 9 executed, 5 up-to-date
$ gradle :core:test --tests=*CustomQuotaCallbackTest*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.CustomQuotaCallbackTest > testCustomQuotaCallback PASSED
BUILD SUCCESSFUL in 1m 24s
14 actionable tasks: 8 executed, 6 up-to-date
| * @param now the current time | ||
| * @throws UnknownHostException | ||
| * @param now the current time in ms | ||
| */ |
|
Looks like the build is green now, btw. Anything else needed before a merge? |
|
@cmccabe looks like this is ready to be merged since you approved it and the build is green. |
|
Thanks, @nickbp ! |
…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
|
@cmccabe I cherry-picked this to the 2.1 and 2.2 branches as it seems like an important fix. Also cc @rajinisivaram who is familiar with this code. |
|
@ijuma is this likely to go in to the 2.2 branch soon? |
…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
|
@manderson23 it's done now. I was running the tests locally before doing the push and had ran out of time. |
|
I stumbled onto this issue a couple of days ago when trying to run an application in K8s. Any estimates for when a build will be ready with this fix? |
|
I likewise originally encountered this in a K8s environment. The core issue leading to this bug arising was the result of DNS requests round-robining between CoreDNS instances which were individually eventually consistent but as a group not consistent at all, with entries popping in and out of existence as pods were brought up and some but not all CoreDNS instances knew about them at any given moment. To avoid the inconsistency issues, I was able to use the following workarounds in my K8s 1.13.x environment. With these workarounds a patch to the Kafka clients was not required to avoid the issue. YMMV:
|
|
@nickbp interesting... I'll give that a go too. My workaround was to implement a simple, blocking, utility function that will attempt to resolve the host(s) for a finite duration |
|
Yeah the issue I was seeing was that after a utility check found the entry to be resolvable, it could still then become "unresolvable" due to dns lookup round-robining to a dns instance that didn't know about the host yet |
…olvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
|
Is this available in any release? I can't seem to find it... |
…olvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987) Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode. Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Mickael Maison <[email protected]> Reviewers: Rajini Sivaram <[email protected]> * [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902 * Ignoring the failed tests (#188) [LI-HOTFIX] Ignoring the failed tests (#188) TICKET = N/A LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org that used to resolve to more than 1 IPv4 addresses are not only resolving to 1 IPv4 address. The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase. EXIT_CRITERIA = This hotfix can be removed in the next major version rebase * Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting() * KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644) This applies to the producer, consumer, admin client, connect worker and inter broker communication. `ClientDnsLookup.DEFAULT` has been deprecated and a warning will be logged if it's explicitly set in a client config. Reviewers: Mickael Maison <[email protected]>, Ismael Juma <[email protected]> * Update NetworkClient usage in SSLNetworkClient * [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329) The PR #228 attempted to resolve provided boostrap servers when the metadata is exceeding a staleness threshold. The config is coverred both on producer and consumer, and default behavior without configured value is setting timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses of NetworkClient. The code would fail if empty bootstrap server is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make default value as -1, and omit the code path referencing bootstrap server when we see -1. EXIT_CRITERIA = When #228 is ejected Co-authored-by: Xiongqi Wu <[email protected]> Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Lucas Wang <[email protected]> Co-authored-by: Nicholas Parker <[email protected]> Co-authored-by: Badai Aqrandista <[email protected]> Co-authored-by: Joseph (Ting-Chou) Lin <[email protected]>
When attempting to get topic list via KafkaAdminClient against a server that isn't resolvable, the worker thread can get killed as follows, leading to a zombie KafkaAdminClient:
It looks like cause is a bug in state handling between
NetworkClientandClusterConnectionStates:NetworkClient.ready()invokesthis.initiateConnect()as seen in the above stacktraceNetworkClient.initiateConnect()invokesClusterConnectionStates.connecting(), which internally invokesClientUtils.resolve()to resolve the host when creating an entry for the connection.UnknownHostExceptioncan be thrown back toNetworkClient.initiateConnect()and the connection entry is not created inClusterConnectionStates. This exception doesn't currently get logged so this is a guess on my part.NetworkClient.initiateConnect()catches the exception and attempts to callClusterConnectionStates.disconnected(), which throws anIllegalStateExceptionbecause no entry had yet been created due to the lookup failure.IllegalStateExceptionends up killing the worker thread andKafkaAdminClientgets stuck, never returning fromlistTopics().This PR includes a unit test which reproduces the original issue (matching stacktrace) and verifies the fix.
Committer Checklist (excluded from commit message)