Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable by nickbp · Pull Request #6305 · apache/kafka

nickbp · 2019-02-22T01:35:15Z

When attempting to get topic list via KafkaAdminClient against a server that isn't resolvable, the worker thread can get killed as follows, leading to a zombie KafkaAdminClient:

ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
java.lang.IllegalStateException: No entry found for connection 0
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
    at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
    at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
    at java.lang.Thread.run(Thread.java:748)

It looks like cause is a bug in state handling between NetworkClient and ClusterConnectionStates:

NetworkClient.ready() invokes this.initiateConnect() as seen in the above stacktrace
NetworkClient.initiateConnect() invokes ClusterConnectionStates.connecting(), which internally invokes ClientUtils.resolve() to resolve the host when creating an entry for the connection.
If this host lookup fails, a UnknownHostException can be thrown back to NetworkClient.initiateConnect() and the connection entry is not created in ClusterConnectionStates. This exception doesn't currently get logged so this is a guess on my part.
NetworkClient.initiateConnect() catches the exception and attempts to call ClusterConnectionStates.disconnected(), which throws an IllegalStateException because no entry had yet been created due to the lookup failure.
This IllegalStateException ends up killing the worker thread and KafkaAdminClient gets stuck, never returning from listTopics().

This PR includes a unit test which reproduces the original issue (matching stacktrace) and verifies the fix.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

ijuma · 2019-02-22T02:43:51Z

Cc @cmccabe

cmccabe · 2019-02-22T19:30:48Z

Good find, @nickbp . The fix seems a bit incomplete in the sense that there are more exceptions that we could get, besides UnknownHostException. It seems like perhaps this line:

        } else {
            nodeState.put(id, new NodeConnectionState(ConnectionState.CONNECTING, now,
                this.reconnectBackoffInitMs, host, clientDnsLookup));
        }

Could catch the possible exceptions and then create a Disconnected connection, and rethrow, so that we didn't get more issues later on if something else throws here.

Or perhaps there is another way to make this more robust-- that was just the first that came to mind.

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

nickbp · 2019-02-24T22:24:31Z

Good point. I was able to change the strategy of the fix to instead clean up when the host resolution actually occurs within ClusterConnectionStates.

Previously the resolve was happening automatically with the initial construction of the CONNECTING entry, leading to a missing entry and IllegalStateException when the connection was then marked disconnected. This fix switches to performing no resolution when the entry is first created, instead only performing it on-demand when currentAddress() is called explicitly. This allows NetworkClient to cleanly mark the entry as disconnected when resolution fails.

cmccabe · 2019-02-26T18:25:25Z

-     * @param now the current time
-     * @throws UnknownHostException 
+     * @param now the current time in ms
     */


Can we add JavaDoc for host and clientDnsLookup, while we're changing this?

cmccabe · 2019-02-26T18:38:01Z

LGTM

cmccabe · 2019-02-26T18:38:06Z

retest this please

nickbp

FWIW I locally ran the failing CI tests against the branch and they were successful:

$ gradle :core:test --tests=*ConsumerBounce*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.ConsumerBounceTest > testCloseDuringRebalance PASSED
kafka.api.ConsumerBounceTest > testClose PASSED
kafka.api.ConsumerBounceTest > testSeekAndCommitWithBrokerFailures PASSED
kafka.api.ConsumerBounceTest > testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize PASSED
kafka.api.ConsumerBounceTest > testSubscribeWhenTopicUnavailable PASSED
kafka.api.ConsumerBounceTest > testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup PASSED
kafka.api.ConsumerBounceTest > testConsumptionWithBrokerFailures SKIPPED

BUILD SUCCESSFUL in 4m 41s
14 actionable tasks: 9 executed, 5 up-to-date

$ gradle :core:test --tests=*CustomQuotaCallbackTest*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.CustomQuotaCallbackTest > testCustomQuotaCallback PASSED

BUILD SUCCESSFUL in 1m 24s
14 actionable tasks: 8 executed, 6 up-to-date

nickbp · 2019-02-27T00:05:55Z

-     * @param now the current time
-     * @throws UnknownHostException 
+     * @param now the current time in ms
     */


nickbp · 2019-03-04T00:20:30Z

Looks like the build is green now, btw. Anything else needed before a merge?

ijuma · 2019-03-04T12:33:37Z

@cmccabe looks like this is ready to be merged since you approved it and the build is green.

cmccabe · 2019-03-07T08:07:32Z

Thanks, @nickbp !

…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()

ijuma · 2019-04-03T14:55:01Z

@cmccabe I cherry-picked this to the 2.1 and 2.2 branches as it seems like an important fix. Also cc @rajinisivaram who is familiar with this code.

manderson23 · 2019-04-03T19:46:16Z

@ijuma is this likely to go in to the 2.2 branch soon?

…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()

ijuma · 2019-04-24T04:50:42Z

@manderson23 it's done now. I was running the tests locally before doing the push and had ran out of time.

kpmeen · 2019-05-19T09:56:28Z

I stumbled onto this issue a couple of days ago when trying to run an application in K8s. Any estimates for when a build will be ready with this fix?

nickbp · 2019-05-20T03:17:51Z

I likewise originally encountered this in a K8s environment. The core issue leading to this bug arising was the result of DNS requests round-robining between CoreDNS instances which were individually eventually consistent but as a group not consistent at all, with entries popping in and out of existence as pods were brought up and some but not all CoreDNS instances knew about them at any given moment.

To avoid the inconsistency issues, I was able to use the following workarounds in my K8s 1.13.x environment. With these workarounds a patch to the Kafka clients was not required to avoid the issue. YMMV:

kubectl edit service kube-dns --namespace=kube-system: Edit the sessionAffinity value to be ClientIP rather than None. This avoids rotating DNS lookups between CoreDNS instances, which can result in host records appearing to pop in and out of existence. Confirm the change with kubectl get service kube-dns --namespace=kube-system -o yaml
kubectl edit configmap coredns --namespace=kube-system: Edit the Corefile entry to have cache 30. By default this is 300. This ensures that DNS caches are updated every 30s rather than 5 minutes, which should result in pods deploying more quickly after having set sessionAffinity=ClientIP in the previous step. Confirm the change with kubectl get configmap coredns --namespace=kube-system -o yaml

kpmeen · 2019-05-21T06:18:16Z

@nickbp interesting...

I'll give that a go too. My workaround was to implement a simple, blocking, utility function that will attempt to resolve the host(s) for a finite duration n. If the host(s) cannot be resolved within the configured time, the application is terminated. It works for me at least.

nickbp · 2019-05-21T06:29:41Z

Yeah the issue I was seeing was that after a utility check found the entry to be resolvable, it could still then become "unresolvable" due to dns lookup round-robining to a dns instance that didn't know about the host yet

…olvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()

bertdewolf · 2020-09-14T08:36:49Z

Is this available in any release? I can't seem to find it...

…olvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()

* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987) Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode. Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Mickael Maison <[email protected]> Reviewers: Rajini Sivaram <[email protected]> * [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116) TICKET = N/A LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked. EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902 * Ignoring the failed tests (#188) [LI-HOTFIX] Ignoring the failed tests (#188) TICKET = N/A LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org that used to resolve to more than 1 IPv4 addresses are not only resolving to 1 IPv4 address. The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase. EXIT_CRITERIA = This hotfix can be removed in the next major version rebase * Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting() * KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644) This applies to the producer, consumer, admin client, connect worker and inter broker communication. `ClientDnsLookup.DEFAULT` has been deprecated and a warning will be logged if it's explicitly set in a client config. Reviewers: Mickael Maison <[email protected]>, Ismael Juma <[email protected]> * Update NetworkClient usage in SSLNetworkClient * [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329) The PR #228 attempted to resolve provided boostrap servers when the metadata is exceeding a staleness threshold. The config is coverred both on producer and consumer, and default behavior without configured value is setting timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses of NetworkClient. The code would fail if empty bootstrap server is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make default value as -1, and omit the code path referencing bootstrap server when we see -1. EXIT_CRITERIA = When #228 is ejected Co-authored-by: Xiongqi Wu <[email protected]> Co-authored-by: Edoardo Comar <[email protected]> Co-authored-by: Lucas Wang <[email protected]> Co-authored-by: Nicholas Parker <[email protected]> Co-authored-by: Badai Aqrandista <[email protected]> Co-authored-by: Joseph (Ting-Chou) Lin <[email protected]>

Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

c6cff20

nickbp changed the title ~~Fix for KAFKA-7974: Avoid calling disconnect() when not yet connecting~~ Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable Feb 22, 2019

cmccabe reviewed Feb 26, 2019

View reviewed changes

Add Javadoc to ClusterConnectionStates.connecting()

54d1baa

nickbp commented Feb 27, 2019

View reviewed changes

cmccabe merged commit 7f6bf95 into apache:trunk Mar 7, 2019

nickbp deleted the KAFKA-7974-dns-failure-kills-adminclient-thread branch April 14, 2019 23:15

Conversation

nickbp commented Feb 22, 2019

Committer Checklist (excluded from commit message)

Uh oh!

ijuma commented Feb 22, 2019

Uh oh!

cmccabe commented Feb 22, 2019

Uh oh!

nickbp commented Feb 24, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmccabe Feb 26, 2019

Choose a reason for hiding this comment

Uh oh!

nickbp Feb 27, 2019

Choose a reason for hiding this comment

Uh oh!

cmccabe commented Feb 26, 2019

Uh oh!

cmccabe commented Feb 26, 2019

Uh oh!

nickbp left a comment

Choose a reason for hiding this comment

Uh oh!

nickbp Feb 27, 2019

Choose a reason for hiding this comment

Uh oh!

nickbp commented Mar 4, 2019

Uh oh!

ijuma commented Mar 4, 2019

Uh oh!

cmccabe commented Mar 7, 2019

Uh oh!

ijuma commented Apr 3, 2019

Uh oh!

manderson23 commented Apr 3, 2019

Uh oh!

ijuma commented Apr 24, 2019

Uh oh!

kpmeen commented May 19, 2019

Uh oh!

nickbp commented May 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kpmeen commented May 21, 2019

Uh oh!

nickbp commented May 21, 2019

Uh oh!

bertdewolf commented Sep 14, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

nickbp commented Feb 24, 2019 •

edited

Loading

nickbp commented May 20, 2019 •

edited

Loading