Skip to content

Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable#6305

Merged
cmccabe merged 3 commits intoapache:trunkfrom
nickbp:KAFKA-7974-dns-failure-kills-adminclient-thread
Mar 7, 2019
Merged

Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable#6305
cmccabe merged 3 commits intoapache:trunkfrom
nickbp:KAFKA-7974-dns-failure-kills-adminclient-thread

Conversation

@nickbp
Copy link
Copy Markdown
Contributor

@nickbp nickbp commented Feb 22, 2019

When attempting to get topic list via KafkaAdminClient against a server that isn't resolvable, the worker thread can get killed as follows, leading to a zombie KafkaAdminClient:

ERROR [kafka-admin-client-thread | adminclient-1] 2019-02-18 01:00:45,597 KafkaThread.java:51 - Uncaught exception in thread 'kafka-admin-client-thread | adminclient-1':
java.lang.IllegalStateException: No entry found for connection 0
    at org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:330)
    at org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:134)
    at org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:921)
    at org.apache.kafka.clients.NetworkClient.ready(NetworkClient.java:287)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.sendEligibleCalls(KafkaAdminClient.java:898)
    at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1113)
    at java.lang.Thread.run(Thread.java:748)

It looks like cause is a bug in state handling between NetworkClient and ClusterConnectionStates:

  • NetworkClient.ready() invokes this.initiateConnect() as seen in the above stacktrace
  • NetworkClient.initiateConnect() invokes ClusterConnectionStates.connecting(), which internally invokes ClientUtils.resolve() to resolve the host when creating an entry for the connection.
  • If this host lookup fails, a UnknownHostException can be thrown back to NetworkClient.initiateConnect() and the connection entry is not created in ClusterConnectionStates. This exception doesn't currently get logged so this is a guess on my part.
  • NetworkClient.initiateConnect() catches the exception and attempts to call ClusterConnectionStates.disconnected(), which throws an IllegalStateException because no entry had yet been created due to the lookup failure.
  • This IllegalStateException ends up killing the worker thread and KafkaAdminClient gets stuck, never returning from listTopics().

This PR includes a unit test which reproduces the original issue (matching stacktrace) and verifies the fix.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@nickbp nickbp changed the title Fix for KAFKA-7974: Avoid calling disconnect() when not yet connecting Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable Feb 22, 2019
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Feb 22, 2019

Cc @cmccabe

@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Feb 22, 2019

Good find, @nickbp . The fix seems a bit incomplete in the sense that there are more exceptions that we could get, besides UnknownHostException. It seems like perhaps this line:

        } else {
            nodeState.put(id, new NodeConnectionState(ConnectionState.CONNECTING, now,
                this.reconnectBackoffInitMs, host, clientDnsLookup));
        }

Could catch the possible exceptions and then create a Disconnected connection, and rethrow, so that we didn't get more issues later on if something else throws here.

Or perhaps there is another way to make this more robust-- that was just the first that came to mind.

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.
@nickbp
Copy link
Copy Markdown
Contributor Author

nickbp commented Feb 24, 2019

Good point. I was able to change the strategy of the fix to instead clean up when the host resolution actually occurs within ClusterConnectionStates.

Previously the resolve was happening automatically with the initial construction of the CONNECTING entry, leading to a missing entry and IllegalStateException when the connection was then marked disconnected. This fix switches to performing no resolution when the entry is first created, instead only performing it on-demand when currentAddress() is called explicitly. This allows NetworkClient to cleanly mark the entry as disconnected when resolution fails.

* @param now the current time
* @throws UnknownHostException
* @param now the current time in ms
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add JavaDoc for host and clientDnsLookup, while we're changing this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Feb 26, 2019

LGTM

@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Feb 26, 2019

retest this please

Copy link
Copy Markdown
Contributor Author

@nickbp nickbp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I locally ran the failing CI tests against the branch and they were successful:

$ gradle :core:test --tests=*ConsumerBounce*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.ConsumerBounceTest > testCloseDuringRebalance PASSED
kafka.api.ConsumerBounceTest > testClose PASSED
kafka.api.ConsumerBounceTest > testSeekAndCommitWithBrokerFailures PASSED
kafka.api.ConsumerBounceTest > testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize PASSED
kafka.api.ConsumerBounceTest > testSubscribeWhenTopicUnavailable PASSED
kafka.api.ConsumerBounceTest > testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup PASSED
kafka.api.ConsumerBounceTest > testConsumptionWithBrokerFailures SKIPPED

BUILD SUCCESSFUL in 4m 41s
14 actionable tasks: 9 executed, 5 up-to-date
$ gradle :core:test --tests=*CustomQuotaCallbackTest*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.CustomQuotaCallbackTest > testCustomQuotaCallback PASSED

BUILD SUCCESSFUL in 1m 24s
14 actionable tasks: 8 executed, 6 up-to-date

* @param now the current time
* @throws UnknownHostException
* @param now the current time in ms
*/
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@nickbp
Copy link
Copy Markdown
Contributor Author

nickbp commented Mar 4, 2019

Looks like the build is green now, btw. Anything else needed before a merge?

@ijuma
Copy link
Copy Markdown
Member

ijuma commented Mar 4, 2019

@cmccabe looks like this is ready to be merged since you approved it and the build is green.

@cmccabe cmccabe merged commit 7f6bf95 into apache:trunk Mar 7, 2019
@cmccabe
Copy link
Copy Markdown
Contributor

cmccabe commented Mar 7, 2019

Thanks, @nickbp !

ijuma pushed a commit that referenced this pull request Apr 3, 2019
…olvable (#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Apr 3, 2019

@cmccabe I cherry-picked this to the 2.1 and 2.2 branches as it seems like an important fix. Also cc @rajinisivaram who is familiar with this code.

@manderson23
Copy link
Copy Markdown

@ijuma is this likely to go in to the 2.2 branch soon?

@nickbp nickbp deleted the KAFKA-7974-dns-failure-kills-adminclient-thread branch April 14, 2019 23:15
ijuma pushed a commit that referenced this pull request Apr 24, 2019
…olvable (#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()
@ijuma
Copy link
Copy Markdown
Member

ijuma commented Apr 24, 2019

@manderson23 it's done now. I was running the tests locally before doing the push and had ran out of time.

@kpmeen
Copy link
Copy Markdown

kpmeen commented May 19, 2019

I stumbled onto this issue a couple of days ago when trying to run an application in K8s. Any estimates for when a build will be ready with this fix?

@nickbp
Copy link
Copy Markdown
Contributor Author

nickbp commented May 20, 2019

I likewise originally encountered this in a K8s environment. The core issue leading to this bug arising was the result of DNS requests round-robining between CoreDNS instances which were individually eventually consistent but as a group not consistent at all, with entries popping in and out of existence as pods were brought up and some but not all CoreDNS instances knew about them at any given moment.

To avoid the inconsistency issues, I was able to use the following workarounds in my K8s 1.13.x environment. With these workarounds a patch to the Kafka clients was not required to avoid the issue. YMMV:

  • kubectl edit service kube-dns --namespace=kube-system: Edit the sessionAffinity value to be ClientIP rather than None. This avoids rotating DNS lookups between CoreDNS instances, which can result in host records appearing to pop in and out of existence. Confirm the change with kubectl get service kube-dns --namespace=kube-system -o yaml
  • kubectl edit configmap coredns --namespace=kube-system: Edit the Corefile entry to have cache 30. By default this is 300. This ensures that DNS caches are updated every 30s rather than 5 minutes, which should result in pods deploying more quickly after having set sessionAffinity=ClientIP in the previous step. Confirm the change with kubectl get configmap coredns --namespace=kube-system -o yaml

@kpmeen
Copy link
Copy Markdown

kpmeen commented May 21, 2019

@nickbp interesting...

I'll give that a go too. My workaround was to implement a simple, blocking, utility function that will attempt to resolve the host(s) for a finite duration n. If the host(s) cannot be resolved within the configured time, the application is terminated. It works for me at least.

@nickbp
Copy link
Copy Markdown
Contributor Author

nickbp commented May 21, 2019

Yeah the issue I was seeing was that after a utility check found the entry to be resolvable, it could still then become "unresolvable" due to dns lookup round-robining to a dns instance that didn't know about the host yet

pengxiaolong pushed a commit to pengxiaolong/kafka that referenced this pull request Jun 14, 2019
…olvable (apache#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()
@bertdewolf
Copy link
Copy Markdown

Is this available in any release? I can't seem to find it...

ZIDAZ pushed a commit to linkedin/kafka that referenced this pull request Jun 14, 2022
…olvable (apache#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()
ZIDAZ pushed a commit to linkedin/kafka that referenced this pull request Jun 14, 2022
* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987)

Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode.

Co-authored-by: Edoardo Comar <[email protected]>
Co-authored-by: Mickael Maison <[email protected]>

Reviewers: Rajini Sivaram <[email protected]>

* [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116)

TICKET = N/A
LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now
only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry
picked.
EXIT_CRITERIA = when the commit 131d475 is picked from upstream:
KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902

* Ignoring the failed tests (#188)

[LI-HOTFIX] Ignoring the failed tests (#188)

TICKET = N/A
LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org that used to resolve to more than 1 IPv4 addresses are not only resolving to 1 IPv4 address.
The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase.
EXIT_CRITERIA = This hotfix can be removed in the next major version rebase

* Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()

* KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644)

This applies to the producer, consumer, admin client, connect worker
and inter broker communication.

`ClientDnsLookup.DEFAULT` has been deprecated and a warning
will be logged if it's explicitly set in a client config.

Reviewers: Mickael Maison <[email protected]>, Ismael Juma <[email protected]>

* Update NetworkClient usage in SSLNetworkClient

* [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329)

The PR #228 attempted to resolve provided boostrap servers when
the metadata is exceeding a staleness threshold.  The config is coverred
both on producer and consumer, and default behavior without configured
value is setting timeout to Long.MAX_VALUE.

However, cruise-control is affected by the behavior as it implements a
similar mechanism on its own and directly uses of NetworkClient. The code
would fail if empty bootstrap server is passed to NetworkClient, which
is the case for internal use of CC.

To resolve this, this patch aims to make default value as -1, and omit
the code path referencing bootstrap server when we see -1.

EXIT_CRITERIA = When #228 is ejected

Co-authored-by: Xiongqi Wu <[email protected]>
Co-authored-by: Edoardo Comar <[email protected]>
Co-authored-by: Lucas Wang <[email protected]>
Co-authored-by: Nicholas Parker <[email protected]>
Co-authored-by: Badai Aqrandista <[email protected]>
Co-authored-by: Joseph (Ting-Chou) Lin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants