[LI-HOTFIX] Resolve the bootstrap server when cluster metadata hasn't refreshed for a long time by kehuum · Pull Request #228 · linkedin/kafka

kehuum · 2021-12-21T18:39:43Z

[LI-HOTFIX] Resolve the bootstrap server when cluster metadata hasn't been refreshed for a long time

This patch adds a config, li.client.cluster.metadata.expire.time.ms, which controls the maximum time cluster metadata can remain unchanged. On NetworkClient.poll, if this timeout has been reached and the client has tried half of the nodes in the original cached node set and failed, it resolves the bootstrap servers again and uses the newly resolved nodes to pick a leastLoadedNode to send the updateMetadataRequest to.
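
The re-resolution condition described above can be sketched as a small standalone check. This is a minimal illustration and not the code from this patch: the class `StaleMetadataPolicy` and its methods are hypothetical, and reading "tried half of the nodes" as "at least half of the cached nodes have failed" is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of NetworkClient: tracks how long cluster
// metadata has stayed unchanged and how many cached nodes have failed.
final class StaleMetadataPolicy {
    private final long expireTimeMs;      // value of li.client.cluster.metadata.expire.time.ms
    private final int cachedNodeCount;    // size of the original cached node set
    private final Set<String> failedNodeIds = new HashSet<>();
    private long lastMetadataChangeMs;

    StaleMetadataPolicy(long expireTimeMs, int cachedNodeCount, long nowMs) {
        this.expireTimeMs = expireTimeMs;
        this.cachedNodeCount = cachedNodeCount;
        this.lastMetadataChangeMs = nowMs;
    }

    // Called when a metadata update actually changes the cluster view.
    void onMetadataChanged(long nowMs) {
        lastMetadataChangeMs = nowMs;
        failedNodeIds.clear();
    }

    // Called when a metadata request to a cached node fails.
    void onNodeFailed(String nodeId) {
        failedNodeIds.add(nodeId);
    }

    // True only when both conditions hold: the metadata has been unchanged for
    // at least the expire time, and (assumption) at least half of the cached
    // nodes have been tried and failed.
    boolean shouldReResolveBootstrap(long nowMs) {
        boolean expired = nowMs - lastMetadataChangeMs >= expireTimeMs;
        boolean halfFailed = failedNodeIds.size() * 2 >= cachedNodeCount;
        return expired && halfFailed;
    }
}
```

Requiring both conditions means a client whose cached nodes still answer never re-resolves the bootstrap servers, even if its metadata has been unchanged for longer than the timeout.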

This is to avoid the following two scenarios:

1. A consumer has been idle for a long time and the whole cluster has been swapped. In this case, all the cached nodes are invalid and the bootstrap servers need to be resolved again.
2. A consumer hasn't refreshed metadata for a long time and some brokers in the cluster have been moved to another cluster; the client randomly picks a moved broker to send the metadata request to and gets a response for a different cluster. In this case, we simply reject the stale metadata response and resolve the bootstrap servers when the conditions are met.
TICKET =
LI_DESCRIPTION = LIKAFKA-40759,
EXIT_CRITERIA = MANUAL this is not going to be merged with upstream
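
The description does not say how a metadata response is recognized as belonging to a different cluster. One plausible check, shown here purely as an assumption, compares the cluster ID the client already knows with the one carried in the response. The class and method names below are hypothetical.

```java
// Hypothetical helper illustrating the second scenario: decide whether a
// metadata response should be rejected because it comes from another cluster.
final class ClusterIdCheck {
    private ClusterIdCheck() { }

    // Assumption: the cluster ID is the signal used to detect a moved broker.
    static boolean isFromDifferentCluster(String knownClusterId, String responseClusterId) {
        // Before the first successful response there is nothing to compare against.
        if (knownClusterId == null) {
            return false;
        }
        return !knownClusterId.equals(responseClusterId);
    }
}
```

Under this sketch, a rejected response would leave the cached metadata untouched, so the staleness timer keeps running until the bootstrap servers are resolved again.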

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)


The PR linkedin#228 attempted to resolve the provided bootstrap servers when the metadata exceeds a staleness threshold. The config is covered on both producer and consumer, and the default behavior without a configured value is setting the timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses NetworkClient. The code would fail if an empty bootstrap server list is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make the default value -1, and omit the code path referencing the bootstrap servers when we see -1. EXIT_CRITERIA = When linkedin#228 is ejected
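
The effect of the -1 default can be sketched as a guard placed in front of any code that touches the bootstrap servers. This is an illustration under stated assumptions, not the patch itself: the class `MetadataExpiryGuard`, the constant `DISABLED`, and the method `bootstrapToResolve` are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical guard: with the sentinel value -1, the bootstrap server list is
// never consulted, so a caller that passes an empty list is unaffected.
final class MetadataExpiryGuard {
    static final long DISABLED = -1L;

    private MetadataExpiryGuard() { }

    // Returns the servers to re-resolve, or an empty list when the feature is off.
    static List<String> bootstrapToResolve(long expireTimeMs, List<String> bootstrapServers) {
        if (expireTimeMs == DISABLED) {
            // Feature disabled: skip every code path that references bootstrap servers.
            return Collections.emptyList();
        }
        if (bootstrapServers == null || bootstrapServers.isEmpty()) {
            // Only reachable when the feature was explicitly enabled.
            throw new IllegalArgumentException("bootstrap servers are required when the expire time is set");
        }
        return bootstrapServers;
    }
}
```

With a default of Long.MAX_VALUE the timeout would in practice never fire, but the bootstrap-related code would still be reachable; a sentinel makes the disabled state explicit.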

This should be a fix-up to linkedin#316. The PR linkedin#228 attempted to resolve the provided bootstrap servers when the metadata exceeds a staleness threshold. The config is covered on both producer and consumer, and the default behavior without a configured value is setting the timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses NetworkClient. The code would fail if an empty bootstrap server list is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make the default value -1, and omit the code path referencing the bootstrap servers when we see -1. EXIT_CRITERIA = When linkedin#316 is ejected


…330) This should be a fix-up to #316, and the same patch is also made to the 2.4 branch in #329. The PR #228 attempted to resolve provided bootstrap servers when the metadata is exceeding a staleness threshold. The config is covered both on producer and consumer, and default behavior without configured value is setting the timeout to `Long.MAX_VALUE`. However, `cruise-control` is affected by the behavior as it implements a similar mechanism on its own and directly uses `NetworkClient`. The code would fail if an empty bootstrap server list is passed to `NetworkClient`, which is the case for internal use of CC. To resolve this, this patch aims to make the default value -1, and omit the code path referencing bootstrap server when we see -1. EXIT_CRITERIA = When #316 is ejected


* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987)
  Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode.
  Co-authored-by: Edoardo Comar
  Co-authored-by: Mickael Maison
  Reviewers: Rajini Sivaram
* [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116)
  TICKET = N/A
  LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry picked.
  EXIT_CRITERIA = when the commit 131d475 is picked from upstream: KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902
* Ignoring the failed tests (#188)
  [LI-HOTFIX] Ignoring the failed tests (#188)
  TICKET = N/A
  LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org, which used to resolve to more than 1 IPv4 address, is now only resolving to 1 IPv4 address. The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase.
  EXIT_CRITERIA = This hotfix can be removed in the next major version rebase
* Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305)
* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting
* Resolve host only when currentAddress() is called
  Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.
* Add Javadoc to ClusterConnectionStates.connecting()
* KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644)
  This applies to the producer, consumer, admin client, connect worker and inter broker communication. `ClientDnsLookup.DEFAULT` has been deprecated and a warning will be logged if it's explicitly set in a client config.
  Reviewers: Mickael Maison, Ismael Juma
* Update NetworkClient usage in SSLNetworkClient
* [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329)
  The PR #228 attempted to resolve the provided bootstrap servers when the metadata exceeds a staleness threshold. The config is covered on both producer and consumer, and the default behavior without a configured value is setting the timeout to Long.MAX_VALUE. However, cruise-control is affected by the behavior as it implements a similar mechanism on its own and directly uses NetworkClient. The code would fail if an empty bootstrap server list is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch aims to make the default value -1, and omit the code path referencing the bootstrap servers when we see -1.
  EXIT_CRITERIA = When #228 is ejected

Co-authored-by: Xiongqi Wu
Co-authored-by: Edoardo Comar
Co-authored-by: Lucas Wang
Co-authored-by: Nicholas Parker
Co-authored-by: Badai Aqrandista
Co-authored-by: Joseph (Ting-Chou) Lin

kehuum requested review from lmr3796 and xiowu0 December 21, 2021 18:39

xiowu0 approved these changes Dec 21, 2021

View reviewed changes

kehuum merged commit ec1d353 into 2.4-li Dec 21, 2021

This was referenced Apr 6, 2022

[LI-HOTFIX] Bypass cluster metadata auto refresh code path by default #329

Merged

[LI-FIXUP] Bypass cluster metadata auto refresh code path by default #330

Merged

kehuum merged 1 commit into 2.4-li from 24MDFix
