[LI-HOTFIX] Resolve the bootstrap server when cluster metadata hasn't been refreshed for a long time by kehuum · Pull Request #316 · linkedin/kafka

kehuum · 2022-03-24T18:43:00Z

This patch adds a config, li.client.cluster.metadata.expire.time.ms, which controls the maximum time cluster metadata can remain unchanged. On NetworkClient.poll, if this timeout has been reached and the client has tried half of the nodes in the original cached node set and failed, it resolves the bootstrap servers again and uses the newly resolved nodes to pick a leastLoadedNode to which it sends the updateMetadataRequest.
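The two conditions above can be sketched roughly as follows. This is not the code from the patch: the class, method, and parameter names are hypothetical, and reading "tried half of the nodes" as "at least half of the cached nodes have failed" is an assumption.

```java
import java.util.Set;

/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class BootstrapReresolveSketch {
    private BootstrapReresolveSketch() {}

    /**
     * Returns true when the client should resolve the bootstrap servers again.
     *
     * @param nowMs                current time in milliseconds
     * @param lastMetadataChangeMs last time the cluster metadata changed
     * @param expireTimeMs         value of li.client.cluster.metadata.expire.time.ms
     * @param cachedNodeIds        ids of the nodes in the original cached node set
     * @param failedNodeIds        ids of the nodes the client tried and failed to reach
     */
    static boolean shouldReresolveBootstrap(long nowMs,
                                            long lastMetadataChangeMs,
                                            long expireTimeMs,
                                            Set<String> cachedNodeIds,
                                            Set<String> failedNodeIds) {
        // Condition 1: the metadata has been unchanged for at least the configured time.
        boolean expired = nowMs - lastMetadataChangeMs >= expireTimeMs;
        if (!expired) {
            return false;
        }
        // Condition 2 (assumed interpretation): at least half of the cached nodes
        // have been tried and failed. An empty cache trivially satisfies this.
        long failedCached = cachedNodeIds.stream()
                .filter(failedNodeIds::contains)
                .count();
        return failedCached * 2 >= cachedNodeIds.size();
    }
}
```

With an expiry of `Long.MAX_VALUE`, the first condition effectively never holds, so the sketch never asks for re-resolution.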

This avoids the following two scenarios:

1. The consumer has been idle for a long time and the whole cluster has been swapped. In this case, all the cached nodes are invalid and the bootstrap servers need to be resolved again.
2. The consumer hasn't refreshed metadata for a long time and some brokers in the cluster have been moved to another cluster. The client randomly picks a moved broker to send the metadata (md) request to and gets a response for a different cluster. In this case, we simply reject the stale md response and resolve the bootstrap servers when the conditions are met.

TICKET =
LI_DESCRIPTION = LIKAFKA-40759,
EXIT_CRITERIA = MANUAL this is not going to be merged with upstream

Co-authored-by: Ke Hu [email protected]
(cherry picked from commit ec1d353)


Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)


… been refreshed for a long time (linkedin#316) This patch adds a config, li.client.cluster.metadata.expire.time.ms, which controls the maximum time cluster metadata can remain unchanged. On NetworkClient.poll, if this timeout has been reached and the client has tried half of the nodes in the original cached node set and failed, it resolves the bootstrap servers again and uses the newly resolved nodes to pick a leastLoadedNode to which it sends the updateMetadataRequest. This avoids the following two scenarios: (1) the consumer has been idle for a long time and the whole cluster has been swapped, in which case all the cached nodes are invalid and the bootstrap servers need to be resolved again; (2) the consumer hasn't refreshed metadata for a long time and some brokers in the cluster have been moved to another cluster, so the client randomly picks a moved broker to send the md request to and gets a response for a different cluster, in which case we simply reject the stale md response and resolve the bootstrap servers when the conditions are met. TICKET = LI_DESCRIPTION = LIKAFKA-40759, EXIT_CRITERIA = MANUAL this is not going to be merged with upstream (cherry picked from commit ec1d353) Co-authored-by: Ke Hu <[email protected]> Co-authored-by: Prachi Khobragade <[email protected]>

This should be a fix-up to linkedin#316. The PR linkedin#228 attempted to resolve the provided bootstrap servers when the metadata exceeds a staleness threshold. The config is covered on both the producer and the consumer, and the default behavior without a configured value is to set the timeout to Long.MAX_VALUE. However, cruise-control is affected by this behavior, as it implements a similar mechanism on its own and uses NetworkClient directly. The code would fail if an empty bootstrap server list is passed to NetworkClient, which is the case for internal use of CC. To resolve this, this patch makes the default value -1 and omits the code path referencing the bootstrap servers when we see -1. EXIT_CRITERIA = When linkedin#316 is ejected

…330) This should be a fix-up to #316, and the same patch is also made to the 2.4 branch in #329. The PR #228 attempted to resolve provided bootstrap servers when the metadata is exceeding a staleness threshold. The config is covered both on producer and consumer, and default behavior without configured value is setting the timeout to `Long.MAX_VALUE`. However, `cruise-control` is affected by the behavior as it implements a similar mechanism on its own and directly uses `NetworkClient`. The code would fail if an empty bootstrap server list is passed to `NetworkClient`, which is the case for internal use of CC. To resolve this, this patch aims to make the default value -1, and omit the code path referencing bootstrap server when we see -1. EXIT_CRITERIA = When #316 is ejected
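The bypass described in the fix-up can be pictured with a guard like the one below. It is a sketch under assumptions, not the code from #330: the class, method, and constant names are hypothetical, and the `IllegalStateException` only stands in for the unspecified failure that an empty bootstrap server list would cause.

```java
import java.util.List;

/** Hypothetical guard; names are illustrative and not taken from the fix-up patch. */
final class MetadataExpiryGuardSketch {
    /** Default value after the fix-up: the auto-refresh code path is skipped. */
    static final long DISABLED = -1L;

    private MetadataExpiryGuardSketch() {}

    /**
     * Decides whether the bootstrap servers may be resolved again.
     *
     * @param expireTimeMs        value of li.client.cluster.metadata.expire.time.ms
     * @param bootstrapServers    configured bootstrap servers (may be empty, as with cruise-control)
     * @param expiryConditionsMet whether the staleness conditions currently hold
     */
    static boolean mayResolveBootstrap(long expireTimeMs,
                                       List<String> bootstrapServers,
                                       boolean expiryConditionsMet) {
        // With the default of -1 the bootstrap list is never inspected, so callers
        // that pass an empty list to NetworkClient are unaffected.
        if (expireTimeMs == DISABLED) {
            return false;
        }
        // Assumed stand-in for the failure mentioned in the description.
        if (bootstrapServers.isEmpty()) {
            throw new IllegalStateException("no bootstrap servers to resolve");
        }
        return expiryConditionsMet;
    }
}
```

Under this sketch, a client that leaves the config unset takes the first branch and behaves as it did before #316, while a client that sets a non-negative timeout opts in to re-resolution.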


kehuum and others added 3 commits March 24, 2022 11:51

resolve merge conflicts and fix tests

67d3a55

fix md resolve behavior during bootstrap

d0f9f5b

kehuum force-pushed the 3.0-md-fix branch from 7839501 to d0f9f5b Compare March 24, 2022 18:51

kehuum added 3 commits March 24, 2022 11:57

remove extra metadata_topic_expire_ms

90bb9a2

remove unused import

8935993

fix testResolveBootstrapAfterClusterMetadataTimeout

09ad26e

xiowu0 approved these changes Mar 24, 2022


kehuum merged commit ff434ae into 3.0-li-dev7 Mar 25, 2022

lmr3796 mentioned this pull request Apr 6, 2022

[LI-FIXUP] Bypass cluster metadata auto refresh code path by default #330

Merged

3 tasks

lmr3796 deleted the 3.0-md-fix branch June 20, 2023 19:35

kehuum merged 6 commits into 3.0-li-dev7 from 3.0-md-fix
