
[LI-HOTFIX] Bypass cluster metadata auto refresh code path by default#329

Merged
lmr3796 merged 1 commit into linkedin:2.4-li from lmr3796:fix-metadta-2.4
Apr 6, 2022

Conversation

@lmr3796 lmr3796 commented Apr 6, 2022

The same patch is also made to the 3.0 branch in #330.

The PR #228 attempted to resolve provided bootstrap servers when
the metadata is exceeding a staleness threshold. The config is covered
both on producer and consumer, and default behavior without configured
value is setting the timeout to Long.MAX_VALUE.

However, cruise-control is affected by the behavior as it implements a
similar mechanism on its own and directly uses NetworkClient. The code
would fail if an empty bootstrap server list is passed to NetworkClient, which
is the case for internal use of CC.

To resolve this, this patch aims to make the default value -1, and omit
the code path referencing bootstrap server when we see -1.

EXIT_CRITERIA = When #228 is ejected
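
The gating described above can be sketched as follows. This is a minimal illustration, not the actual patch: the method name `shouldUpdateClusterMetadataFromBootstrap` and the field `maxClusterMetadataExpireTimeMs` appear in the reviewed diff, while the class name `BootstrapRefreshGate`, the field `lastSuccessfulRefreshMs`, and the exact staleness comparison are assumptions made for the example.

```java
/** Hypothetical stand-in for the staleness check; not the actual Kafka class. */
class BootstrapRefreshGate {
    /** Sentinel default described above: the bootstrap refresh path is disabled. */
    static final long DISABLED = -1L;

    private final long maxClusterMetadataExpireTimeMs;
    private final long lastSuccessfulRefreshMs; // assumed bookkeeping field

    BootstrapRefreshGate(long maxClusterMetadataExpireTimeMs, long lastSuccessfulRefreshMs) {
        this.maxClusterMetadataExpireTimeMs = maxClusterMetadataExpireTimeMs;
        this.lastSuccessfulRefreshMs = lastSuccessfulRefreshMs;
    }

    /** True only when the feature is enabled AND the metadata is older than the threshold. */
    synchronized boolean shouldUpdateClusterMetadataFromBootstrap(long nowMs) {
        return this.maxClusterMetadataExpireTimeMs > 0
            && nowMs - this.lastSuccessfulRefreshMs > this.maxClusterMetadataExpireTimeMs;
    }

    public static void main(String[] args) {
        BootstrapRefreshGate disabled = new BootstrapRefreshGate(DISABLED, 0L);
        BootstrapRefreshGate enabled = new BootstrapRefreshGate(5_000L, 0L);
        System.out.println(disabled.shouldUpdateClusterMetadataFromBootstrap(60_000L)); // false
        System.out.println(enabled.shouldUpdateClusterMetadataFromBootstrap(60_000L));  // true
        System.out.println(enabled.shouldUpdateClusterMetadataFromBootstrap(1_000L));   // false
    }
}
```

With the sentinel checked first, a caller that never configures the timeout never reaches any code that reads the bootstrap server list.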

@lmr3796 lmr3796 requested a review from kehuum April 6, 2022 02:35
gitlw commented Apr 6, 2022

Can you please add a reference to the relevant CC code?
Apart from making this change inside kafka, is it possible for CC to deprecate its own approach and rely on the mechanism in the kafka client?

lmr3796 commented Apr 6, 2022

@gitlw :

> Can you please add a reference to the relevant CC code?

The MetadataClient in cruise control is the class doing metadata refresh.
https://github.com/linkedin/cruise-control/blob/migrate_to_kafka_2_5/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/common/MetadataClient.java

> Apart from making this change inside kafka, is it possible for CC to deprecate its own approach and rely on the mechanism in the kafka client?

The problem addressed by this change is specific to the LinkedIn Kafka clients. Open-source CC points to the vanilla Apache Kafka clients, which do not have this mechanism yet, and open-source CC needs to work on the common surface of the Apache and LinkedIn Kafka clients.

Changing CC's dependency strategy to actually use the LinkedIn Kafka client is another story. As CC is also used a lot by the community, I feel it would be hard to make such a huge shift in strategy.

          */
         public synchronized boolean shouldUpdateClusterMetadataFromBootstrap(long nowMs) {
    -        return (this.nodesTriedSinceLastSuccessfulRefresh >= 1 &&
    +        return this.maxClusterMetadataExpireTimeMs > 0 &&

lgtm. Seems CC is using Metadata directly instead of MetadataUpdater, is that the reason it cannot override this method to false always?

lmr3796 commented Apr 6, 2022

@kehuum It's actually a problem with DefaultMetadataUpdater in the client. DefaultMetadataUpdater is not extensible because it is tangled with the whole NetworkClient (note that it is a non-static inner class, not a static nested one).
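
As a generic Java illustration of that point (all class names here are hypothetical and none of them are Kafka's), a non-static inner class can only be subclassed with an enclosing instance, and even then the outer object keeps using the updater it built for itself:

```java
/** Hypothetical outer class that wires up its own inner updater. */
class OuterClient {
    private final long maxAgeMs;
    private final InnerUpdater updater;

    OuterClient(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
        // The outer class builds its own updater; callers have no hook to swap it.
        this.updater = new InnerUpdater();
    }

    boolean refreshDue(long ageMs) {
        return updater.isDue(ageMs);
    }

    /** Non-static inner class: reads the enclosing instance's state directly. */
    class InnerUpdater {
        boolean isDue(long ageMs) {
            return ageMs > maxAgeMs;
        }
    }
}

/** Subclassing requires an enclosing instance, supplied via the qualified super call. */
class NeverDueUpdater extends OuterClient.InnerUpdater {
    NeverDueUpdater(OuterClient outer) {
        outer.super();
    }

    @Override
    boolean isDue(long ageMs) {
        return false;
    }
}

class InnerClassDemo {
    public static void main(String[] args) {
        OuterClient client = new OuterClient(100L);
        NeverDueUpdater override = new NeverDueUpdater(client);
        System.out.println(override.isDue(500L));    // false: the subclass behaves as intended
        System.out.println(client.refreshDue(500L)); // true: the client still uses its own updater
    }
}
```

In this sketch the override has no effect on the client, which is why a switch inside the existing check is the less invasive route.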

@lmr3796 lmr3796 merged commit 9844ceb into linkedin:2.4-li Apr 6, 2022
lmr3796 added a commit that referenced this pull request Apr 6, 2022
…330)

This should be a fix-up to #316, and the same patch is also made to the 2.4 branch in #329.

The PR #228 attempted to resolve provided bootstrap servers when
the metadata is exceeding a staleness threshold.  The config is covered
both on producer and consumer, and default behavior without configured
value is setting the timeout to `Long.MAX_VALUE`.

However, `cruise-control` is affected by the behavior as it implements a
similar mechanism on its own and directly uses `NetworkClient`. The code
would fail if an empty bootstrap server list is passed to `NetworkClient`, which
is the case for internal use of CC.

To resolve this, this patch aims to make the default value -1, and omit
the code path referencing bootstrap server when we see -1.

EXIT_CRITERIA = When #316 is ejected
@lmr3796 lmr3796 deleted the fix-metadta-2.4 branch April 6, 2022 21:12
gitlw commented Apr 6, 2022

Sorry that my comment is a little late. I feel a more proper fix is to check whether the bootstrapServers are empty, and only set the forceClusterMetadataUpdateFromBootstrap to true when the bootstrapServers are not empty.

gitlw commented Apr 6, 2022

It may need some refactoring of the current implementation to make the proposed change above.

lmr3796 commented Apr 7, 2022

@gitlw the semantics of the forceClusterMetadataUpdateFromBootstrap flag are an extra condition to enforce a refresh even when it's not due, but what we want here is a way to skip the refresh even when it's due.

Do you think it'd be better if I perform the empty bootstrap server check here at line 981, as a precondition before checking whether the refresh is due?

i.e., change

        if (this.metadataUpdater.isUpdateClusterMetadataDue(now)) {
            // Resolve & refresh
        }

into

        if (!this.bootstrapServers.isEmpty() && this.metadataUpdater.isUpdateClusterMetadataDue(now)) {
            // Resolve & refresh
        }

gitlw commented Apr 7, 2022

@lmr3796

  1. I feel the isUpdateClusterMetadataDue should be renamed to something like isUpdateFromBootstrapDue.
  2. Then, either put the this.bootstrapServers.isEmpty() check into the if condition as you suggested, or put it inside the isUpdateFromBootstrapDue method. I slightly prefer the latter, but both look good to me.

lmr3796 commented Apr 7, 2022

@gitlw ,

After revisiting this, here are some points I'd like input on.

With the newly proposed approach, if I check for an empty bootstrap server list to bypass the code path, then a (likely erroneous) config that provides a max age/timeout with an empty bootstrap server list would be silently swallowed, which seems bad to me.

As for the approach in this PR

  1. The -1 serves as a flag to disable the feature.
  2. If bootstrap server is not provided but a positive timeout is set, which indicates user wants the refresh feature, there would be an error.
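
The difference between the two policies can be sketched side by side. This is a hedged illustration only: the class and method names, the `ageMs` parameter, and the use of `IllegalStateException` as the "error" are all assumptions, and the real failure in the client may surface differently.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical comparison of the two policies discussed in this thread. */
class RefreshPolicySketch {

    /** Alternative proposal: silently skip the refresh whenever no bootstrap servers are known. */
    static boolean dueWithEmptinessCheck(List<String> bootstrapServers, long maxAgeMs, long ageMs) {
        return !bootstrapServers.isEmpty() && ageMs > maxAgeMs;
    }

    /** Approach in this PR: a non-positive timeout (default -1) disables the feature. */
    static boolean dueWithSentinel(List<String> bootstrapServers, long maxAgeMs, long ageMs) {
        if (maxAgeMs <= 0) {
            return false; // disabled: the bootstrap servers are never consulted
        }
        if (ageMs <= maxAgeMs) {
            return false; // enabled, but the metadata is still fresh
        }
        if (bootstrapServers.isEmpty()) {
            // Stand-in for the error: the feature was requested without bootstrap servers.
            throw new IllegalStateException("refresh enabled but no bootstrap servers configured");
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        List<String> some = Arrays.asList("broker-1:9092"); // hypothetical address

        System.out.println(dueWithEmptinessCheck(none, 5_000L, 60_000L)); // false: misconfiguration hidden
        System.out.println(dueWithSentinel(none, -1L, 60_000L));          // false: explicitly disabled
        System.out.println(dueWithSentinel(some, 5_000L, 60_000L));       // true
        try {
            dueWithSentinel(none, 5_000L, 60_000L);
        } catch (IllegalStateException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}
```

The first method returns false for a positive timeout with no bootstrap servers, which is the silent case described above; the second surfaces the same configuration as an error.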

ZIDAZ pushed a commit that referenced this pull request Jun 14, 2022
…#329)

ZIDAZ pushed a commit that referenced this pull request Jun 14, 2022
* KAFKA-6863 Kafka clients should try to use multiple DNS resolved IP (apache#4987)

Implementation of KIP-302: Based on the new client configuration `client.dns.lookup`, a NetworkClient can use InetAddress.getAllByName to find all IPs and iterate over them when they fail to connect. Only uses either IPv4 or IPv6 addresses similar to the default mode.

Co-authored-by: Edoardo Comar
Co-authored-by: Mickael Maison

Reviewers: Rajini Sivaram

* [LI-HOTFIX] Ignore the failed test ClusterConnectionStatesTest#testMultipleIPsWithUseAll (#116)

TICKET = N/A
LI_DESCRIPTION = The test fails since the domain kafka.apache.org used to return 3 IPs and is now
only returning two IPs. Furthermore, the upstream fix identified below cannot be cleanly cherry
picked.
EXIT_CRITERIA = when the commit 131d475 is picked from upstream:
KAFKA-12193: Re-resolve IPs after a client disconnects apache#9902

* Ignoring the failed tests (#188)

[LI-HOTFIX] Ignoring the failed tests (#188)

TICKET = N/A
LI_DESCRIPTION = Several tests are failing since the domain kafka.apache.org, which used to resolve to more than 1 IPv4 address, is now only resolving to 1 IPv4 address.
The upstream code has overhauled the ClusterConnectionStatesTest. We are simply ignoring these tests for now, and will get the new logic from upstream after a major version rebase.
EXIT_CRITERIA = This hotfix can be removed in the next major version rebase

* Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable (apache#6305)

* Fix for KAFKA-7974: Avoid calling disconnect() when not connecting

* Resolve host only when currentAddress() is called

Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state.
Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.

* Add Javadoc to ClusterConnectionStates.connecting()

* KAFKA-9313: Set `use_all_dns_ips` as the new default for `client.dns.lookup` (KIP-602) (apache#8644)

This applies to the producer, consumer, admin client, connect worker
and inter broker communication.

`ClientDnsLookup.DEFAULT` has been deprecated and a warning
will be logged if it's explicitly set in a client config.

Reviewers: Mickael Maison, Ismael Juma

* Update NetworkClient usage in SSLNetworkClient

* [LI-HOTFIX] Bypass cluster metadata auto refresh code path by default (#329)

The PR #228 attempted to resolve provided bootstrap servers when
the metadata is exceeding a staleness threshold. The config is covered
both on producer and consumer, and default behavior without configured
value is setting the timeout to Long.MAX_VALUE.

However, cruise-control is affected by the behavior as it implements a
similar mechanism on its own and directly uses NetworkClient. The code
would fail if an empty bootstrap server list is passed to NetworkClient, which
is the case for internal use of CC.

To resolve this, this patch aims to make the default value -1, and omit
the code path referencing bootstrap server when we see -1.

EXIT_CRITERIA = When #228 is ejected

Co-authored-by: Xiongqi Wu
Co-authored-by: Edoardo Comar
Co-authored-by: Lucas Wang
Co-authored-by: Nicholas Parker
Co-authored-by: Badai Aqrandista
Co-authored-by: Joseph (Ting-Chou) Lin