[KIP-848] Removed partition.assignment.strategy configuration for the new consumer protocol#4993
Conversation
|
🎉 All Contributor License Agreements have been signed. Ready to merge. |
03b5642 to
c8a205e
Compare
Emanuele Sabellico (emasab)
left a comment
There was a problem hiding this comment.
First round about implementation changes
| /* When users starts setting properties of the new protocol, | ||
| * they can only use incremental_assign/unassign. */ | ||
| if (rk->rk_conf.group_protocol == RD_KAFKA_GROUP_PROTOCOL_CONSUMER) | ||
| rk->rk_conf.partition_assignors_cooperative = rd_true; |
There was a problem hiding this comment.
This if can be removed completely, together with the partition_assignors_cooperative field. In rd_kafka_cgrp_rebalance_protocol we always return RD_KAFKA_REBALANCE_PROTOCOL_COOPERATIVE when subscribed once.
There was a problem hiding this comment.
When finalizing the configuration we should return an error if setting the partition.assignment.strategy together with group.protocol=consumer. So people that are upgrading can remove the configuration value.
In the tests when setting partition.assignment.strategy we can have a function that sets the value if classic or sets the group.remote.assignor to range or uniform depending on the passed value, so the test would be similar.
There was a problem hiding this comment.
Doing it in a new PR with other properties.
c8a205e to
9c48987
Compare
Emanuele Sabellico (emasab)
left a comment
There was a problem hiding this comment.
First comments are about the implementation changes
| goto fail; | ||
| } | ||
|
|
||
| if (!rk->rk_conf.group_remote_assignor) { |
There was a problem hiding this comment.
About returning an error when the user sets partition.assignment.strategy together with the group.protocol=consumer (and for other configuration properties) are you doing it in a different PR?
There was a problem hiding this comment.
Yes. Doing that in a separate PR.
Emanuele Sabellico (emasab)
left a comment
There was a problem hiding this comment.
Comments about the tests
0d27939 to
af40c1f
Compare
Emanuele Sabellico (emasab)
left a comment
There was a problem hiding this comment.
LGTM! Thanks Pranav
… new consumer group `consumer` protocol (#4993)
* Fix to remove fetch queue messages that blocked the destroy of rdkafka instances (confluentinc#4724) Circular dependencies from a partition fetch queue message to the same partition blocked the destroy of an instance, that happened in case the partition was removed from the cluster while it was being consumed. Solved by purging internal partition queue, after being stopped and removed, to allow reference count to reach zero and trigger a destroy. Purging internal fetch queue on removing the partition only for the consumer. * Upgrade linux dependencies (confluentinc#4875) * Security upgrade for OpenSSL and Curl, CVEs fixed: OpenSSL - CVE-2024-2511 - CVE-2024-4603 - CVE-2024-4741 - CVE-2024-5535 - CVE-2024-6119 CURL - CVE-2024-8096 - CVE-2024-7264 - CVE-2024-6874 - CVE-2024-6197 * Fix for curl configure failure caused by curl/curl#14373 * Upgrade msvcr140 and vcpkg dependencies (confluentinc#4872) * Bump version for v2.6.1 release (confluentinc#4891) * Fix for SCRAM client final message, the 'r' parameter (confluentinc#4895) must be equal to the server sent nonce, that already contains the client side nonce. librdkafka was incorrectly concatenating the client side nonce again, leading to this fix being made on AK side, released in 3.8.1, with endsWith instead of equals. apache/kafka@0a00456 * Count 'Too Many Requests' as a retriable error (confluentinc#4902) * Upgrade semaphore agent to Ubuntu 22.04 (confluentinc#4909) except the Style check job because it needs clang format 10 Fix to inherit javac path, needed by test 0098 * Socket options are now all set before connection (confluentinc#4893) * [KIP-848] Mock handler and Integration tests passing (confluentinc#4662) Mock handler implementation Rename current consumer protocol from generic to classic Mock handler with automatic or manual assignment More consumer group metadata getters Test helpers Configurable session timeout and HB interval Fix mock handler ListOffsets response LeaderEpoch instead of CurrentLeaderEpoch Integration tests passing with AK trunk Improve documentation and KIP 848 specific mock tests Add mock tests for unknown topic id in metadata request and partial reconciliation Make test 0147 more reliable Fix test 0106 after HB timeout change Exclude test case with AK trunk Rename rd_kafka_buf_write_tags to rd_kafka_buf_write_tags_empty Trivup 0.12.5 can run a KafkaCluster directly with KRaft and AK trunk Trivup 0.12.6 build with a specific commit Trivup 0.12.7 with fixes for AK 3.8.0 and Py 3.12 New version of trivup 0.12.7 to fix an issue with apache/kafka#16464 on AK > 3.8.0 Static group membership mock tests Move test 0147 to a different PR Disable interactive "needsrestart" prompt * Client certificate chain is now sent (confluentinc#4894) * test_read_file can read binary files too * Trivup 0.12.8 * Read certificate CA chain when set using a configuration setter with PEM format. Test that CA with untrusted chain fails authentication. * Test untrusted certificate signed with an intermediate CA * Remove private key and duplicate certs from pem client certificate * Print logs sent as events * Trivup now already inheriths the environment in interactive mode * Use namespace to avoid conflicts on TestEventCb * Client cert callback to check if trusted certificate authorities match with client certificate chain (confluentinc#4900) Failing test: expect the error code that is received when no certificate is sent instead of the one received when it's sent but not trusted. Client cert callback to check if trusted certificate authorities match with client certificate chain. Log a warning when client certificate isn't sent --------- Co-authored-by: trnguyencflt <[email protected]> * Allow to migrate partitions to leaders with same leader epoch or NULL leader epoch (confluentinc#4901) Failing tests including for confluentinc#4796 and confluentinc#4804 Closes confluentinc#4796 and confluentinc#4804 CHANGELOG Fix for the correct expected RPC code in test 0139 Apply same fix to metadata update operation too Don't change rktp state to active when there's no leader but wait it's available to validate it Comment about excluded -1 value * librdkafka v2.6.3 (confluentinc#4934) * ssl: support libssls with no ENGINE implementation (confluentinc#3535) An incorrect assumption is made that libssl is built with support for the (now-deprecated) ENGINE API if it is provided by OpenSSL >= 1.1.0 or LibreSSL. OPENSSL_NO_ENGINE is defined by OpenSSL and all of its forks if the ENGINE API was disabled at compile-time - ensure that the definition of OPENSSL_NO_ENGINE is taken into account when using ENGINE features. * Fix to avoid issuing a warning in case server requests a certificate and client is using SASL authentication only (confluentinc#4936) without any client certificate set * Version changes for v2.8.0 (confluentinc#4946) * chore: update repo by service bot (confluentinc#4463) * removing generated internal project.yml * removing generated public project.yml --------- Co-authored-by: service-bot-app[bot] <189278048+service-bot-app[bot]@users.noreply.github.com> * chore: update repo by service bot (confluentinc#4950) * removing generated internal project.yml * removing generated public project.yml --------- Co-authored-by: service-bot-app[bot] <189278048+service-bot-app[bot]@users.noreply.github.com> * Ubuntu 24.04 and arm64 linux package verification (confluentinc#4954) * Verify Ubuntu 24.04 and arm64 packages * Add Semaphore task for verifying * fixup: Ubuntu 24.04 and arm64 linux package verification(confluentinc#4967) Style fixed check_features.c * Remove installation script with OpenSSL 1.1.1k (confluentinc#4990) as it's not used anymore. Was used for AppVeyor CI builds. * Script to run tests in batches, with several modes and for a variable number of iterations (confluentinc#5002) * Fix assignment lost, on illegal generation, during a commit (confluentinc#4908) Closes: confluentinc#4059. Commits during a rebalance could cause to lose the assignment if the generation id was bumped by second join group request. Solved by not re-joining the group in case an illegal generation error happens during a rebalance. Happening since v1.6.0. * Run all tests pipeline (confluentinc#5004) Semaphore pipeline linked to a task that runs the full test suite with customizable parameters. Contains a promotion to run it automatically on master commits only. * [run-all-tests pipeline] Replace descriptions using double quotes (confluentinc#5006) * Remove single quotes (confluentinc#5007) * fixup: [run-all-tests pipeline] Replace descriptions using double quotes (confluentinc#5006) Remove double quotes * Change cnd_timedwait_abs to take a monotonic clock value as timeout and checking after wakeups if it's been reached, Avoids yielding earlier than requested because of spourious wakeups. Fix flakiness in many tests, especially 0080 * Fix for a minimum latency of 500ms in case of leader change, because of the fetch backoff left from previous broker. Resets the fetch backoff when the partitions joins a new broker. * Fix flakyness in test 0068 due to latency increase applying to all RPCs, including ApiVersions, leading to the timeout happening before the produce request is sent. The error is IN_QUEUE instead of IN_FLIGHT, and the status becomes NOT_PERSISTED instead of POSSIBLY_PERSISTED. Fixed using the mock cluster instead of sockem and applying the latency only to the Produce request. * Fix flakyness in test 0086 Given only a single request can be in-flight, in some cases second request still had not been sent when purging the buffer. A condition in `on_request_sent` allows to wait the second request was sent before purging the buffers, allowing to test the scenario that is expected to test. * Fix for test 0119, remove ACLs in the test that created them to prevent authorization issues when removing all the topics on final cleanup * Fix test 0126 memory leak * Remove brokers that aren't up from mock metadata response. This is in line with KRaft behavior * More tests on fast metadata refresh, ensuring the topic is marked as errored only in case of permanent errors, no error is surfaced to the application unless it's an authorization error and produce requests can continue with the cached metadata * Don't remove a topic from cache before its expiration in case of a temporary error * Don't mark the topic as unknown even when it's already known and we're in the metadata propagation period * Only update partition leaders if the topic has no errors. Similar to Java client logic. Avoids a segmentation fault if the rktp is missing such as in cases of topic deletion and re-creation with same name. * Don't set errors other than TOPIC_AUTHORIZATION_FAILED as permanent topic errors. Fixes test `do_test_fast_metadata_refresh:141: retries` * Fix for issuing a metadata refresh after offsets_for_times call failed an it was needed but not sent. Current leader epoch is now taken from metadata cache instead of from the requested topic partitions to ensure it's correctly validated on the leader. * Connection close debug logs * Filter jmx output for improved test speed * Deprecate `api.version.request`, `api.version.fallback.ms` and `broker.version.fallback` configuration properties. * Fix for test 0084, avoid purging the rk_ops queue when terminating the consumer group * Avoid rk_telemetry.termination_cnd is triggered when no one's listening * Fix flaky test 0059. Given the timestamp is so old it's possible that broker retention timer is triggered deleting all produced records and generating an offset out of range error while querying. * Test 0044: use name of the created topic * Additional test fixes * Fix flakyness in test 0061 skip events generated before the assignment that lead to a test failure * Fix flakyness with metadata propagation in test 0085 * Fix flakyness in test 0102: - avoid full metadata refresh during metadata propagation time after topic creation - Rebalance events order after max.poll.interval.ms exceeded * Fix flakyness in test 0137: don't consider error count during read messages verification. Log warnings for the errors to identify the cause. * Subscription version to avoid stale metadata calls cause an unknown topic or partition error * Use same strategy for updating cache partition metadata as for the `rd_kafka_toppar_t` that is also the same strategy as in Java client, considering: - when topic id changes partitions metadata is taken entirely from the new one. - when leader epoch is greater or equal to the one in the cache, or null (-1), partition metadata is taken from the new one. - when leader epoch is less than the one in the cache, partition metadata remains the same. Also when full metadata is necessary, the cache is used for storing it and for matching topics in the regex, removing the need to store the full metadata result, that is about the same size, but the cache is updated more accurately. * Allow unittests to complete before the timeout when using Valgrind by reducing or skipping tests with a large number of elements * Fix for the case where a metadata refresh enqueued on an unreachable broker prevents refreshing the controller or the coordinator until that broker becomes reachable again * Test exclusions when running against Apache Kafka >= 4.0 * Test 848 consumer group protocol in 4.0-rc0 with librdkafka * Fix flakyness in test 0001. It's due to the metadata propagation period so even if producing to the topic is done, the metadata is not yet propagated to all brokers. As in `test_wait_metadata_update` we wait 1s for propagation. TBD: checking the JMX metrics about metadata propagation to tell exactly when metadata has been propagated. * Fix refcnt preventing final destruction of a broker. This issue happens when a broker is destroying and there are enqueued operations to add buffers to id. If the operations are executed after purging the buffers their aren't destroyed and the refcnt for `rkbuf_rkb` prevents the broker and whole instance from being finally destroyed * Fix flakyness in compaction test * Fix for the case where an assert was failing on `rd_kafka_toppar_delegate_to_leader` * Fix flakyness in test 0105 first the buffer times out and produces a _TIMED_OUT error and then after retrying the delivery callbacks are called with _MSG_TIMED_OUT. * Avoid hinting partitions without requesting metadata for them, as this way it won't be requested later and will skip metadata refresh with a log containing "already being requested". Related issue is solved with subscription versions and tests 0143 and 0146 still pass Replace wait cache hints in case consumer group metadata refresh wasn't sent to request them again. Avoid joining the group if not all topics are in cache but metadata request couldn't be sent * Last AK RC, waiting for a trivup fix on scala version to set the released one * Copyright updates * Disabling test requiring KIP-848 TOPIC_AUTHORIZATION_FAILED handling after AK version upgrade to rc4 * [test 0105] Increase likehood of TxnOffsetCommit failing after the produce requests with a _STATE error. * Speed up fetch restart after a fetch error and an offset validation. Test 0146 is less dependent on timing. * [test 0146] Move start request tracking before first produce in both variations to track first metadata request as well * Added trivup 0.12.10 version (confluentinc#5010) * Purge brokers no longer reported in metadata (confluentinc#4557) * Fix for brokers with different Ids but same host:port The Kafka protocol allows for brokers to have multiple host:port pairs for a given node Id, e.g. see UpdateMetadata request which contains a live_brokers list where each broker Id has a list of host:port pairs. It follows from this that the thing that uniquely identifies a broker is its Id, and not the host:port. The behaviour right now is that if we have multiple brokers with the same host:port but different Ids, the first broker in the list will be updated to have the Id of whatever broker we're looking at as we iterate through the brokers in the Metadata response in rd_kafka_parse_Metadata0(), e.g. Step 1. Broker[0] = Metadata.brokers[0] Step 2. Broker[0] = Metadata.brokers[1] Step 3. Broker[0] = Metadata.brokers[2] A typical situation where brokers have the same host:port pair but differ in their Id is if the brokers are behind a load balancer. The NODE_UPDATE mechanism responsible for this was originally added in b09ff60 ("Handle broker name and nodeid updates (issue confluentinc#343)") as a way to forcibly update a broker hostname if an Id is reused with a new host after the original one was decommissioned. But this isn't how the Java Kafka client works, so let's use the Metadata response as the source of truth instead of updating brokers if we can only match by their host:port. * Fix for purging brokers no longer reported in metadata Brokers that are not in the metadata should be purged from the internal client lists. This helps to avoid annoying "No route to host" and other connection failure messages. Fixes confluentinc#238. * Remove the possibility to modify rkb_nodeid after rkb creation. * Remove locking when accessing rkb_nodeid as it's now set only on creation and not modified anymore * Add new brokers and reassign partitions in the mock cluster * Remove bootstrap broker after receiving learned ones. Wait decommissioned threads after they've stopped instead of on termination. * Handle the _DESTROY_BROKER local error, triggered when a broker is removed without terminating the client. * Test 0151 improved with cluster replacement and cluster roll * Fix for test 0105, do_test_txn_broker_down_in_txn: remove left references when decommissioning a broker and avoid it's selected as leader again or that partitions are delegated to it * Avoid selecting a configured broker as a logical or telemetry broker * Avoid selecting terminating brokers for sending calls or new connections * Remove addressless count and avoid counting the logical broker for the all brokers down error, to send the error in all the cases * Test: verify that decommissining a broker while adding a new one with same id isn't causing problems * Handle the case where current group coordinator is decommissioned without leaving dangling references until the coordinator is changed Test 0151 fix. Given the find coordinator response adds a new broker (not a logical one, a learned one to set into `rkcg_curr_coord`) Removed brokers can be added again even if not present in metadata. This is mock cluster only problem as in a real cluster a broker that is set down cannot be a coordinator. This commit changes the coordinator before setting down a broker that is current coordinator * Remove the decommissioning broker from rk_broker_by_id when starting to avoid multiple instances are added to the list with same id. The decommissioned broker returned by the find can lead to multiple brokers with same id being added. * Don't select logical brokers at all for general purpose request like metadata ones * Schedule an immediate connection when there are no brokers connecting nor requests for connection. When we're in this state, if we respect the sparse connection interval, there's no event that notifies the awaiters at `rd_kafka_brokers_wait_state_change` given it's an interval and not a timer. This is more visible when brokers are decommissioned and there's no broker down even causing the notification. Check it with test `0113` subtest `u_multiple_subscription_changes`. * Remove all configured brokers when there are learned ones. This is to avoid leaving connections to the boostrap brokers that are continued to be used instead of the learned one, adding additional requests that can be later purged by the decommissioning of that last configured broker. * Change test 0075 after removing all bootstrap brokers. This will be reverted with KIP-899 * Remove rk_logical_broker_up_cnt * [test 0151] Simplify the test removing `await_verification`. It's possible to await for the correct list of brokers in all tests given since decommissioning brokers are excluded from that list * Remove broker state from labels * Remove `nodeid` from op * Use `rk_broker_by_id` for learned broker ids to return sorted broker ids * Verify nodename change through a test log interceptor. Used the test log interceptor for test 0151 too --------- Co-authored-by: Emanuele Sabellico <[email protected]> * [KIP-899] Allow producer and consumer clients to rebootstrap (confluentinc#4980) * Avoid re-bootstrapping while terminating * Revert 0121 test changes * librdkafka v2.9.0 * Change tcp_nodelay config to be true (confluentinc#4986) * Change tcp_nodelay config * Update changelog.md * Update CHANGELOG.md for version * [KIP-848] Removed partition.assignment.strategy configuration for the new consumer group `consumer` protocol (confluentinc#4993) * Update minor+patch dependencies (confluentinc#5019) * [KIP-848] Added Regex support for the new consumer group protocol. (confluentinc#4968) [KIP-848] Added Regex support for the new consumer group protocol. * fixup: Use same strategy for updating cache partition metadata (confluentinc#5024) remove topic id check given it's the zero UUID when topic doesn't exist and it invalidates the new check that avoids the transition exists -> not exists before `metadata.propagation.max.ms` * [KIP-848] Improved handling of subscribe and unsubscribe cases, (confluentinc#5015) * Set devel and warning as error flags in CI configure * [KIP-848] Improved handling of subscribe and unsubscribe cases, especially when changing subscription multiple times in a short timeframe, to ensure the coordinator sees all the changes * librdkafka v2.10.0 (confluentinc#5025) * chore: update repo by service bot (confluentinc#5016) * Replace all occurrences of 's1-prod-ubuntu20-04' with 's1-prod-ubuntu24-04' * Replace all occurrences of 's1-prod-ubuntu22-04' with 's1-prod-ubuntu24-04' --------- Co-authored-by: service-bot-app[bot] <foo-bar+service-bot-app[bot]@users.noreply.github.com> * Revert "chore: update repo by service bot (confluentinc#5016)" (confluentinc#5035) This reverts commit d6fe217. * [KIP-848] Error out setting invalid configurations (confluentinc#5028) * [KIP-848] Handle topic authorization failure on consumer group heartbeat (confluentinc#5009) Handle topic authorization failure on consumer group heartbeat, in this case the heartbeat is retried with given interval or with exponential backoff if it's the first heartbeat. An error is returned to the user in this case and isn't sent again for the given heartbeat interval or at least 5 seconds. * [KIP-848] Extend DescribeConfigs and IncrementalAlterConfigs to support GROUP Config (confluentinc#4939) * Added group support in the AlterConfigs API too * Handle the DESTROY_BROKER error on SaslAuthenticate RPC (confluentinc#5036) by not sending an error message to the user as the authentication will be done on the new broker connection. * [KIP-848] Updated KIP-848 from EA to Preview (confluentinc#5037) * KIP 848: Added support for DescribeConsumerGroup for consumer protocol groups (confluentinc#4941) * Add an interval to retry connection to any random broker after interval has passed. (confluentinc#5039) Remove the override to schedule a connection when there's no existing one as it causes too frequent connection retries in case cluster isn't reachable. Remove scheduled connections count as the `sparse_connect_random` interval is again effective in any case. * [KIP-848] Improved KIP-848 related docs for Preview release (confluentinc#5041) * chore(deps): update dependency boto3 to v1.38.2 (confluentinc#5052) * chore(deps): update dependency boto3 to v1.38.3 (confluentinc#5054) * Fix typo in configuration error (confluentinc#4717) * chore(deps): update dependency boto3 to v1.38.5 (confluentinc#5060) * Upgrade Ubuntu to 24.04 * Upgrade to clang-format-18 * Apply style-fix * Fix for a warning about using the variable in input and output * Fix return value `rd_kafka_mock_broker_decommission` to be the same as the implementation. * Style fixes change define position to avoid conflicting auto-format * Fix INTRODUCTION.md ToC by adding explicit anchors (confluentinc#5056) * Script to update max RPC versions supported in AK (confluentinc#5023) * chore(deps): update dependency boto3 to v1.38.6 (confluentinc#5061) * chore(deps): update dependency boto3 to v1.38.7 (confluentinc#5062) * chore(deps): update dependency boto3 to v1.38.8 (confluentinc#5065) * chore(deps): update dependency boto3 to v1.38.9 (confluentinc#5068) * chore(deps): update dependency boto3 to v1.38.10 (confluentinc#5070) * chore(deps): update dependency boto3 to v1.38.11 (confluentinc#5071) * chore(deps): update dependency boto3 to v1.38.13 (confluentinc#5074) * chore(deps): update dependency boto3 to v1.38.14 (confluentinc#5078) * Compatible versions for nuget dependencies (confluentinc#5075) this avoids frequent dependency bot commits. * [internal] Add missing wrlock around rd_kafka_metadata_cache_hint call (confluentinc#5066) --------- Co-authored-by: Marcin Krystianc <[email protected]> * Fix reboostrapping with no bootstrap servers (confluentinc#5067) Handle the case where the bootstrap servers are NULL when re-bootstrapping: Fix ubsan: call to function rd_kafka_topic_info_destroy through pointer to incorrect function type Avoid NULL pointer dereference when rebootstrapping given no `boostrap.servers` is present and brokers were added through `rd_kafka_brokers_add` Closes confluentinc#5057 * Fix for a loop of re-bootstrap sequences (confluentinc#5086) in case the client reaches the all brokers down state. The client continue to select the bootstrap brokers given they have no connection and doesn't re-connect to the learned ones. Fixed by giving a chance to connect to the learned brokers even if there are new ones that never tried to connect. Closes confluentinc#5088 * Fix up to a 1s delay in first produce to new topic (confluentinc#5032) * Add failing test * Fix producev(a) issue where topic metadata is queried for late * Add changelog * Fix compilation issues for older c stds * Fix style check * Address comments - part 1 * Address comments - part 2 * Address comments - part 3 * Address comments - Part 4 * Param tag fix * [KIP-714] Fix to avoid sending a NULL metrics payload (confluentinc#4912) Otherwise a NULL payload isn't accepted by the protocol and is causing a disconnection at every push interval * [KIP-714] Fix: Metric name from the protocol is read outside boundaries (confluentinc#5105) * Failing test * Fixes confluentinc#5102 * [KIP-714] Fix for duplicate metrics when multiple prefixes are matching (confluentinc#5104) * Failing test * Fixes confluentinc#5103 * Remove CURL warning and Makefile fix (confluentinc#5110) * Makefile: bash is required for LICENSES.txt * Conditionally use the older or the newer CURLOPT CURLOPT_PROTOCOLS_STR is available since CURL 7.85.0 * librdkafka v2.10.1 (confluentinc#5112) * [KIP-714] Fix idle ratio calculation for non forwarded queues (confluentinc#5017) * Fix for devel assert "rkq->rkq_ts_last_poll_end >= rkq->rkq_ts_last_poll_start" in test 0056. Partition queues can be not forwarded to the main poll queue so these timestamps must be per-queue instead of per-instance. We avoid checking if app polled should be called in case of internal poll calls where we're sure it's not a consume call, we also avoid checking it if "app polled" was already called by a dedicated consume function. * [KIP-848] Test 0147: mock tests specific for the 848 consumer group protocol (confluentinc#4920) * [KIP-8484] Mock tests specific for the new consumer group protocol * Expedite HB only after coordinator actually changes Expedite coordinator query when there's a DESTROY_BROKER error. * Flaky test 0067 fix * [KIP-848] fix generated memberid uniqueness (confluentinc#5101) * [KIP-848] Fixed not adhering to HB interval when auto commit interval was less than HB interval (confluentinc#5114) * [KIP-1102] Enable clients to rebootstrap based on timeout or error code (confluentinc#4981) * Fix data race when a buffer queue is being initialized instead of being reset (confluentinc#4718) A data race happened when emptying buffers of a failing broker, in its thread, with the statistics callback in main thread gathering the buffer counts. Solved by resetting the atomic counters instead of initializing them. Happening since 1.x Closes confluentinc#4522 * librdkafka v2.11.0 (confluentinc#5121) * [KIP-848] [mock cluster] Improved static group membership implementation (confluentinc#5030) When member is leaving with -2 a new member that is joining can replace the previous member id with the new one. If previous member didn't leave the new member joining with same group.instance.id is fenced in new protocol. * Features `BROKER_BALANCED_CONSUMER` and `SASL_GSSAPI` don't depend on `JoinGroup v0` anymore (confluentinc#5131) `JoinGroup v0` was removed in AK 4.0 and CP 8.0 * [KIP-848] Tests for: Fix for a rapid unsubscribe while the member id is still not assigned (confluentinc#4700) local tests to check a previous segfault is avoided with fast subscribe/unsubscribe changes. * [KIP-848] Tests for ListGroups filter to list only given group types (confluentinc#4861) * [KIP-848] Tests for ListGroups filter to list only given group types * Fix doxygen tags order * Improve ' test_ListConsumerGroups_helper' documentation --------- Co-authored-by: mahajanadhitya <[email protected]> * [KIP-1139] Add support for OAuth jwt-bearer grant type (confluentinc#4978) support for `jwt-bearer` grant type as mandated by the KIP. --------- Co-authored-by: Emanuele Sabellico <[email protected]> * Improve HTTPS CA certificates configuration (confluentinc#5107) Use `rd_kafka_ssl_probe_and_set_default_ca_location` and set same CA search policy as Kafka SSL. - on Windows try to load from certificate store or fallback to paths in case using msys2. - on macOS use `probe` by default. - on Linux use `probe` when configured, check default path for dynamic linked OpenSSL or fallback to `probe` when statically linked. --------- Co-authored-by: Emanuele Sabellico <[email protected]> * `https.ca.pem` requires `CURLOPT_CAINFO_BLOB` so at least CURL 7.77.0 (confluentinc#5133) * Update C++ client error codes (confluentinc#5134) * Fix feature version ranges, currently matching a single version (confluentinc#5130) Made the conditions for enabling the features future proof, allowing to remove RPC versions in a subsequent major Apache Kafka version without disabling features. The existing checks were matching a single version instead of a range and were failing if the older version was removed. Happening since 1.x * Avoid returning an all brokers down error on planned disconnections (confluentinc#5126) * Skip increasing brokers down count for planned disconnections, those that were logged as DEBUG, including idle disconnections. Only the first time it reaches the DOWN state it's skipped but it's counted if there are further disconnections. * Avoid returning an all brokers down error on planned disconnections This is done by avoiding to count planned disconnections, such as idle disconnections, broker host change and similar, as events that can cause the client to reach the "all brokers down" state, returning an error and possibly restarting a re-bootstrap sequence since 2.10.0 * Add more information to disconnection error messages * Skip mock test when using SSL * `rd_atomic32_set` and `rd_atomic64_set` return the previous value to be able to check if the atomic set changed the value * Fix for `rd_kafka_broker_set_error` * Regarding broker errors, only log warnings and only report errors to the user for levels less or equal to `ERR` * Reduce flakyness: test 0083 * Reset the `rkb_down_reported` field when a broker reaches the `UP` state, to check again other brokers before reaching the `ALL_BROKERS_DOWN` state. Avoids repeated `ALL_BROKERS_DOWN`, maybe triggered by idle disconnections when only connecting to a single broker * Consider the disconnection after timeout as a planned disconnection that doesn't count for reaching the all brokers down state * Reset any broker down reported on re-bootstrap too * Revert `rd_kafka_connect_any` to the previous version. Given we reset the broker down reported state it's not needed anymore to prioritize the learned brokers as all brokers will be tried again before starting a new re-boostrap sequence * Add the `rk_rebootstrap_in_progress` field to prevent duplicated re-bootstrap starts if the initial re-bootstrap sequence isn't completed yet. * Remove flakyness from test 0151 - `do_test_down_then_up_no_rebootstrap_loop` and make it consistent with current changes * Fix test 0034 given the down reported state is reset, after retrying on re-bootstrap a new ALL_BROKERS_DOWN error is issued. This happens after 1 second max (`rd_kafka_broker_reconnect_backoff`) and the test 0034 timeout is 5 seconds so it doesn't timeout and set the broker up again. With an explicit timeout check the test is reliable. * Fix memory leaks in tests * Fix flakyness in test 0075 when using valgrind due to timing and multiple connect requests * Set up a fixed max jitter of 2s * Remove flakyness from test 0086_purge_remote * Fix for a deadlock when calling `rd_kafka_filter_broker_by_GetTelemetrySubscription` introduced with confluentinc#5130 (not released) * librdkafka v2.11.1 (confluentinc#5152) * Fix deadlock in `rd_kafka_reset_any_broker_down_reported` (confluentinc#5156) * Fix deadlock in `rd_kafka_reset_any_broker_down_reported` by releasing and reacquiring the broker lock, to respect lock ordering and avoid deadlocks * Moved resetting the broker down reported field after setting the broker state for consistency with rest of the counters when unlocking * Remove Debian 10 from verify packages script (confluentinc#5171) * Revert Makefile to use sh, removing the single bashism present (confluentinc#5184) * Fix compression types read issue in GetTelemetrySubscriptions response with big-endian architectures (confluentinc#5183) * Fix compression types read issue in GetTelemetrySubscriptions response for big-endian architectures * Decrease allocated buffer size in `rd_kafka_PushTelemetryRequest` and explicitly cast the enum * Handle integer overflow in `rd_kafka_broker_buf_retry` (confluentinc#5157) * [KIP-848] Fixes related to Offset Commit previous error and ConsumerGroupHeartbeat not updating member epoch in a case (confluentinc#4672) [KIP-848] Fixed a condition where error was being raised in commit due to old error in the topic partition [KIP-848] Fix discarding heartbeat response without epoch update when leaving during inflight HB * Fix for KIP-1102 time based re-bootstrap condition (confluentinc#5177) Re-bootstrap is now triggered only after metadata.recovery.rebootstrap.trigger.ms have passed since first metadata refresh request after last successful metadata response. The calculation was since last successful metadata response so it's possible it did overlap with the periodic topic.metadata.refresh.interval.ms and cause a re-bootstrap even if not needed. * librdkafka 2.12.0 (confluentinc#5196) * [KIP-320] Validate assigned partitions before starting to consume from them (confluentinc#4931) * Fetched committed offsets should be validated before starting to consume from it. Failing test and mock handler implementation for returning the committed offset leader epoch instead of current leader epoch. * Validate the offsets before starting to fetch assigned partitions * Add more test cases for partition assignment offset validation * Fix for test 0139 subtest `do_test_store_offset_without_leader_epoch` . When fetching an offset it returns the leader epoch used when committing, not the current leader epoch. Given the mock cluster fix the test needs to be changed. * Fix test `0139` subtest `do_test_list_offsets_leader_change`: use cloned partition list for listing offsets, to avoid the fake leader epoch is then used for validation when assigning. Fix ListOffsets mock handler for logging the correct returned leader epoch. * Changelog entry * Reduce number of tests in quick mode * Add a new fetch state when finishing validating and starting to seek after a truncation, to avoid a second repeated validation and possibly duplicated messages. * Increase single test timeout * Fix to leave the group in `rd_kafka_cgrp_incr_unassign_done` if terminate was requested, as done in `rd_kafka_cgrp_unassign_done` and `rd_kafka_cgrp_consumer_incr_unassign_done` * Mock cluster, set the group as empty when last member leaves instead of triggering a rebalance * Test 0139 with mock cluster marked as local. Doesn't delete topic if tests are local only as it's possible there's no cluster to connect to and it speeds up completing the test * Resume the partition before fetch start or before validation * fix double free for hdrs (pass ownership to message) (confluentinc#4628) --------- Co-authored-by: Arthur O'Dwyer <[email protected]> * Revert setting timeout to infinity (confluentinc#5201) * Revert setting timeout to infinity * style fix * Changelog change * Changelog changes * Changelog change * Fix flakyness of tests 0014 and 0085 (confluentinc#5189) * Fix flakyness test 0085 * Errors that cause a refresh coordinator like NOT_COORDINATOR during an offset fetch should not be propagated to the application. * Pipeline improvements about machine types, auto-cancel, caching, manual promotions (confluentinc#5191) * Pipeline improvements about machine types and auto-cancel * Use cached docker image for integration tests, style checks, docs build * vcpkg cache * msys2 cache * Upgrade macOS agents * Implementation of OAUTHBEARER/OIDC metadata based authentication (confluentinc#5155) * Implementation of OAUTHBEARER/OIDC metadata based authentication, initially supporting the Azure UAMI method. * Tests with trivup 0.14.0 supporting metadata based authentications * Add documentation and changelog entry * Rename `azure` value to `azure_imds` and replace UAMI that is the identity with IMDS that is the authentication service * Extract authentication URL and rename internal function and enums * Changes to name the configuration property "query" instead of "params" as in other implementations and to make it optional if the default endpoint is overridden. * Revert "[KIP-320] Validate assigned partitions before starting to consume from them (confluentinc#4931)" (confluentinc#5207) This reverts commit 13a2bba. * [KIP-848] Add test cases for new OffsetCommit and OffsetFetch Error Codes (confluentinc#5194) * Add test cases for new OffsetCommit and OffsetFetch Error Codes * Testcase for discarding the member epoch in a consumer group heartbeat response when leaving with an inflight HB * [KIP-848] Added migration guide and removed preview warning (confluentinc#5210) * [KIP-848] Tests for: dev_kip848_fix_fast_subscribe_or_unsubscribe (confluentinc#5026) * Changelog changes and some modification to the KIP-848 migration guide (confluentinc#5214) * Changelog changes and some modification to the KIP-848 migration guide * Add that KIP-848 is not enabled by default and other PR comments * Kerberos cross-real authentication changelog (confluentinc#5215) * Downgrad min supported OSX version to 13 (confluentinc#5219) * Downgrade min supported OSX version to 13 * Version upgrade to v2.12.1 * Changelog for 2.12.1 (confluentinc#5220) * SOL-143245: Commented out the error level check in rd_kafka_broker_set_error() to report all errors to the application * SOL-143245: Fixed rd_kafka_broker_set_error() to return more specific disconnect error string --------- Co-authored-by: Emanuele Sabellico <[email protected]> Co-authored-by: Milind L <[email protected]> Co-authored-by: trnguyencflt <[email protected]> Co-authored-by: Chris Novakovic <[email protected]> Co-authored-by: Pranav Rathi <[email protected]> Co-authored-by: ConfluentTools <[email protected]> Co-authored-by: service-bot-app[bot] <189278048+service-bot-app[bot]@users.noreply.github.com> Co-authored-by: Matt Fleming <[email protected]> Co-authored-by: renovatebot-confluentinc[bot] <169726756+renovatebot-confluentinc[bot]@users.noreply.github.com> Co-authored-by: service-bot-app[bot] <foo-bar+service-bot-app[bot]@users.noreply.github.com> Co-authored-by: Pratyush Ranjan <[email protected]> Co-authored-by: Anchit Jain <[email protected]> Co-authored-by: Marcin Krystianc <[email protected]> Co-authored-by: mahajanadhitya <[email protected]> Co-authored-by: Ritish Kumar Singh <[email protected]> Co-authored-by: blindspotbounty <[email protected]> Co-authored-by: Arthur O'Dwyer <[email protected]>
Closing #4749 in favour of this PR.