Fix loop of OffsetForLeaderEpoch requests on quick leader changes #4433

Merged
Emanuele Sabellico (emasab) merged 6 commits into master from dev_bug_offsetforleaderepoch on Sep 29, 2023

Conversation

@milindl
Contributor

Fixes #4425

@milindl Milind L (milindl) requested a review from a team September 14, 2023 05:26
@mjd95

Thanks for the PR! We have a pretty reliable scenario for triggering #4425: we do a reassignment on a small topic so that there are at least two partition state changes in short succession, after which the first and all subsequent OffsetForLeaderEpoch requests fail with FencedLeaderEpoch.

We ran the consumers from this branch and confirmed that we don't see the issue with this patch.
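
For readers unfamiliar with the failure mode: a Kafka broker answers a request that carries a leader epoch older than its own with FENCED_LEADER_EPOCH, so a client that keeps retrying with the same stale epoch can never succeed. The toy model below illustrates only that general mechanism. The struct, the function names, and the retry cap are hypothetical, and this is neither librdkafka's code nor a description of the actual root cause, which this thread does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative error codes for the toy model (not librdkafka constants). */
enum { TOY_ERR_NONE = 0, TOY_ERR_FENCED_LEADER_EPOCH = 74 };

/* Hypothetical broker that only tracks the partition's leader epoch. */
typedef struct {
        int leader_epoch;
} toy_broker_t;

/* The toy broker fences any request whose epoch is older than its own.
 * (A real broker has more cases; they are left out of this sketch.) */
int toy_offset_for_leader_epoch(const toy_broker_t *broker,
                                int current_leader_epoch) {
        if (current_leader_epoch < broker->leader_epoch)
                return TOY_ERR_FENCED_LEADER_EPOCH;
        return TOY_ERR_NONE;
}

/* Retry validation up to max_attempts times.
 * Returns the number of attempts used on success, or -1 if every
 * attempt was fenced. When refresh_on_fence is false the client keeps
 * sending its stale cached epoch, which models the endless loop. */
int toy_validate(const toy_broker_t *broker,
                 int cached_epoch,
                 int metadata_epoch,
                 bool refresh_on_fence,
                 int max_attempts) {
        int attempt;

        assert(max_attempts > 0);

        for (attempt = 1; attempt <= max_attempts; attempt++) {
                int err = toy_offset_for_leader_epoch(broker, cached_epoch);
                if (err == TOY_ERR_NONE)
                        return attempt;
                if (refresh_on_fence)
                        cached_epoch = metadata_epoch; /* pick up newer epoch */
        }
        return -1;
}
```

With a broker at epoch 3 (two leader changes after epoch 1) and a client still caching epoch 1, `toy_validate(&broker, 1, 3, false, 5)` exhausts all attempts, while passing `true` for `refresh_on_fence` succeeds on the second attempt.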

@milindl
Contributor Author

Thanks for confirming it with some independent testing, Martin Dickson (@mjd95)!

Member

@pranavrth Pranav Rathi (pranavrth) left a comment

LGTM! Just a question and a nit.

Comment on lines +266 to +272
rd_kafka_mock_broker_push_request_error_rtts(
    mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1,
    RD_KAFKA_RESP_ERR_KAFKA_STORAGE_ERROR, 900);

rd_kafka_mock_broker_push_request_error_rtts(
    mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1,
    RD_KAFKA_RESP_ERR_NO_ERROR, 1000);
Member

We can merge these two.

Contributor Author

Done
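
The merge suggested above relies on the mock helper taking a count followed by that many (error, RTT) pairs, so two single-pair calls can be folded into one two-pair call. A minimal self-contained sketch of that pattern follows. `toy_push_error_rtts`, `toy_queue_t`, and the `TOY_ERR_*` constants are hypothetical stand-ins written for illustration, not librdkafka's implementation.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Illustrative error codes (stand-ins, not librdkafka constants). */
enum { TOY_ERR_NO_ERROR = 0, TOY_ERR_KAFKA_STORAGE_ERROR = 56 };

#define TOY_QUEUE_MAX 16

/* Hypothetical per-broker queue of injected (error, rtt) responses. */
typedef struct {
        int err[TOY_QUEUE_MAX];
        int rtt_ms[TOY_QUEUE_MAX];
        size_t len;
} toy_queue_t;

/* Append cnt (int err, int rtt_ms) pairs to the queue.
 * Returns the number of pairs actually queued. */
size_t toy_push_error_rtts(toy_queue_t *q, size_t cnt, ...) {
        va_list ap;
        size_t i, pushed = 0;

        assert(q != NULL);

        va_start(ap, cnt);
        for (i = 0; i < cnt && q->len < TOY_QUEUE_MAX; i++) {
                q->err[q->len]    = va_arg(ap, int);
                q->rtt_ms[q->len] = va_arg(ap, int);
                q->len++;
                pushed++;
        }
        va_end(ap);
        return pushed;
}

/* Two calls with one pair each, as in the snippet under review. */
void toy_push_separately(toy_queue_t *q) {
        toy_push_error_rtts(q, 1, TOY_ERR_KAFKA_STORAGE_ERROR, 900);
        toy_push_error_rtts(q, 1, TOY_ERR_NO_ERROR, 1000);
}

/* One call with two pairs: the count goes from 1 to 2. */
void toy_push_merged(toy_queue_t *q) {
        toy_push_error_rtts(q, 2,
                            TOY_ERR_KAFKA_STORAGE_ERROR, 900,
                            TOY_ERR_NO_ERROR, 1000);
}

/* Returns 1 if both queues hold the same pairs in the same order. */
int toy_queues_equal(const toy_queue_t *a, const toy_queue_t *b) {
        size_t i;
        if (a->len != b->len)
                return 0;
        for (i = 0; i < a->len; i++)
                if (a->err[i] != b->err[i] || a->rtt_ms[i] != b->rtt_ms[i])
                        return 0;
        return 1;
}
```

Both variants leave the same two entries in the queue in the same order; the merged form just has to remember to raise the count along with the extra pair.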

 *
 * See #4425.
 */
static void do_test_two_leader_changes(void) {
Member

Can you confirm that this test fails with the old code?

Contributor Author

Yep, it fails with a timeout after we enter the infinite loop.

@emasab Emanuele Sabellico (emasab) deleted the dev_bug_offsetforleaderepoch branch September 29, 2023 10:23
Axel Andersson (axelandersson) added a commit to axelandersson/librdkafka that referenced this pull request Oct 5, 2023
* upstream/master:
  librdkafka v2.3.0 (confluentinc#4455)
  Fix for idempotent producer fatal errors, triggered after a possibly persisted message state (confluentinc#4438)
  Move can_q_contain_fetched_msgs inside q_serve (confluentinc#4431)
  [KIP-580] Exponential Backoff with Mock Broker Changes to Automate Testing. (confluentinc#4422)
  Update only the mklove version of OpenSSL to 3.0.11 (confluentinc#4454)
  Permanent errors during offset validation should be retried (confluentinc#4447)
  Increased flexver request size for Metadata request to include topic_id size (confluentinc#4453)
  Fix loop of OffsetForLeaderEpoch requests on quick leader changes (confluentinc#4433)
  Fix for stored offsets not being committed if they lacked the leader epoch (confluentinc#4442)
  Add leader epoch to control messages (confluentinc#4434)
  Refactored tmpabuf and fixed an insufficient buffer allocation (confluentinc#4449)
  Work around KIP-700 restrictions for DescribeCluster [KIP-430]
  [admin] KIP-430: Add authorized operations to describe API
  Fix segfault if assignor state is NULL, (confluentinc#4381)
@ghost

Emanuele Sabellico (@emasab) Do you have an ETA on when 2.3 will be released? We believe this might be the issue we are experiencing.

Development

Successfully merging this pull request may close these issues.

OffsetForLeaderEpoch loop of failed requests with multiple leader changes
