Fix: Avoid unnecessary producer epoch bumps #4972

Open
Marcin Krystianc (marcin-krystianc) wants to merge 7 commits into confluentinc:master from marcin-krystianc:dev-20250219-outoforder

@marcin-krystianc
Contributor

Fixes: #4953

When the producer is idempotent and max.in.flight.requests.per.connection is set to a value between 2 and 5, it is normal to receive OUT_OF_ORDER_SEQUENCE_NUMBER produce responses for requests R2 through R5 when R1 has failed for any other reason.

Bumping the producer epoch in this scenario violates the "exactly-once" guarantees. Therefore, we believe that it's unnecessary to bump the producer's epoch; re-enqueuing the messages is sufficient.

The same "retry, but don't bump producer epoch" behavior is implemented in the Java client: https://github.com/apache/kafka/blob/a6a588fbed9982598377060c63f94ee6184b4295/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1015-L1016
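The scenario can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not librdkafka or broker code: `MockPartition`, `send_window` and the `REQUEST_TIMED_OUT` stand-in for "any other reason" are hypothetical names chosen for illustration. It shows that once R1 fails, R2 through R5 are rejected because of the sequence gap, and that resending all five batches with the same epoch and sequence numbers is accepted in order.

```python
from dataclasses import dataclass, field

OK = "NO_ERROR"
OUT_OF_ORDER = "OUT_OF_ORDER_SEQUENCE_NUMBER"
DUPLICATE = "DUPLICATE_SEQUENCE_NUMBER"
TRANSIENT = "REQUEST_TIMED_OUT"  # assumption: stand-in for "any other reason"


@dataclass
class MockPartition:
    """Toy broker-side state for one (producer id, epoch, partition)."""
    next_seq: int = 0
    log: list = field(default_factory=list)

    def produce(self, base_seq, msgs, inject_error=None):
        if inject_error is not None:
            return inject_error      # request fails, nothing is appended
        if base_seq > self.next_seq:
            return OUT_OF_ORDER      # gap: an earlier batch is missing
        if base_seq < self.next_seq:
            return DUPLICATE         # batch was already appended
        self.log.extend(msgs)
        self.next_seq += len(msgs)
        return OK


def send_window(partition, batches, fail_first=False):
    """Send batches back to back, as with several requests in flight."""
    results = []
    for i, (base_seq, msgs) in enumerate(batches):
        err = TRANSIENT if (fail_first and i == 0) else None
        results.append(partition.produce(base_seq, msgs, err))
    return results


partition = MockPartition()
# Five single-message batches R1..R5 carrying sequences 0..4.
batches = [(seq, [f"m{seq}"]) for seq in range(5)]

# R1 fails, so R2..R5 arrive with sequences the partition does not expect yet.
first_attempt = send_window(partition, batches, fail_first=True)

# Re-enqueue everything with the same sequences and the same epoch.
retry = send_window(partition, batches)

print(first_attempt)
print(retry)
print(partition.log)
```

In this model no epoch bump is needed for the retry to succeed, because nothing from the failed window was appended and the original sequence numbers are still the ones the partition expects.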

@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
✅ marcin-krystianc
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

Marcin Krystianc (marcin-krystianc) added a commit to marcin-krystianc/librdkafka that referenced this pull request Apr 29, 2025
 - confluentinc#4972 (Avoid unnecessary producer epoch bumps)
 - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5055 (Add missing wrlock around rd_kafka_metadata_cache_hint call)
Marcin Krystianc (marcin-krystianc) added a commit to G-Research/librdkafka that referenced this pull request Apr 30, 2025
 - confluentinc#4972 (Avoid unnecessary producer epoch bumps)
 - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5055 (Add missing wrlock around rd_kafka_metadata_cache_hint call)
@emasab
Contributor

Thanks for the contribution, Marcin Krystianc (@marcin-krystianc). I just need to change a test that expects a different behaviour. Note that the exactly-once guarantees are only partial for the idempotent producer; see this comment.

… when it's not the first error.

In such cases all messages can be retried: if the first error was persistent, it has already caused an epoch bump, and if the first message was possibly persisted, it can be retried together with the next sequences.

If the epoch is bumped instead, there is a chance of side effects such as message duplication on other partitions.
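The rule described in this commit message can be sketched as a small decision function. This is a simplified model, not the librdkafka implementation: `Action`, `classify_out_of_order` and both parameter names are hypothetical, and what should happen in the "first error" case is deliberately left abstract because the commit message does not spell it out.

```python
from enum import Enum


class Action(Enum):
    RETRY_SAME_EPOCH = "re-enqueue with the same epoch and sequences"
    ESCALATE = "handle as a genuine sequence problem"


def classify_out_of_order(failed_seq: int, oldest_unacked_seq: int) -> Action:
    """Decide how to react to OUT_OF_ORDER_SEQUENCE_NUMBER (toy model).

    failed_seq:         base sequence of the rejected batch (assumed name)
    oldest_unacked_seq: base sequence of the oldest batch that has not
                        been acknowledged yet (assumed name)
    """
    if failed_seq > oldest_unacked_seq:
        # Not the first error: an earlier batch is still unresolved, so this
        # rejection is only a knock-on effect of that earlier failure.
        return Action.RETRY_SAME_EPOCH
    # The rejected batch is itself the head of the in-flight window.
    return Action.ESCALATE


# R3 (sequence 2) is rejected while R1 (sequence 0) is still unacknowledged.
print(classify_out_of_order(failed_seq=2, oldest_unacked_seq=0).name)
# R1 itself is rejected.
print(classify_out_of_order(failed_seq=0, oldest_unacked_seq=0).name)
```

Keeping the decision as a pure function of two sequence numbers makes the "not the first error" condition easy to unit test in isolation.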
@emasab
Contributor

/sem-approve

are sent before setting the broker down
@emasab
Contributor

/sem-approve

@marcin-krystianc
Contributor Author

> Thanks for the contribution, Marcin Krystianc (@marcin-krystianc). I just need to change a test that expects a different behaviour. Note that the exactly-once guarantees are only partial for the idempotent producer; see this comment.

Thanks. Are there any other tests worth adding? I'm happy to do it, but I need some guidance.

@emasab
Contributor

Emanuele Sabellico (emasab) commented May 19, 2025

Marcin Krystianc (@marcin-krystianc), at the moment I can't think of other tests for this error that differ from the existing ones. In theory the change makes sense, and in practice the Java client does the same, so it seems like the safer behaviour.
It doesn't happen with the transactional producer, as the transaction is aborted by the code preceding this change.
I'll request a second review as well.

By the way, if you remove the sleeps in test 0144, it fails with a fatal error (r=0). I verified that this is due to the mocked error codes, which should be returned after the error caused by the disconnection, and not to a problem with the new logic. In any case, we have other soak and transactional tests that we'll run against real brokers.

Thanks again!

Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Jul 31, 2025
- G-Research#3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
Mark Wadham (m4rkw) pushed a commit to G-Research/librdkafka that referenced this pull request Jul 31, 2025
* Hotfix Release: v2.11.0-gr - collective changes

- #3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)

* Fix line endings

* Make style checks CI job work

It will fail because of some style issues from upstream, but at least it should complete instead of hanging forever.

* Build for arm64 linux without emulation
@marcin-krystianc
Contributor Author

Hi Emanuele Sabellico (@emasab), do you have a rough timeline for when you expect to be able to merge this change?

Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Sep 8, 2025
- G-Research#3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to G-Research/librdkafka that referenced this pull request Sep 8, 2025
- #3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Oct 9, 2025
- G-Research#3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Oct 22, 2025
- G-Research#3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to G-Research/librdkafka that referenced this pull request Oct 22, 2025
- #3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Jan 16, 2026
- G-Research#3 (CI/CD script)
- confluentinc#4972 (Avoid unnecessary producer epoch bumps)
- confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
 - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Jan 16, 2026
Jonathan Giannuzzi (jgiannuzzi) added a commit to G-Research/librdkafka that referenced this pull request Jan 16, 2026
* CI/CD script

* Avoid unnecessary producer epoch bumps

confluentinc#4972

* Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer

confluentinc#4989

* Use system-provided cyrus-sasl/libsasl2 at runtime

confluentinc#5168

* Update changelog
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Mar 6, 2026
Jonathan Giannuzzi (jgiannuzzi) added a commit to G-Research/librdkafka that referenced this pull request Mar 8, 2026
* CI/CD script

* Avoid unnecessary producer epoch bumps

confluentinc#4972

* Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer

confluentinc#4989

* Use system-provided cyrus-sasl/libsasl2 at runtime

confluentinc#5168

* Update changelog
Jonathan Giannuzzi (jgiannuzzi) added a commit to jgiannuzzi/librdkafka that referenced this pull request Apr 9, 2026
Jonathan Giannuzzi (jgiannuzzi) added a commit to G-Research/librdkafka that referenced this pull request Apr 9, 2026
* CI/CD script

* Avoid unnecessary producer epoch bumps

confluentinc#4972

* Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer

confluentinc#4989

* Use system-provided cyrus-sasl/libsasl2 at runtime

confluentinc#5168

* Update changelog
Development

Successfully merging this pull request may close these issues.

Sporadic message duplication during leader transition with idempotent producer

2 participants