Fix: Avoid unnecessary producer epoch bumps#4972
Marcin Krystianc (marcin-krystianc) wants to merge 7 commits into confluentinc:master
🎉 All Contributor License Agreements have been signed. Ready to merge.
- CI/CD script (See #3) - confluentinc#4989 - confluentinc#4972 - confluentinc#4905 - confluentinc#4864
- confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) - confluentinc#5055 (Add missing wrlock around rd_kafka_metadata_cache_hint call)
Thanks for the contribution, Marcin Krystianc (@marcin-krystianc). I just need to change a test that expects a different behaviour. Note that the exactly-once guarantees are only partial for the idempotent producer; see this comment.
at line 3910
… when it's not the first error. In such cases all messages can be retried: if the first error was persistent, it has already caused an epoch bump, and if the message was possibly persisted, it can be retried together with the following sequences. If it bumps the epoch instead, there's a chance it causes side effects such as message duplication on other partitions.
/sem-approve
are sent before setting the broker down
/sem-approve
Thanks. Are there any other tests worth adding? I'm happy to do it, but I need some guidance.
Marcin Krystianc (@marcin-krystianc), at the moment I can't think of other tests related to this error that differ from the existing ones. In theory it makes sense, and in practice the Java client does the same, so it seems like the safer behaviour. By the way, if you remove the sleeps in test 0144, it fails with a fatal error (r=0). I verified that this is because of the mocked error codes, which should be returned after the error caused by the disconnection, and not because of a problem with the new logic. Anyway, we have other soak and transactional tests that we'll run against real brokers. Thanks again!
- G-Research#3 (CI/CD script) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer)
* Hotfix Release: v2.11.0-gr - collective changes - #3 (CI/CD script) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) * Fix line endings * Make style checks CI job work It will fail because of some style issues from upstream, but at least it should complete instead of hanging forever. * Build for arm64 linux without emulation
Hi Emanuele Sabellico (@emasab), do you have a rough timeline for when you anticipate being able to merge this change?
- G-Research#3 (CI/CD script) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
- G-Research#3 (CI/CD script) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
- #3 (CI/CD script) - confluentinc#4972 (Avoid unnecessary producer epoch bumps) - confluentinc#4989 (Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer) - confluentinc#5168 (Use system-provided cyrus-sasl/libsasl2 at runtime)
* CI/CD script * Avoid unnecessary producer epoch bumps confluentinc#4972 * Fully utilize the max.in.flight.requests.per.connection parameter on the idempotent producer confluentinc#4989 * Use system-provided cyrus-sasl/libsasl2 at runtime confluentinc#5168 * Update changelog
Fixes: #4953
When the producer is idempotent and max.in.flight.requests.per.connection is set to a value between 2 and 5, it's normal to receive OUT_OF_ORDER_SEQUENCE_NUMBER produce responses for requests R2 through R5 when R1 failed for any other reason.
Bumping the producer epoch in this scenario violates the "exactly-once" guarantees. Therefore, we believe it's unnecessary to bump the producer's epoch; re-enqueuing the messages is sufficient.
The same "retry, but don't bump producer epoch" behavior is implemented in the Java client: https://github.com/apache/kafka/blob/a6a588fbed9982598377060c63f94ee6184b4295/clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionManager.java#L1015-L1016
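To make the intended rule concrete, here is a minimal sketch in C of the decision described above. It is not librdkafka's actual code: the type names, `decide_action` and `count_epoch_bumps` are hypothetical, made up for illustration. It also assumes that an OUT_OF_ORDER_SEQUENCE_NUMBER error which is the first error in the window keeps the old epoch-bump behaviour, which is how the review comment in this thread reads.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only; not librdkafka internals. */
typedef enum {
    ERR_NONE,
    ERR_TRANSPORT,             /* e.g. the broker connection went down */
    ERR_OUT_OF_ORDER_SEQUENCE  /* OUT_OF_ORDER_SEQUENCE_NUMBER */
} produce_err_t;

typedef enum {
    ACTION_NONE,                 /* request succeeded, nothing to do */
    ACTION_RETRY,                /* re-enqueue with the same epoch */
    ACTION_BUMP_EPOCH_AND_RETRY  /* request a new epoch, then retry */
} produce_action_t;

/* Decide what to do with one failed produce request.
 * earlier_request_failed: true if a request sent before this one, in the
 * same in-flight window, has already failed. */
static produce_action_t decide_action(produce_err_t err,
                                      bool earlier_request_failed) {
    switch (err) {
    case ERR_NONE:
        return ACTION_NONE;
    case ERR_OUT_OF_ORDER_SEQUENCE:
        /* A follow-on error (R2..R5 after R1 failed) is expected:
         * retry only. Assumption: a first error still bumps the epoch. */
        return earlier_request_failed ? ACTION_RETRY
                                      : ACTION_BUMP_EPOCH_AND_RETRY;
    default:
        return ACTION_RETRY;
    }
}

/* Walk the responses of one in-flight window in send order and count
 * how many epoch bumps the rule above would trigger. */
static size_t count_epoch_bumps(const produce_err_t *errs, size_t n) {
    bool earlier_failed = false;
    size_t bumps = 0;
    for (size_t i = 0; i < n; i++) {
        if (decide_action(errs[i], earlier_failed) ==
            ACTION_BUMP_EPOCH_AND_RETRY)
            bumps++;
        if (errs[i] != ERR_NONE)
            earlier_failed = true;
    }
    return bumps;
}
```

For a window where R1 fails with a transport error and R2 through R5 come back with OUT_OF_ORDER_SEQUENCE_NUMBER, `count_epoch_bumps` returns 0 under this rule: all five requests are simply re-enqueued.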