Handle node failure properly in sender by vmaurin · Pull Request #1135 · aio-libs/aiokafka

vmaurin · 2025-11-14T14:34:59Z

Various sender activities require the sender to talk to a dedicated coordinator node, for a related group or for transaction management.

When trying to reach this node, an error could happen because of the broker being unavailable or unreachable. In this case, we must mark the coordinator dead, so instead of retrying over and over to the same node, we will first try to find an active coordinator.

fixes #1134

Changes

Fixes #1134

Checklist

I think the code is well written
Unit tests for the changes exist
Documentation reflects the changes
Add a new news fragment into the CHANGES folder
- name it <issue_id>.<type> (e.g. 588.bugfix)
- if you don't have an issue_id change it to the pr id after creating the PR
- ensure type is one of the following:
  - .feature: Signifying a new feature.
  - .bugfix: Signifying a bug fix.
  - .doc: Signifying a documentation improvement.
  - .removal: Signifying a deprecation or removal of public API.
  - .misc: A ticket has been closed, but it is not of interest to users.
- Make sure to use full sentences with correct case and punctuation, for example: Fix issue with non-ascii contents in doctest text files.

codecov · 2025-11-14T14:57:50Z

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 94.80%. Comparing base (7d0bd25) to head (6615f07).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
aiokafka/producer/sender.py	95.83%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1135      +/-   ##
==========================================
- Coverage   94.83%   94.80%   -0.03%     
==========================================
  Files          88       88              
  Lines       15671    15716      +45     
  Branches     1374     1374              
==========================================
+ Hits        14861    14900      +39     
- Misses        565      570       +5     
- Partials      245      246       +1

Flag	Coverage Δ
cext	`94.76% <97.87%> (-0.03%)`	⬇️
integration	`94.69% <97.87%> (-0.03%)`	⬇️
purepy	`94.76% <97.87%> (-0.03%)`	⬇️
unit	`51.18% <14.89%> (-0.10%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

vmaurin · 2025-11-14T15:07:44Z

@ods Does it look like a good approach to you ?

vmaurin · 2025-11-25T07:41:11Z

@ods The logic of this PR is too mark the (transaction or group) coordinator dead when we encounter a low level error while trying to send a message to it. On the retry, it will give the opportunity to find the eventual new coordinator, while currently it is just looping over and over like described in #1134

vmaurin · 2025-11-26T17:55:41Z

Back to draft.
This PR was made on the assumption that the handling in handle_response was working as expected, but that is not the case. Even marking the coordinator dead won't trigger a new discovery / retry

Waiting for a new solution

vmaurin · 2025-11-27T10:27:54Z

Back to draft. This PR was made on the assumption that the handling in handle_response was working as expected, but that is not the case. Even marking the coordinator dead won't trigger a new discovery / retry

@ods I was wrong !

I did a dumb mistake as on my testing project, the __transaction_state topic had a replication factor of 1. When I was killing the coordinator, no new leader was elected and the client was getting a -1 response as coordinator.

Setting the proper replication factor for __transaction_state, I managed to have the proper use cases:

I simulate a loop of transactional processing
I find the coordinator with bin/kafka-transactions.sh --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 list
I kill the coordinator
after getting a NodeNotAvailableError, the producer is finding the new coordinator (this time a new broker is elected leader)
the transaction can complete properly

On master branch, the producer is looping over NodeNotAvailableError as the coordinator is never marked dead.

I reduce the exception scope to only two exceptions, we might need to had more if needed, but it feels more appropriated like this.

So I would consider this PR ready to merge

Various sender activities require the sender to talk to a dedicated coordinator node, for a related group or for transaction management. When trying to reach this node, an error could happen because of the broker being unavailable or unreachable. In this case, we must mark the coordinator dead, so instead of retrying over and over to the same node, we will first try to find an active coordinator. fixes aio-libs#1134

vmaurin force-pushed the issue-1134 branch 3 times, most recently from c8f4f51 to 1290a2c Compare November 14, 2025 14:44

vmaurin force-pushed the issue-1134 branch 2 times, most recently from 0a7c691 to d23e281 Compare November 14, 2025 15:06

ods reviewed Nov 25, 2025

View reviewed changes

Comment thread aiokafka/producer/sender.py Outdated

ods previously approved these changes Nov 25, 2025

View reviewed changes

vmaurin force-pushed the issue-1134 branch 2 times, most recently from 779d9f2 to 7a3889a Compare November 26, 2025 15:28

vmaurin marked this pull request as draft November 26, 2025 17:54

vmaurin force-pushed the issue-1134 branch 2 times, most recently from 7cecec0 to 5ccca4d Compare November 27, 2025 10:18

vmaurin marked this pull request as ready for review November 27, 2025 10:28

vmaurin force-pushed the issue-1134 branch from 5ccca4d to 49c1dda Compare November 27, 2025 13:37

ods reviewed Nov 27, 2025

View reviewed changes

Comment thread CHANGES.rst Outdated

vmaurin force-pushed the issue-1134 branch from 49c1dda to 6615f07 Compare November 27, 2025 19:29

ods merged commit 9f9161e into aio-libs:master Nov 28, 2025
25 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Handle node failure properly in sender#1135

Handle node failure properly in sender#1135
ods merged 1 commit intoaio-libs:masterfrom
vmaurin:issue-1134

vmaurin commented Nov 14, 2025 •

edited

Loading

Uh oh!

codecov Bot commented Nov 14, 2025 •

edited

Loading

Uh oh!

vmaurin commented Nov 14, 2025

Uh oh!

vmaurin commented Nov 25, 2025

Uh oh!

Uh oh!

vmaurin commented Nov 26, 2025

Uh oh!

vmaurin commented Nov 27, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

vmaurin commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Checklist

Uh oh!

codecov Bot commented Nov 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

vmaurin commented Nov 14, 2025

Uh oh!

vmaurin commented Nov 25, 2025

Uh oh!

Uh oh!

vmaurin commented Nov 26, 2025

Uh oh!

vmaurin commented Nov 27, 2025

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vmaurin commented Nov 14, 2025 •

edited

Loading

codecov Bot commented Nov 14, 2025 •

edited

Loading