Skip to content

Handle node failure properly in sender#1135

Merged
ods merged 1 commit intoaio-libs:masterfrom
vmaurin:issue-1134
Nov 28, 2025
Merged

Handle node failure properly in sender#1135
ods merged 1 commit intoaio-libs:masterfrom
vmaurin:issue-1134

Conversation

@vmaurin
Copy link
Copy Markdown
Contributor

@vmaurin vmaurin commented Nov 14, 2025

Various sender activities require the sender to talk to a dedicated coordinator node, for a related group or for transaction management.

When trying to reach this node, an error could happen because of the broker being unavailable or unreachable. In this case, we must mark the coordinator dead, so instead of retrying over and over to the same node, we will first try to find an active coordinator.

fixes #1134

Changes

Fixes #1134

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • Add a new news fragment into the CHANGES folder
    • name it <issue_id>.<type> (e.g. 588.bugfix)
    • if you don't have an issue_id change it to the pr id after creating the PR
    • ensure type is one of the following:
      • .feature: Signifying a new feature.
      • .bugfix: Signifying a bug fix.
      • .doc: Signifying a documentation improvement.
      • .removal: Signifying a deprecation or removal of public API.
      • .misc: A ticket has been closed, but it is not of interest to users.
    • Make sure to use full sentences with correct case and punctuation, for example: Fix issue with non-ascii contents in doctest text files.

@vmaurin vmaurin force-pushed the issue-1134 branch 3 times, most recently from c8f4f51 to 1290a2c Compare November 14, 2025 14:44
@codecov
Copy link
Copy Markdown

codecov Bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 94.80%. Comparing base (7d0bd25) to head (6615f07).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
aiokafka/producer/sender.py 95.83% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1135      +/-   ##
==========================================
- Coverage   94.83%   94.80%   -0.03%     
==========================================
  Files          88       88              
  Lines       15671    15716      +45     
  Branches     1374     1374              
==========================================
+ Hits        14861    14900      +39     
- Misses        565      570       +5     
- Partials      245      246       +1     
Flag Coverage Δ
cext 94.76% <97.87%> (-0.03%) ⬇️
integration 94.69% <97.87%> (-0.03%) ⬇️
purepy 94.76% <97.87%> (-0.03%) ⬇️
unit 51.18% <14.89%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vmaurin vmaurin force-pushed the issue-1134 branch 2 times, most recently from 0a7c691 to d23e281 Compare November 14, 2025 15:06
@vmaurin
Copy link
Copy Markdown
Contributor Author

vmaurin commented Nov 14, 2025

@ods Does it look like a good approach to you ?

@vmaurin
Copy link
Copy Markdown
Contributor Author

vmaurin commented Nov 25, 2025

@ods The logic of this PR is too mark the (transaction or group) coordinator dead when we encounter a low level error while trying to send a message to it. On the retry, it will give the opportunity to find the eventual new coordinator, while currently it is just looping over and over like described in #1134

Comment thread aiokafka/producer/sender.py Outdated
ods
ods previously approved these changes Nov 25, 2025
@vmaurin vmaurin force-pushed the issue-1134 branch 2 times, most recently from 779d9f2 to 7a3889a Compare November 26, 2025 15:28
@vmaurin vmaurin marked this pull request as draft November 26, 2025 17:54
@vmaurin
Copy link
Copy Markdown
Contributor Author

vmaurin commented Nov 26, 2025

Back to draft.
This PR was made on the assumption that the handling in handle_response was working as expected, but that is not the case. Even marking the coordinator dead won't trigger a new discovery / retry

@ods ods dismissed their stale review November 27, 2025 09:16

Waiting for a new solution

@vmaurin vmaurin force-pushed the issue-1134 branch 2 times, most recently from 7cecec0 to 5ccca4d Compare November 27, 2025 10:18
@vmaurin
Copy link
Copy Markdown
Contributor Author

vmaurin commented Nov 27, 2025

Back to draft. This PR was made on the assumption that the handling in handle_response was working as expected, but that is not the case. Even marking the coordinator dead won't trigger a new discovery / retry

@ods I was wrong !

I did a dumb mistake as on my testing project, the __transaction_state topic had a replication factor of 1. When I was killing the coordinator, no new leader was elected and the client was getting a -1 response as coordinator.

Setting the proper replication factor for __transaction_state, I managed to have the proper use cases:

  • I simulate a loop of transactional processing
  • I find the coordinator with bin/kafka-transactions.sh --bootstrap-server kafka1:9092,kafka2:9092,kafka3:9092 list
  • I kill the coordinator
  • after getting a NodeNotAvailableError, the producer is finding the new coordinator (this time a new broker is elected leader)
  • the transaction can complete properly

On master branch, the producer is looping over NodeNotAvailableError as the coordinator is never marked dead.

I reduce the exception scope to only two exceptions, we might need to had more if needed, but it feels more appropriated like this.

So I would consider this PR ready to merge

@vmaurin vmaurin marked this pull request as ready for review November 27, 2025 10:28
Comment thread CHANGES.rst Outdated
Various sender activities require the sender to talk to a dedicated
coordinator node, for a related group or for transaction management.

When trying to reach this node, an error could happen because of the
broker being unavailable or unreachable. In this case, we must mark the
coordinator dead, so instead of retrying over and over to the same node,
we will first try to find an active coordinator.

fixes aio-libs#1134
@ods ods merged commit 9f9161e into aio-libs:master Nov 28, 2025
25 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The transaction coordinator is never refreshed

2 participants