Skip to content

fix(broker): auto-close broken connections#3412

Merged
dnwe merged 4 commits intoIBM:mainfrom
DCjanus:fix/auto-close-broken-pipe
Jan 9, 2026
Merged

fix(broker): auto-close broken connections#3412
dnwe merged 4 commits intoIBM:mainfrom
DCjanus:fix/auto-close-broken-pipe

Conversation

@DCjanus
Copy link
Copy Markdown
Contributor

@DCjanus DCjanus commented Dec 19, 2025

Background

  • Reproduces and fixes the "cannot recover after TCP reset" issue tracked in IBM/sarama#1162.
  • An earlier attempt in PR #3406 required scattered fixes in many call sites and was hard to maintain. This revision centralizes the logic inside Broker.
  • Note: The design idea is mine, but the implementation and wording were assisted by AI (I am not a native speaker). If the tone feels a bit mechanical, please kindly excuse it, and thank you for reviewing.

Approach

  • Add shouldCloseBrokerConn to detect fatal transport errors inside Broker.
  • When triggered, immediately call closeInnerLocked to close/reset the connection and preserve the original error in connErr, so callers can re-establish via explicit Open().

Changes

  • broker.go: auto-close and reset state on transport errors during ApiVersions negotiation and sendAndReceive; extract lock-held helper closeInnerLocked.
  • Unit test: TestBrokerOpenApiVersionsTransportError uses a custom dialer to reliably hit EOF, ensuring the connection is marked unusable and Open is retryable.
  • Functional test: TestFuncAdminNetworkErrorClosesControllerConnection injects a one-shot reset_peer via toxiproxy to prove Open triggers reconnection and metadata requests recover; adds proxy helper utilities.
  • Test env: docker compose down/rm always includes --profile zookeeper to avoid leftovers with Kafka < 4.
  • Dependency bumps: github.com/klauspost/compress → 1.18.2, golang.org/x/sync → 0.18.0 (no behavior change).

Verification

  • Unit: go test -run '^TestBrokerOpenApiVersionsTransportError$' -count=1 ./...
  • Functional: go test -tags functional -run '^TestFuncAdminNetworkErrorClosesControllerConnection$' -count=1 ./...

Compatibility

No breaking API changes; only closes broken connections earlier on transport errors and relies on explicit Open to reconnect.

@DCjanus DCjanus force-pushed the fix/auto-close-broken-pipe branch from 6ab92b9 to 0960b52 Compare December 19, 2025 16:20
@DCjanus DCjanus force-pushed the fix/auto-close-broken-pipe branch from 6e747d2 to f69de2e Compare December 21, 2025 15:11
Copy link
Copy Markdown
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this PR!

There's a pattern of:

if shouldCloseBrokerConn(err) {
	b.connErr = err
	_ = b.closeInnerLocked()
	// … 
}

(occasionally with a return at the // ... and occasionally not)

I wonder if we should make this a dedicated func on broker to avoid the repetition?

b.maybeCloseBrokerConn(err)

I also wasn't totally sure on the naming of closeInnerLocked, although I couldn't immediately think of a better name for it

@puellanivis
Copy link
Copy Markdown
Collaborator

I also wasn't totally sure on the naming of closeInnerLocked, although I couldn't immediately think of a better name for it

Yeah, the common pattern in Go is usually to provide a close receiver method, and a Close receiver method that calls the unexported receiver method. Since the lowercase name cannot be called by anyone outside of this package, it’s inherently “inner”. Even when a lock must be held when calling it. However, I can agree that a Locked suffix is useful, and prevents a collision with the builtin close function name, though since it’s a receiver method, scope and form of usage is usually sufficient disambiguation.

@DCjanus
Copy link
Copy Markdown
Contributor Author

DCjanus commented Dec 22, 2025

I also wasn't totally sure on the naming of closeInnerLocked, although I couldn't immediately think of a better name for it

I’m considering renaming closeInnerLocked to closeLocked since it’s a broker method and the lock requirement is implied.

Does that feel clear enough, or would you prefer something more explicit like closeConnAndResetLocked?

@DCjanus DCjanus force-pushed the fix/auto-close-broken-pipe branch from 5922d29 to 506ed17 Compare December 22, 2025 16:51
@DCjanus
Copy link
Copy Markdown
Contributor Author

DCjanus commented Dec 22, 2025

I wonder if we should make this a dedicated func on broker to avoid the repetition?

Just introduced a maybeCloseLocked helper in this commit to consolidate the repeated close-on-error logic.

@dnwe dnwe added the fix label Dec 27, 2025
Copy link
Copy Markdown
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this PR. Logically the changes look good now. Some of the code comments are a little verbose, if you’re using agent assistance then maybe prompt it to clean those up:

only add comments when they explain why, not what - if the code needs a comment to explain what it does, rewrite the code to be clearer

@DCjanus DCjanus force-pushed the fix/auto-close-broken-pipe branch from f81c761 to 6f8546b Compare December 27, 2025 02:13
@DCjanus
Copy link
Copy Markdown
Contributor Author

DCjanus commented Dec 27, 2025

I went through the comments pass and made the notes shorter. The comment-trimming commit is a138d2c. 🤝

About the extra comments: I do use an agent, but I’m still getting familiar with this codebase. I asked the agent to explain parts of the design so I could check its output. That made it add more comments than needed for reviewers. I removed most of them and kept only the useful ones. If I removed too much, please feel free to adjust the comments directly.

There was earlier feedback that closeInnerLocked felt off. I renamed it to closeLocked (commit: 6f8546b); my thinking is that a lowercase method name already implies “inner,” so Inner felt redundant.

If you prefer a different name or any other changes, please feel free to push directly to this branch — maintainer edits are enabled.

@dnwe dnwe force-pushed the fix/auto-close-broken-pipe branch 3 times, most recently from fbdeb5b to b6aaa81 Compare January 9, 2026 10:52
@dnwe dnwe changed the title fix(broker): auto-close broken connections to allow reconnect after TCP reset fix(broker): auto-close broken connections Jan 9, 2026
@dnwe dnwe force-pushed the fix/auto-close-broken-pipe branch 2 times, most recently from 06b3458 to 2b38cb1 Compare January 9, 2026 11:39
@dnwe
Copy link
Copy Markdown
Collaborator

dnwe commented Jan 9, 2026

LGTM. I've rebase the commits into a clean set and will approve and merge once the FV checks have all passed 👍🏻

Copy link
Copy Markdown
Collaborator

@dnwe dnwe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again, great contribution!

@dnwe dnwe merged commit 4637c7e into IBM:main Jan 9, 2026
17 checks passed
@DCjanus DCjanus deleted the fix/auto-close-broken-pipe branch January 9, 2026 14:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants