fix(admin): retry on network errors by resetting controller connection by DCjanus · Pull Request #3406 · IBM/sarama

DCjanus · 2025-12-15T04:15:02Z

Context
- Admin calls can get stuck after a network failure (broken pipe / connection reset) because the cached controller connection isn’t discarded. See issue #1162, "ClusterAdmin should reconnect broker after disconnected".
- The current admin retry only considers ErrNotController/EOF as retriable (definition at admin.go#L214-L225), so network errors bubble out without healing the connection.
What’s changed
- Treat network failures (net.Error, io.ErrUnexpectedEOF, EPIPE) as retriable in admin/controller paths.
- When such an error occurs inside retryOnError, close the cached controller connection before the next attempt; the subsequent call to Controller() in each admin operation (e.g. CreateTopic at admin.go#L265-L288) reopens a fresh TCP socket automatically.
Why it’s safe
- The close happens only after a request has already failed with a network error, inside the admin retry loop (admin.go#L230-L242). Healthy calls are untouched.
- Reconnect reuses the existing controller metadata; if the broker has changed, the next retry still hits the existing ErrNotController path, which already triggers RefreshController()—same behavior as today, just without reusing a broken socket.
- Scope is limited to ClusterAdmin; producer/consumer code paths are unchanged.

Signed-off-by: DCjanus <[email protected]>

DCjanus · 2025-12-15T14:25:56Z

After further consideration, I realized that this fix is incorrect. I need to think through a more appropriate implementation.

DCjanus added 2 commits December 15, 2025 12:13

fix(admin): retry on network errors by resetting controller connection

8c61b9b

Signed-off-by: DCjanus <[email protected]>

chore: fix spelling in admin network retry comment

21f2702

Signed-off-by: DCjanus <[email protected]>

DCjanus mentioned this pull request Dec 15, 2025

fix(admin): retry admin calls on network errors #3407

Closed

DCjanus closed this Dec 15, 2025

DCjanus mentioned this pull request Dec 19, 2025

fix(broker): auto-close broken connections #3412

Merged

DCjanus deleted the fix/admin-network-retry branch December 25, 2025 18:48
