fix(admin): retry on network errors by resetting controller connection#3406

Closed
DCjanus wants to merge 2 commits into IBM:main from DCjanus:fix/admin-network-retry


DCjanus commented Dec 15, 2025

  • Context

  • What’s changed

    • Treat network failures (net.Error, io.ErrUnexpectedEOF, EPIPE) as retriable in admin/controller paths.
    • When such an error occurs inside retryOnError, close the cached controller connection before the next attempt; the subsequent call to Controller() in each admin operation (e.g. CreateTopic at admin.go#L265-L288) reopens a fresh TCP socket automatically.
  • Why it’s safe

    • The close happens only after a request has already failed with a network error, inside the admin retry loop (admin.go#L230-L242). Healthy calls are untouched.
    • Reconnecting reuses the existing controller metadata. If the broker has changed, the next retry still hits the existing ErrNotController path, which already triggers RefreshController(). This is the same behavior as today, just without reusing a broken socket.
    • Scope is limited to ClusterAdmin; producer/consumer code paths are unchanged.


DCjanus commented Dec 15, 2025

After further consideration, I realized that my fix was incorrect, and I need to think more about a more appropriate implementation.

DCjanus deleted the fix/admin-network-retry branch December 25, 2025 18:48