Add zk error handling and logging #1762
Merged
sunsingerus merged 1 commit into Altinity:0.25.3 on Jul 16, 2025
Contributor
Author
Please let me know if this is the right branch to merge into. If not, I can rebase onto the correct one.
- mock out zk connection for testing
- surface errors that were being swallowed up/ignored
- make retry delay configurable (useful for testing)
- add logging to pathmanager when checking for path existence

Signed-off-by: Michael Wilkerson <[email protected]>
Collaborator
The PR is quite big; I will take a look later.
Member
To be added to 0.25.3.
sunsingerus
approved these changes
Jul 16, 2025
Summary
This PR improves error handling and observability for Keeper operations.
Previously, Keeper connection failures during retries were not visible, making it difficult to diagnose deployment issues.
Motivation
When deploying the operator in a separate namespace from the ClickHouse and Keeper clusters, we experienced a consistent ~10 minute delay for ClickHouse pods to start after Keeper pods were ready.
The root cause was that cross-namespace deployments require fully qualified DNS names (including namespace) in the ClickHouse configuration to properly resolve Keeper endpoints. However, the existing retry logic silently swallowed connection errors, making this configuration issue nearly impossible to diagnose.
This change surfaces these errors to help operators quickly identify and resolve similar DNS/connectivity issues.
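For reference, a Kubernetes Service is reachable from another namespace under a name of the form `<service>.<namespace>.svc.<cluster-domain>`. The short Go sketch below builds such a name. The helper `keeperFQDN` and the service and namespace values are hypothetical, and `cluster.local` is only the default cluster domain, which a given cluster may override.

```go
package main

import "fmt"

// keeperFQDN builds the fully qualified in-cluster DNS name of a Service.
// This is an illustrative helper, not part of clickhouse-operator.
func keeperFQDN(service, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.svc.%s", service, namespace, clusterDomain)
}

func main() {
	// Hypothetical service "keeper" in namespace "clickhouse".
	fmt.Println(keeperFQDN("keeper", "clickhouse", "cluster.local"))
	// prints "keeper.clickhouse.svc.cluster.local"
}
```

A short name such as `keeper` resolves only from pods in the same namespace as the Service, which is why the unqualified form fails in a cross-namespace setup.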
Testing
Tested in a local kind cluster:
Before fix: With incorrect DNS configuration, connection failures produced no error logs. The only output was a warning about retrying the connection.
After fix: Clear error messages now show the specific connection failures, making the DNS misconfiguration immediately apparent.
Once the DNS configuration was corrected with fully qualified names, ClickHouse pods started immediately after Keeper became available 🎆.
From Altinity
Thanks for taking the time to contribute to clickhouse-operator!
Please read the instructions on how to make a Pull Request carefully.
This will make it much easier for maintainers to adopt your Pull Request.
Important items to consider before making a Pull Request
Please check the items the PR complies with:
next-release branch, not into master branch [1]. More info
[1] If you feel your PR does not affect any Go code or any testable functionality (for example, the PR contains docs only or supplementary materials), the PR can be made into the master branch, but this has to be confirmed by the project's maintainer.