ClickHouseKeeperInstallation: Keepers take too long to restart after spec.stop transition from "yes" to "no" #1931
Description
When spec.stop: "yes" is used to stop a ClickHouseKeeperInstallation (scaling all StatefulSets to 0 replicas) and spec.stop is then set to "no" to restart it, the keepers take too long to come back up. This causes the associated ClickHouseInstallation to fail or time out, because it depends on the keepers being available.
Environment
Operator version: 0.26.0
Steps to Reproduce
- Deploy a CHK with 3 keeper replicas
- Set spec.stop: "yes" -> all keeper StatefulSets scale to 0 replicas (works correctly)
- Set spec.stop: "no" -> keepers should restart
- Observe that keeper-0 starts but doesn't become Ready (the readiness probe requires Raft quorum)
- The operator waits for keeper-0 before proceeding to keeper-1 and keeper-2
- Eventually the operator times out and moves on, but the total restart time is too long, causing the ClickHouseInstallation reconciliation to fail
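The stop/start toggle in the steps above can be written as two JSON merge patch bodies. This is only a minimal sketch: `merge_patch` is a hypothetical helper, and applying its output with something like `kubectl patch ... --type merge -p` is an assumption about how the spec is edited, not part of the original reproduction.

```python
import json


def merge_patch(dotted_path, value):
    """Build a nested JSON merge patch body from a dotted field path.

    Hypothetical helper for illustration; it only builds the patch body
    and does not talk to a cluster.
    """
    patch = value
    # Wrap the value from the innermost key outwards.
    for key in reversed(dotted_path.split(".")):
        patch = {key: patch}
    return patch


# Step 2: stop the CHK (all keeper StatefulSets scale to 0 replicas).
stop_patch = json.dumps(merge_patch("spec.stop", "yes"))
# Step 3: start it again (this is where the slow restart shows up).
start_patch = json.dumps(merge_patch("spec.stop", "no"))

print(stop_patch)   # {"spec": {"stop": "yes"}}
print(start_patch)  # {"spec": {"stop": "no"}}
```

The same helper works for any other dotted field on the CHK spec, for example `merge_patch("spec.taskID", "some-new-id")`, where the ID value is a placeholder.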
Expected Behavior
After spec.stop transitions from "yes" to "no", all keeper replicas should start quickly so that Raft quorum can form and the ClickHouseInstallation can proceed normally.
Workarounds Attempted
- Setting reconcile.statefulSet.update.onFailure: ignore on the CHK spec
- Changing spec.taskID to force reconciliation
Neither of these significantly improves the restart time.
Additional Context
The core issue is that keepers are started sequentially and each one needs to pass a readiness check that requires quorum. Since quorum needs a majority of keepers running, the first keeper can never pass readiness on its own, creating a bottleneck.
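To make the bottleneck concrete, here is a toy model of the behavior described above. It is a sketch under stated assumptions, not the operator's actual logic: a replica is treated as Ready only once a Raft majority is running, the operator is assumed to wait one full readiness timeout per replica that cannot become Ready, and the 300-second timeout is a made-up value. Pod startup and leader election time are ignored.

```python
def quorum(replicas):
    """Smallest majority of a Raft ensemble with `replicas` members."""
    return replicas // 2 + 1


def restart_wait(replicas, ready_timeout_s, batch_size=1):
    """Seconds lost to readiness timeouts while restarting the ensemble.

    batch_size=1 models the sequential start described in this issue;
    batch_size=replicas models starting every keeper at once.
    """
    running = 0
    waited = 0
    while running < replicas:
        running = min(replicas, running + batch_size)
        if running < quorum(replicas):
            # No majority yet, so the newly started replica cannot pass
            # readiness and the (assumed) full timeout elapses.
            waited += ready_timeout_s
    return waited


# Hypothetical 300 s readiness timeout.
print(restart_wait(3, 300))                # 300: keeper-0 alone has no quorum
print(restart_wait(5, 300))                # 600: two replicas time out first
print(restart_wait(3, 300, batch_size=3))  # 0: quorum forms immediately
```

In this model the sequential path always pays `quorum(replicas) - 1` timeouts before any keeper can become Ready, while starting all replicas together pays none.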
Thanks,
Jérémy