ClickHouseKeeperInstallation: Keepers take too long to restart after spec.stop transition from "yes" to "no" #1931

@Jrmy2402

Description

When a ClickHouseKeeperInstallation (CHK) is stopped with spec.stop: "yes" (which scales all StatefulSets to 0 replicas) and then restarted with spec.stop: "no", the keepers take too long to come back up. The associated ClickHouseInstallation then fails or times out, because it depends on the keepers being available.

Environment

Operator version: 0.26.0

Steps to Reproduce

  1. Deploy a CHK with 3 keeper replicas
  2. Set spec.stop: "yes" -> all keeper StatefulSets scale to 0 replicas (works correctly)
  3. Set spec.stop: "no" -> keepers should restart
  4. Observe that keeper-0 starts but doesn't become Ready (the readiness probe requires Raft quorum)
  5. The operator waits for keeper-0 before proceeding to keeper-1 and keeper-2
  6. Eventually the operator times out and moves on, but the total restart time is too long, causing the ClickHouseInstallation reconciliation to fail
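
For anyone trying to reproduce steps 2 and 3, here is a minimal sketch of how the spec.stop toggle could be scripted. It only builds a `kubectl patch` command line and does not run it. The resource name `my-keeper`, the namespace `clickhouse`, and the helper `stop_patch_command` are hypothetical, and the fully qualified CRD name is my assumption about how the operator registers the CHK resource; adjust it to whatever `kubectl api-resources` reports on your cluster.

```python
import json

# Assumed fully qualified resource name for the CHK custom resource.
# Check it against `kubectl api-resources` before using it.
RESOURCE = "clickhousekeeperinstallations.clickhouse-keeper.altinity.com"


def stop_patch_command(name: str, namespace: str, stop: bool) -> list[str]:
    """Build a kubectl merge-patch command that sets spec.stop to "yes" or "no"."""
    patch = json.dumps({"spec": {"stop": "yes" if stop else "no"}})
    return [
        "kubectl", "patch", RESOURCE, name,
        "-n", namespace,
        "--type", "merge",
        "-p", patch,
    ]


if __name__ == "__main__":
    # Step 2: stop the keepers. Step 3: start them again.
    # "my-keeper" and "clickhouse" are placeholder names.
    for stop in (True, False):
        print(" ".join(stop_patch_command("my-keeper", "clickhouse", stop)))
```

The returned list can be passed to `subprocess.run` as-is. Note that spec.stop takes the quoted strings "yes" and "no", not booleans.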

Expected Behavior

After spec.stop transitions from "yes" to "no", all keeper replicas should start quickly so that Raft quorum can form and the ClickHouseInstallation can proceed normally.

Workarounds Attempted

  • Setting reconcile.statefulSet.update.onFailure: ignore on the CHK spec
  • Changing spec.taskID to force reconciliation

Neither of these significantly improves the restart time.

Additional Context

The core issue is that keepers are started sequentially and each one needs to pass a readiness check that requires quorum. Since quorum needs a majority of keepers running, the first keeper can never pass readiness on its own, creating a bottleneck.
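
To make the bottleneck concrete, here is a rough back-of-the-envelope model in Python. It is not the operator's actual logic, only a sketch of the behavior described above: each keeper that starts before a majority is running cannot become Ready, so the operator waits out a full readiness timeout before moving on. The 10-second startup time and 300-second timeout are made-up numbers for illustration, not values taken from the operator.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a Raft ensemble with `replicas` members."""
    return replicas // 2 + 1


def time_to_quorum_sequential(replicas: int, startup_s: int, ready_timeout_s: int) -> int:
    """Seconds until a majority is running when keepers start one at a time.

    Every keeper started before the majority is reached cannot pass its
    quorum-gated readiness check, so a full timeout is paid for each of them.
    """
    elapsed = 0
    running = 0
    needed = quorum_size(replicas)
    while running < needed:
        running += 1
        elapsed += startup_s
        if running < needed:
            # Not enough peers yet: readiness never passes, the wait times out.
            elapsed += ready_timeout_s
    return elapsed


def time_to_quorum_parallel(replicas: int, startup_s: int) -> int:
    """Seconds until a majority is running when all keepers start together."""
    return startup_s


if __name__ == "__main__":
    # Hypothetical timings: 10 s pod startup, 300 s readiness wait.
    print(time_to_quorum_sequential(3, 10, 300))  # 320
    print(time_to_quorum_parallel(3, 10))         # 10
```

Under these assumed timings, a 3-replica CHK pays one full timeout before quorum can form, and a 5-replica CHK pays two. Starting all replicas together would avoid the wait entirely.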

Thanks,
Jérémy

Labels

Keeper (ClickHouse Keeper issues)
