ClickHouseKeeperInstallation: Keepers take too long to restart after spec.stop transition from "yes" to "no"

**Description**

When spec.stop: "yes" is used to stop a ClickHouseKeeperInstallation (scaling all StatefulSets to 0 replicas) and spec.stop: "no" is then set to restart it, the keepers take too long to come back up. The associated ClickHouseInstallation then fails or times out because it depends on the keepers being available.

**Environment**

Operator version: 0.26.0

**Steps to Reproduce**

  1. Deploy a CHK with 3 keeper replicas
  2. Set spec.stop: "yes" -> all keeper StatefulSets scale to 0 replicas (works correctly)
  3. Set spec.stop: "no" -> keepers should restart
  4. Observe that keeper-0 starts but doesn't become Ready (the readiness probe requires Raft quorum)
  5. The operator waits for keeper-0 before proceeding to keeper-1 and keeper-2
  6. Eventually the operator times out and moves on, but the total restart time is too long, causing the ClickHouseInstallation reconciliation to fail
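
To make the cost of steps 4 to 6 concrete, here is a rough timing model of the behavior described above. It is only a sketch: the function names and the startup and timeout values are hypothetical, not taken from the operator, and it assumes a pod becomes Ready after a fixed startup time once a majority of keepers is running.

```python
def sequential_restart_seconds(replicas: int, startup_s: float, ready_timeout_s: float) -> float:
    """Total wait when pods start one at a time and each one is gated on Ready."""
    quorum = replicas // 2 + 1
    elapsed = 0.0
    for running in range(1, replicas + 1):
        # Below quorum the pod cannot become Ready, so the operator waits
        # for the full timeout before moving on to the next pod.
        elapsed += startup_s if running >= quorum else ready_timeout_s
    return elapsed


def parallel_restart_seconds(startup_s: float) -> float:
    """Total wait when all pods start together: quorum can form after one startup."""
    return startup_s


# Hypothetical numbers for illustration only, not measured values.
STARTUP_S = 15.0
READY_TIMEOUT_S = 300.0

print("sequential:", sequential_restart_seconds(3, STARTUP_S, READY_TIMEOUT_S))
print("parallel:  ", parallel_restart_seconds(STARTUP_S))
```

Under these assumptions the sequential path is dominated by the readiness timeout spent on keeper-0, which is the delay observed in step 6.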

**Expected Behavior**

After spec.stop transitions from "yes" to "no", all keeper replicas should start quickly so that Raft quorum can form and the ClickHouseInstallation can proceed normally.

**Workarounds Attempted**

  - Setting reconcile.statefulSet.update.onFailure: ignore on the CHK spec
  - Changing spec.taskID to force reconciliation

  Neither workaround significantly improves the restart time.
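
For reference, the two workarounds can be written as merge-patch bodies, sketched here in Python. The field names come from the list above; placing `reconcile` directly under `spec` is an assumption based on how the option is named, and the `task_id_patch` helper and the example task ID are hypothetical.

```python
import json

# Workaround 1: ignore StatefulSet update failures.
# Assumption: the option lives at spec.reconcile.statefulSet.update.onFailure.
on_failure_patch = {
    "spec": {
        "reconcile": {
            "statefulSet": {"update": {"onFailure": "ignore"}}
        }
    }
}


def task_id_patch(task_id: str) -> dict:
    """Workaround 2: change spec.taskID so the operator reconciles again."""
    return {"spec": {"taskID": task_id}}


# Print the bodies as JSON, e.g. for use with `kubectl patch --type merge -p`.
print(json.dumps(on_failure_patch))
print(json.dumps(task_id_patch("restart-1")))  # "restart-1" is an arbitrary example value
```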

**Additional Context**

The core issue is that keepers are started sequentially and each one needs to pass a readiness check that requires quorum. Since quorum needs a majority of keepers running, the first keeper can never pass readiness on its own, creating a bottleneck.
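
A minimal sketch of the quorum condition, assuming the readiness probe passes only when a strict majority of the ensemble is running. The `has_quorum` helper is illustrative and is not the operator's or Keeper's actual probe logic.

```python
def has_quorum(running: int, ensemble_size: int) -> bool:
    """Raft needs a strict majority of the ensemble to be up."""
    return running > ensemble_size // 2


ENSEMBLE_SIZE = 3  # matches the 3-replica CHK from the reproduction steps

for running in range(ENSEMBLE_SIZE + 1):
    print(f"{running} of {ENSEMBLE_SIZE} keepers running -> quorum: {has_quorum(running, ENSEMBLE_SIZE)}")
```

With 3 replicas, quorum first becomes possible when the second keeper is running, so a readiness gate on keeper-0 alone cannot be satisfied.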

Thanks,
Jérémy


ClickHouseKeeperInstallation: Keepers take too long to restart after spec.stop transition from "yes" to "no" #1931
