feat(recipes): add Prometheus/Alertmanager StatefulSet readiness to kube-prometheus-stack check#1497
Conversation
…ube-prometheus-stack check Closes NVIDIA#1247
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughUpdates the Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
njhensley
left a comment
There was a problem hiding this comment.
📋 Multi-persona review — Prometheus/Alertmanager StatefulSet readiness
Method: 3 independent persona reviewers (Correctness/JMESPath-semantics · Domain/prometheus-operator · Operability/CI-DX) → adversarial senior meta-reviewer that re-derived every finding from the resolved code. Line links pinned to head SHA 3470de7.
Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick
Overall assessment
A clean, well-scoped, single-file correctness improvement. It upgrades the kube-prometheus-stack health check from "operator Deployment is up + at least one Prometheus replica ready" to a full-rollout assertion across the operator, both managed CRs, and both StatefulSets — all in the fail-loud direction, all using the read-only assert op (passes the chainsaw allowlist guard). Names, JMESPath shapes, and fail-closed behavior all check out.
The one risk worth chasing — that the Alertmanager CR might never expose a status.conditions[type=Available], which would make the new assertion fail forever on a healthy cluster — was investigated and disproven: chart 84.4.0 bundles prometheus-operator v0.90.1, which populates the Available condition on both the Prometheus and Alertmanager CRs (Alertmanager status conditions have existed since ~v0.55). No overlay disables either component, so the new steps can't false-fail on a disabled-component basis.
Recommendation: ✅ Approve with comments — the two surviving findings are minor/advisory, not blocking.
Confirmed non-issues (examined and cleared)
- Timeout sufficiency —
spec.timeouts.assert: 5mis per-assert, not a global Test budget; adding steps doesn't shrink any step's window, and the value is unchanged from the pre-PR file. Default storage isemptyDir(no PVC latency); 5m is the repo-wide convention (34/35 checks). Refuted. readyReplicas >= replicas+replicas > 0being "too strict" — this is the intended full-rollout idiom (mirrors the DaemonSetnumberReady == desiredNumberScheduledform in aws-efa / nvidia-dra-driver-gpu). Refuted.- Alertmanager
Availablecondition missing (the crux) — populated at v0.90.1; tracks STS readiness. Cleared. - One-element-slice projection
(conditions[?type=='Available']): - status: "True"— on the correct side of the slice-literal anti-pattern: a JMESPath projection precedes the one-element expected value, and conditions are type-keyed (≤1Available). Fail-closed when absent/False. Cleared. - CR/STS names — all four match
assert-monitoring.yamland thefullnameOverride=kube-prometheusderivation. Cleared. - Overlay disabling alertmanager/prometheus — audited all overlays/mixins; only grafana is disabled (in
kind.yaml), both target components stay enabled. Cleared.
Summary
| 🔴 Blocker | 🟠 Major | 🟡 Minor | 🔵 Nitpick |
|---|---|---|---|
| 0 | 0 | 2 | 1 |
Recommendation: Approve with comments. No blocking issues; the two 🟡s are optional polish.
Recipe evidence checkNo leaf overlays affected by this PR. This gate is warning-only and never blocks merge. |
…ube-prometheus-stack check
3470de7 to
c328712
Compare
|
@mohityadav8, please resolve linting issues. |
Head branch was pushed to by a user without write access
4d0b007 to
cd2c65f
Compare
Signed-off-by: Mohit Yadav <[email protected]>
done |
|
thanks @njhensley :D |
Summary
Add Prometheus and Alertmanager StatefulSet readiness assertions to the
kube-prometheus-stack health check, closing #1247.
Motivation / Context
Fixes: #1247
Related: #1245, #660
Type of Change
Component(s) Affected
Implementation Notes
StatefulSet names (prometheus-kube-prometheus-prometheus,
alertmanager-kube-prometheus-alertmanager) verified from
tests/chainsaw/ai-conformance/common/assert-monitoring.yaml —
live-cluster-validated. CR names derived from fullnameOverride=kube-prometheus
in recipes/components/kube-prometheus-stack/values.yaml.
No Go code changed. No registry.yaml changes.
Testing
go test -race -count=1 ./validators/chainsaw/ -run TestValidateTestReadOnly_RegistryContent
→ 27/27 pass
go test -race -count=1 ./pkg/recipe/ -run TestComponentRegistry_RequiresHealthCheck
→ 27/27 pass
yamllint recipes/checks/kube-prometheus-stack/health-check.yaml → clean
Live-cluster validation deferred to maintainer per same agreement as #1245.
Risk Assessment
Semantically a correctness improvement (fail-loud direction).
Rollout notes: N/A
Checklist