Skip to content

feat(recipes): add Prometheus/Alertmanager StatefulSet readiness to kube-prometheus-stack check#1497

Merged
njhensley merged 4 commits into
NVIDIA:mainfrom
mohityadav8:feat/1247-kube-prometheus-statefulset-readiness
Jun 26, 2026
Merged

feat(recipes): add Prometheus/Alertmanager StatefulSet readiness to kube-prometheus-stack check#1497
njhensley merged 4 commits into
NVIDIA:mainfrom
mohityadav8:feat/1247-kube-prometheus-statefulset-readiness

Conversation

@mohityadav8

Copy link
Copy Markdown
Contributor

Summary

Add Prometheus and Alertmanager StatefulSet readiness assertions to the
kube-prometheus-stack health check, closing #1247.

Motivation / Context

Fixes: #1247
Related: #1245, #660

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Other: recipes/checks (health-check YAML only)

Implementation Notes

StatefulSet names (prometheus-kube-prometheus-prometheus,
alertmanager-kube-prometheus-alertmanager) verified from
tests/chainsaw/ai-conformance/common/assert-monitoring.yaml —
live-cluster-validated. CR names derived from fullnameOverride=kube-prometheus
in recipes/components/kube-prometheus-stack/values.yaml.

No Go code changed. No registry.yaml changes.

Testing

go test -race -count=1 ./validators/chainsaw/ -run TestValidateTestReadOnly_RegistryContent
→ 27/27 pass
go test -race -count=1 ./pkg/recipe/ -run TestComponentRegistry_RequiresHealthCheck
→ 27/27 pass
yamllint recipes/checks/kube-prometheus-stack/health-check.yaml → clean

Live-cluster validation deferred to maintainer per same agreement as #1245.

Risk Assessment

  • Low — Single YAML file change. Isolated to one health-check.
    Semantically a correctness improvement (fail-loud direction).
    Rollout notes: N/A

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • No new tests added — existing two lint guards already cover this
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@mohityadav8 mohityadav8 requested a review from a team as a code owner June 26, 2026 17:23
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d227b04f-fe6b-42e5-b473-3729aadbf781

📥 Commits

Reviewing files that changed from the base of the PR and between c328712 and cd2c65f.

📒 Files selected for processing (1)
  • recipes/checks/kube-prometheus-stack/health-check.yaml

📝 Walkthrough

Walkthrough

Updates the kube-prometheus-stack Chainsaw health check to document the validation scope, assert Available=True for the Prometheus and Alertmanager custom resources, and require both StatefulSets to have replicas > 0 with readyReplicas >= replicas. It also clarifies the naming and selector conventions used in the monitoring namespace.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1452: Updates the same kube-prometheus-stack health-check area with related readiness gating changes for Prometheus stack validation.

Suggested labels

theme/validation

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding Prometheus/Alertmanager StatefulSet readiness checks to the kube-prometheus-stack health check.
Description check ✅ Passed The description is directly related and accurately describes the YAML-only health-check update, testing notes, and issue closure.
Linked Issues check ✅ Passed The PR meets #1247 by adding Prometheus/Alertmanager CR availability checks and StatefulSet full-rollout assertions with vacuous-pass guards.
Out of Scope Changes check ✅ Passed The change stays within the requested health-check YAML scope and adds no unrelated code paths or files.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

njhensley
njhensley previously approved these changes Jun 26, 2026

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋 Multi-persona review — Prometheus/Alertmanager StatefulSet readiness

Method: 3 independent persona reviewers (Correctness/JMESPath-semantics · Domain/prometheus-operator · Operability/CI-DX) → adversarial senior meta-reviewer that re-derived every finding from the resolved code. Line links pinned to head SHA 3470de7.

Tier legend: 🔴 Blocker · 🟠 Major · 🟡 Minor · 🔵 Nitpick

Overall assessment

A clean, well-scoped, single-file correctness improvement. It upgrades the kube-prometheus-stack health check from "operator Deployment is up + at least one Prometheus replica ready" to a full-rollout assertion across the operator, both managed CRs, and both StatefulSets — all in the fail-loud direction, all using the read-only assert op (passes the chainsaw allowlist guard). Names, JMESPath shapes, and fail-closed behavior all check out.

The one risk worth chasing — that the Alertmanager CR might never expose a status.conditions[type=Available], which would make the new assertion fail forever on a healthy cluster — was investigated and disproven: chart 84.4.0 bundles prometheus-operator v0.90.1, which populates the Available condition on both the Prometheus and Alertmanager CRs (Alertmanager status conditions have existed since ~v0.55). No overlay disables either component, so the new steps can't false-fail on a disabled-component basis.

Recommendation: ✅ Approve with comments — the two surviving findings are minor/advisory, not blocking.

Confirmed non-issues (examined and cleared)

  • Timeout sufficiencyspec.timeouts.assert: 5m is per-assert, not a global Test budget; adding steps doesn't shrink any step's window, and the value is unchanged from the pre-PR file. Default storage is emptyDir (no PVC latency); 5m is the repo-wide convention (34/35 checks). Refuted.
  • readyReplicas >= replicas + replicas > 0 being "too strict" — this is the intended full-rollout idiom (mirrors the DaemonSet numberReady == desiredNumberScheduled form in aws-efa / nvidia-dra-driver-gpu). Refuted.
  • Alertmanager Available condition missing (the crux) — populated at v0.90.1; tracks STS readiness. Cleared.
  • One-element-slice projection (conditions[?type=='Available']): - status: "True" — on the correct side of the slice-literal anti-pattern: a JMESPath projection precedes the one-element expected value, and conditions are type-keyed (≤1 Available). Fail-closed when absent/False. Cleared.
  • CR/STS names — all four match assert-monitoring.yaml and the fullnameOverride=kube-prometheus derivation. Cleared.
  • Overlay disabling alertmanager/prometheus — audited all overlays/mixins; only grafana is disabled (in kind.yaml), both target components stay enabled. Cleared.

Summary

🔴 Blocker 🟠 Major 🟡 Minor 🔵 Nitpick
0 0 2 1

Recommendation: Approve with comments. No blocking issues; the two 🟡s are optional polish.

Comment thread recipes/checks/kube-prometheus-stack/health-check.yaml
Comment thread recipes/checks/kube-prometheus-stack/health-check.yaml
Comment thread recipes/checks/kube-prometheus-stack/health-check.yaml Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

@mohityadav8 mohityadav8 force-pushed the feat/1247-kube-prometheus-statefulset-readiness branch from 3470de7 to c328712 Compare June 26, 2026 18:37
@mohityadav8 mohityadav8 requested a review from njhensley June 26, 2026 18:40
njhensley
njhensley previously approved these changes Jun 26, 2026
@njhensley njhensley enabled auto-merge (squash) June 26, 2026 18:52
@njhensley

Copy link
Copy Markdown
Member

@mohityadav8, please resolve linting issues.

auto-merge was automatically disabled June 26, 2026 19:05

Head branch was pushed to by a user without write access

@mohityadav8 mohityadav8 force-pushed the feat/1247-kube-prometheus-statefulset-readiness branch from 4d0b007 to cd2c65f Compare June 26, 2026 19:05
@mohityadav8

Copy link
Copy Markdown
Contributor Author

@mohityadav8, please resolve linting issues.

done

@mohityadav8 mohityadav8 requested a review from njhensley June 26, 2026 19:08
@njhensley njhensley merged commit caecdd0 into NVIDIA:main Jun 26, 2026
150 checks passed
@mohityadav8

Copy link
Copy Markdown
Contributor Author

thanks @njhensley :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Prometheus + Alertmanager StatefulSet readiness to kube-prometheus-stack check

2 participants