Skip to content

fix(uat): gate on Prometheus readiness; retry conformance metric APIs#1452

Merged
njhensley merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-prometheus-readiness-gate
Jun 24, 2026
Merged

fix(uat): gate on Prometheus readiness; retry conformance metric APIs#1452
njhensley merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-prometheus-readiness-gate

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Make the UAT conformance metric checks (ai-service-metrics, pod-autoscaling) pass reliably: gate the deployment phase on the Prometheus server being Ready, and give the pod-autoscaling metric-API probes a bounded, fail-closed retry. Also folds in the #1439 review fixes (gate-deadline + UAT job timeout).

Motivation / Context

After the deployment-phase readiness gate (#1439), ai-service-metrics and pod-autoscaling still failed — but not because of the single-GPU-node count. The deployment gate's kube-prometheus-stack health-check only asserted the prometheus-operator Deployment, and its pod selector (app.kubernetes.io/name=kube-prometheus-stack) matches zero pods on chart 84.4.0 (the real pods are labeled prometheus, grafana, etc.), so the negative pod-health asserts were vacuous. The gate therefore passed before the Prometheus server StatefulSet was serving, and conformance ran against a cold Prometheus / prometheus-adapter pipeline. The pod-autoscaling metric-API probes compounded it by being single-shot, racing the adapter's first relist.

Confirmed on a live cluster that the corrected health-check gates on the Prometheus server as intended.

Fixes/Closes: #1425
Related: #1439, #1426, #1429

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Validator (validators/conformance, pkg/defaults)
  • Recipe engine / data (recipes/checks)
  • Other: UAT test harness (tests/uat/{aws,gcp}, workflows)

Implementation Notes

  • Health-check gates on the Prometheus server (recipes/checks/kube-prometheus-stack/health-check.yaml): add a positive assert that the prometheus-kube-prometheus-prometheus StatefulSet has readyReplicas > 0, so the deployment phase waits for the server itself, not just the operator. (Hydrated into the recipe by aicr and run from the validation data — no validator-image rebuild required for this part.)
  • pod-autoscaling bounded retries (Conformance validators: bounded retries on single-shot metrics-API probes #1425) (validators/conformance/pod_autoscaling_check.go, pkg/defaults/timeouts.go): wrap the single-shot custom- and external-metrics API probes in a fail-closed retry (new MetricsAPIWarmupTimeout, default 60s) via a testable waitForMetricsAPI helper; also fail closed on an unparseable external-metrics response (previously a vacuous pass).
  • fix(uat): gate install on the deployment validation phase #1439 review fixes (tests/uat/{aws,gcp}/run, .github/workflows/uat-{aws,gcp}.yaml): check the readiness-gate deadline before each attempt (no run launched past the budget); raise the UAT job timeout-minutes to 180 so the always()-run Destroy Cluster teardown isn't cancelled by a job-level timeout.

Testing

go build ./validators/conformance/... ./pkg/defaults/...      # OK
go vet  ./validators/conformance/...                          # OK
go test ./validators/conformance/... -run TestWaitForMetricsAPI -v   # PASS (4 subtests)
go test -cover ./validators/conformance/...                   # ok, 27.0%
golangci-lint run -c .golangci.yaml ./validators/conformance/... ./pkg/defaults/...   # 0 issues (v2.12.2, repo-pinned)
bash -n tests/uat/{aws,gcp}/run && shellcheck -S warning tests/uat/{aws,gcp}/run      # clean
  • New waitForMetricsAPI unit test (immediate-success / retry-then-success / fail-closed timeout / canceled-context).
  • Coverage validators/conformance: 26.8% → 27.0% (no decrease).
  • Health-check confirmed on a live cluster.
  • -race requires cgo (no compiler in the dev env); CI runs it.

Risk Assessment

  • Low — validator change is a fail-closed retry (no fail-open); the health-check adds a stricter readiness assert; UAT-harness/workflow changes are CI-only. Worst case is a slower gate, not a masked failure.

Rollout notes: MetricsAPIWarmupTimeout and READINESS_TIMEOUT_SECONDS are tunable; the metric-API retries are bounded to stay within the pod-autoscaling catalog timeout.

Checklist

  • Tests pass locally (make test with -race) — unit tests pass; -race runs in CI (no local cgo)
  • Linter passes (make lint) — golangci-lint v2.12.2 (pinned): 0 issues; bash -n + shellcheck clean
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — TestWaitForMetricsAPI
  • I updated docs if user-facing behavior changed — N/A (internal UAT/validator behavior)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

njhensley and others added 2 commits June 24, 2026 10:48
The deployment-phase readiness gate let conformance start before the
metrics data pipeline was serving, so ai-service-metrics and
pod-autoscaling
failed against a cold Prometheus / prometheus-adapter even though the
GPU
stack had converged. Not the single-GPU-node count.

- recipes/checks/kube-prometheus-stack: the health-check only asserted
the
  prometheus-operator Deployment, and its pod selector
  (app.kubernetes.io/name=kube-prometheus-stack) matches no pods on
chart
  84.4.0, so the gate passed before the Prometheus server was Ready. Add
a
  positive assert on the prometheus-kube-prometheus-prometheus
StatefulSet's
  readyReplicas so the deployment phase waits for the server itself.

- validators/conformance/pod-autoscaling: the custom- and
external-metrics
  API probes were single-shot, racing the prometheus-adapter relist
right
  after the deployment phase. Wrap both in a bounded, fail-closed retry
  (new defaults.MetricsAPIWarmupTimeout) via a testable
waitForMetricsAPI
  helper, and fail closed on an unparseable external-metrics response.

- tests/uat + workflows (folds in the NVIDIA#1439 review): share one readiness
  budget by checking the gate deadline before each attempt, and raise
the
  UAT job timeout to 180m so the always()-run Destroy Cluster teardown
is
  not cancelled by a job-level timeout.

Closes NVIDIA#1425.
@njhensley njhensley requested a review from a team as a code owner June 24, 2026 18:43
@njhensley njhensley added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 24, 2026
@njhensley njhensley requested a review from a team as a code owner June 24, 2026 18:43
@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR increases the UAT job timeout in the AWS and GCP workflows, adds a Prometheus StatefulSet readiness gate to the kube-prometheus-stack health check, refactors AWS and GCP UAT deployment validation loops to log retries and clean up on timeout, adds a metrics API warm-up timeout constant, and updates pod autoscaling checks to poll custom and external metrics APIs with tests for the polling helper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/aicr#1426: Also changes the UAT phase_install readiness gate around aicr validate --phase deployment.
  • NVIDIA/aicr#1429: Also edits the same UAT deadline/loop logic in both AWS and GCP runners.
  • NVIDIA/aicr#1439: Also combines UAT readiness-loop changes with workflow timeout adjustments.

Suggested labels

area/ci, area/tests, size/L

Suggested reviewers

  • mchmarny
  • ayuskauskas
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: Prometheus readiness gating and bounded retries for conformance metric APIs.
Description check ✅ Passed The description is directly related to the diff and accurately explains the readiness gate, retries, and UAT timeout fixes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/uat-gcp.yaml (1)

337-356: 🩺 Stability & Availability | 🟠 Major

Add a fail-closed guard to GCP teardown
After 3 failed docker run attempts, this step still exits 0 because the loop just falls through. Add a destroyed flag and fail the step if teardown never succeeds, so a broken destroy doesn’t leave the GPU node behind.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 337 - 356, The Destroy Cluster
step in uat-gcp teardown is fail-open because the retry loop falls through and
exits successfully after all docker run attempts fail. Update the destroy logic
to track success with a destroyed flag in the same step, set it when the docker
run in the loop succeeds, and explicitly fail the bash script if none of the
attempts in the Destroy Cluster block succeed so teardown cannot silently pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/uat-gcp.yaml:
- Around line 337-356: The Destroy Cluster step in uat-gcp teardown is fail-open
because the retry loop falls through and exits successfully after all docker run
attempts fail. Update the destroy logic to track success with a destroyed flag
in the same step, set it when the docker run in the loop succeeds, and
explicitly fail the bash script if none of the attempts in the Destroy Cluster
block succeed so teardown cannot silently pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3405cc7e-5d0b-4767-baf7-86964a934b8b

📥 Commits

Reviewing files that changed from the base of the PR and between d25244b and 973f74d.

📒 Files selected for processing (8)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • pkg/defaults/timeouts.go
  • recipes/checks/kube-prometheus-stack/health-check.yaml
  • tests/uat/aws/run
  • tests/uat/gcp/run
  • validators/conformance/pod_autoscaling_check.go
  • validators/conformance/pod_autoscaling_check_test.go

@njhensley njhensley enabled auto-merge (squash) June 24, 2026 19:01
@njhensley njhensley merged commit ebbd322 into NVIDIA:main Jun 24, 2026
155 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/recipes area/tests size/L theme/validation Constraint evaluation, health checks, and conformance evidence

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Conformance validators: bounded retries on single-shot metrics-API probes

2 participants