fix(uat): gate on Prometheus readiness; retry conformance metric APIs by njhensley · Pull Request #1452 · NVIDIA/aicr

njhensley · 2026-06-24T18:43:42Z

Summary

Make the UAT conformance metric checks (ai-service-metrics, pod-autoscaling) pass reliably: gate the deployment phase on the Prometheus server being Ready, and give the pod-autoscaling metric-API probes a bounded, fail-closed retry. Also folds in the #1439 review fixes (gate-deadline + UAT job timeout).

Motivation / Context

After the deployment-phase readiness gate (#1439), ai-service-metrics and pod-autoscaling still failed — but not because of the single-GPU-node count. The deployment gate's kube-prometheus-stack health-check only asserted the prometheus-operator Deployment, and its pod selector (app.kubernetes.io/name=kube-prometheus-stack) matches zero pods on chart 84.4.0 (the real pods are labeled prometheus, grafana, etc.), so the negative pod-health asserts were vacuous. The gate therefore passed before the Prometheus server StatefulSet was serving, and conformance ran against a cold Prometheus / prometheus-adapter pipeline. The pod-autoscaling metric-API probes compounded it by being single-shot, racing the adapter's first relist.

Confirmed on a live cluster that the corrected health-check gates on the Prometheus server as intended.

Fixes/Closes: #1425
Related: #1439, #1426, #1429

Type of Change

Bug fix (non-breaking change that fixes an issue)
Build/CI/tooling

Component(s) Affected

Validator (validators/conformance, pkg/defaults)
Recipe engine / data (recipes/checks)
Other: UAT test harness (tests/uat/{aws,gcp}, workflows)

Implementation Notes

Health-check gates on the Prometheus server (recipes/checks/kube-prometheus-stack/health-check.yaml): add a positive assert that the prometheus-kube-prometheus-prometheus StatefulSet has readyReplicas > 0, so the deployment phase waits for the server itself, not just the operator. (Hydrated into the recipe by aicr and run from the validation data — no validator-image rebuild required for this part.)
pod-autoscaling bounded retries (Conformance validators: bounded retries on single-shot metrics-API probes #1425) (validators/conformance/pod_autoscaling_check.go, pkg/defaults/timeouts.go): wrap the single-shot custom- and external-metrics API probes in a fail-closed retry (new MetricsAPIWarmupTimeout, default 60s) via a testable waitForMetricsAPI helper; also fail closed on an unparseable external-metrics response (previously a vacuous pass).
fix(uat): gate install on the deployment validation phase #1439 review fixes (tests/uat/{aws,gcp}/run, .github/workflows/uat-{aws,gcp}.yaml): check the readiness-gate deadline before each attempt (no run launched past the budget); raise the UAT job timeout-minutes to 180 so the always()-run Destroy Cluster teardown isn't cancelled by a job-level timeout.

Testing

go build ./validators/conformance/... ./pkg/defaults/...      # OK
go vet  ./validators/conformance/...                          # OK
go test ./validators/conformance/... -run TestWaitForMetricsAPI -v   # PASS (4 subtests)
go test -cover ./validators/conformance/...                   # ok, 27.0%
golangci-lint run -c .golangci.yaml ./validators/conformance/... ./pkg/defaults/...   # 0 issues (v2.12.2, repo-pinned)
bash -n tests/uat/{aws,gcp}/run && shellcheck -S warning tests/uat/{aws,gcp}/run      # clean

New waitForMetricsAPI unit test (immediate-success / retry-then-success / fail-closed timeout / canceled-context).
Coverage validators/conformance: 26.8% → 27.0% (no decrease).
Health-check confirmed on a live cluster.
-race requires cgo (no compiler in the dev env); CI runs it.

Risk Assessment

Low — validator change is a fail-closed retry (no fail-open); the health-check adds a stricter readiness assert; UAT-harness/workflow changes are CI-only. Worst case is a slower gate, not a masked failure.

Rollout notes: MetricsAPIWarmupTimeout and READINESS_TIMEOUT_SECONDS are tunable; the metric-API retries are bounded to stay within the pod-autoscaling catalog timeout.

Checklist

Tests pass locally (make test with -race) — unit tests pass; -race runs in CI (no local cgo)
Linter passes (make lint) — golangci-lint v2.12.2 (pinned): 0 issues; bash -n + shellcheck clean
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — TestWaitForMetricsAPI
I updated docs if user-facing behavior changed — N/A (internal UAT/validator behavior)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

The deployment-phase readiness gate let conformance start before the metrics data pipeline was serving, so ai-service-metrics and pod-autoscaling failed against a cold Prometheus / prometheus-adapter even though the GPU stack had converged. Not the single-GPU-node count. - recipes/checks/kube-prometheus-stack: the health-check only asserted the prometheus-operator Deployment, and its pod selector (app.kubernetes.io/name=kube-prometheus-stack) matches no pods on chart 84.4.0, so the gate passed before the Prometheus server was Ready. Add a positive assert on the prometheus-kube-prometheus-prometheus StatefulSet's readyReplicas so the deployment phase waits for the server itself. - validators/conformance/pod-autoscaling: the custom- and external-metrics API probes were single-shot, racing the prometheus-adapter relist right after the deployment phase. Wrap both in a bounded, fail-closed retry (new defaults.MetricsAPIWarmupTimeout) via a testable waitForMetricsAPI helper, and fail closed on an unparseable external-metrics response. - tests/uat + workflows (folds in the NVIDIA#1439 review): share one readiness budget by checking the gate deadline before each attempt, and raise the UAT job timeout to 180m so the always()-run Destroy Cluster teardown is not cancelled by a job-level timeout. Closes NVIDIA#1425.

github-actions · 2026-06-24T18:46:05Z

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

coderabbitai · 2026-06-24T18:50:17Z

📝 Walkthrough

Walkthrough

The PR increases the UAT job timeout in the AWS and GCP workflows, adds a Prometheus StatefulSet readiness gate to the kube-prometheus-stack health check, refactors AWS and GCP UAT deployment validation loops to log retries and clean up on timeout, adds a metrics API warm-up timeout constant, and updates pod autoscaling checks to poll custom and external metrics APIs with tests for the polling helper.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/aicr#1426: Also changes the UAT phase_install readiness gate around aicr validate --phase deployment.
NVIDIA/aicr#1429: Also edits the same UAT deadline/loop logic in both AWS and GCP runners.
NVIDIA/aicr#1439: Also combines UAT readiness-loop changes with workflow timeout adjustments.

Suggested labels

area/ci, area/tests, size/L

Suggested reviewers

mchmarny
ayuskauskas

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: Prometheus readiness gating and bounded retries for conformance metric APIs.
Description check	✅ Passed	The description is directly related to the diff and accurately explains the readiness gate, retries, and UAT timeout fixes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.github/workflows/uat-gcp.yaml (1)
337-356: 🩺 Stability & Availability | 🟠 Major

Add a fail-closed guard to GCP teardown
After 3 failed docker run attempts, this step still exits 0 because the loop just falls through. Add a destroyed flag and fail the step if teardown never succeeds, so a broken destroy doesn’t leave the GPU node behind.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 337 - 356, The Destroy Cluster
step in uat-gcp teardown is fail-open because the retry loop falls through and
exits successfully after all docker run attempts fail. Update the destroy logic
to track success with a destroyed flag in the same step, set it when the docker
run in the loop succeeds, and explicitly fail the bash script if none of the
attempts in the Destroy Cluster block succeed so teardown cannot silently pass.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In @.github/workflows/uat-gcp.yaml:
- Around line 337-356: The Destroy Cluster step in uat-gcp teardown is fail-open
because the retry loop falls through and exits successfully after all docker run
attempts fail. Update the destroy logic to track success with a destroyed flag
in the same step, set it when the docker run in the loop succeeds, and
explicitly fail the bash script if none of the attempts in the Destroy Cluster
block succeed so teardown cannot silently pass.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3405cc7e-5d0b-4767-baf7-86964a934b8b

📥 Commits

Reviewing files that changed from the base of the PR and between d25244b and 973f74d.

📒 Files selected for processing (8)

.github/workflows/uat-aws.yaml
.github/workflows/uat-gcp.yaml
pkg/defaults/timeouts.go
recipes/checks/kube-prometheus-stack/health-check.yaml
tests/uat/aws/run
tests/uat/gcp/run
validators/conformance/pod_autoscaling_check.go
validators/conformance/pod_autoscaling_check_test.go

njhensley and others added 2 commits June 24, 2026 10:48

Merge branch 'NVIDIA:main' into fix/uat-prometheus-readiness-gate

973f74d

njhensley requested a review from a team as a code owner June 24, 2026 18:43

njhensley added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 24, 2026

njhensley requested a review from a team as a code owner June 24, 2026 18:43

github-actions Bot added area/recipes area/ci area/tests size/L labels Jun 24, 2026

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Merge branch 'main' into fix/uat-prometheus-readiness-gate

8c5ab5d

mchmarny approved these changes Jun 24, 2026

View reviewed changes

njhensley enabled auto-merge (squash) June 24, 2026 19:01

njhensley merged commit ebbd322 into NVIDIA:main Jun 24, 2026
155 checks passed

coderabbitai Bot mentioned this pull request Jun 26, 2026

feat(recipes): add Prometheus/Alertmanager StatefulSet readiness to kube-prometheus-stack check #1497

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(uat): gate on Prometheus readiness; retry conformance metric APIs#1452

fix(uat): gate on Prometheus readiness; retry conformance metric APIs#1452
njhensley merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-prometheus-readiness-gate

njhensley commented Jun 24, 2026

Uh oh!

github-actions Bot commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhensley commented Jun 24, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Jun 24, 2026

Recipe evidence check

Uh oh!

coderabbitai Bot commented Jun 24, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants