fix(uat): gate on Prometheus readiness; retry conformance metric APIs#1452
Conversation
The deployment-phase readiness gate let conformance start before the metrics data pipeline was serving, so ai-service-metrics and pod-autoscaling failed against a cold Prometheus / prometheus-adapter even though the GPU stack had converged. Not the single-GPU-node count. - recipes/checks/kube-prometheus-stack: the health-check only asserted the prometheus-operator Deployment, and its pod selector (app.kubernetes.io/name=kube-prometheus-stack) matches no pods on chart 84.4.0, so the gate passed before the Prometheus server was Ready. Add a positive assert on the prometheus-kube-prometheus-prometheus StatefulSet's readyReplicas so the deployment phase waits for the server itself. - validators/conformance/pod-autoscaling: the custom- and external-metrics API probes were single-shot, racing the prometheus-adapter relist right after the deployment phase. Wrap both in a bounded, fail-closed retry (new defaults.MetricsAPIWarmupTimeout) via a testable waitForMetricsAPI helper, and fail closed on an unparseable external-metrics response. - tests/uat + workflows (folds in the NVIDIA#1439 review): share one readiness budget by checking the gate deadline before each attempt, and raise the UAT job timeout to 180m so the always()-run Destroy Cluster teardown is not cancelled by a job-level timeout. Closes NVIDIA#1425.
Recipe evidence checkNo leaf overlays affected by this PR. This gate is warning-only and never blocks merge. |
📝 WalkthroughWalkthroughThe PR increases the UAT job timeout in the AWS and GCP workflows, adds a Prometheus StatefulSet readiness gate to the kube-prometheus-stack health check, refactors AWS and GCP UAT deployment validation loops to log retries and clean up on timeout, adds a metrics API warm-up timeout constant, and updates pod autoscaling checks to poll custom and external metrics APIs with tests for the polling helper. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/uat-gcp.yaml (1)
337-356: 🩺 Stability & Availability | 🟠 MajorAdd a fail-closed guard to GCP teardown
After 3 faileddocker runattempts, this step still exits 0 because the loop just falls through. Add adestroyedflag and fail the step if teardown never succeeds, so a broken destroy doesn’t leave the GPU node behind.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/uat-gcp.yaml around lines 337 - 356, The Destroy Cluster step in uat-gcp teardown is fail-open because the retry loop falls through and exits successfully after all docker run attempts fail. Update the destroy logic to track success with a destroyed flag in the same step, set it when the docker run in the loop succeeds, and explicitly fail the bash script if none of the attempts in the Destroy Cluster block succeed so teardown cannot silently pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In @.github/workflows/uat-gcp.yaml:
- Around line 337-356: The Destroy Cluster step in uat-gcp teardown is fail-open
because the retry loop falls through and exits successfully after all docker run
attempts fail. Update the destroy logic to track success with a destroyed flag
in the same step, set it when the docker run in the loop succeeds, and
explicitly fail the bash script if none of the attempts in the Destroy Cluster
block succeed so teardown cannot silently pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3405cc7e-5d0b-4767-baf7-86964a934b8b
📒 Files selected for processing (8)
.github/workflows/uat-aws.yaml.github/workflows/uat-gcp.yamlpkg/defaults/timeouts.gorecipes/checks/kube-prometheus-stack/health-check.yamltests/uat/aws/runtests/uat/gcp/runvalidators/conformance/pod_autoscaling_check.govalidators/conformance/pod_autoscaling_check_test.go
Summary
Make the UAT conformance metric checks (
ai-service-metrics,pod-autoscaling) pass reliably: gate the deployment phase on the Prometheus server being Ready, and give the pod-autoscaling metric-API probes a bounded, fail-closed retry. Also folds in the #1439 review fixes (gate-deadline + UAT job timeout).Motivation / Context
After the deployment-phase readiness gate (#1439),
ai-service-metricsandpod-autoscalingstill failed — but not because of the single-GPU-node count. The deployment gate'skube-prometheus-stackhealth-check only asserted the prometheus-operator Deployment, and its pod selector (app.kubernetes.io/name=kube-prometheus-stack) matches zero pods on chart 84.4.0 (the real pods are labeledprometheus,grafana, etc.), so the negative pod-health asserts were vacuous. The gate therefore passed before the Prometheus server StatefulSet was serving, and conformance ran against a cold Prometheus / prometheus-adapter pipeline. The pod-autoscaling metric-API probes compounded it by being single-shot, racing the adapter's first relist.Confirmed on a live cluster that the corrected health-check gates on the Prometheus server as intended.
Fixes/Closes: #1425
Related: #1439, #1426, #1429
Type of Change
Component(s) Affected
validators/conformance,pkg/defaults)recipes/checks)tests/uat/{aws,gcp}, workflows)Implementation Notes
recipes/checks/kube-prometheus-stack/health-check.yaml): add a positive assert that theprometheus-kube-prometheus-prometheusStatefulSet hasreadyReplicas > 0, so the deployment phase waits for the server itself, not just the operator. (Hydrated into the recipe byaicrand run from the validation data — no validator-image rebuild required for this part.)validators/conformance/pod_autoscaling_check.go,pkg/defaults/timeouts.go): wrap the single-shot custom- and external-metrics API probes in a fail-closed retry (newMetricsAPIWarmupTimeout, default 60s) via a testablewaitForMetricsAPIhelper; also fail closed on an unparseable external-metrics response (previously a vacuous pass).tests/uat/{aws,gcp}/run,.github/workflows/uat-{aws,gcp}.yaml): check the readiness-gate deadline before each attempt (no run launched past the budget); raise the UAT jobtimeout-minutesto 180 so thealways()-run Destroy Cluster teardown isn't cancelled by a job-level timeout.Testing
waitForMetricsAPIunit test (immediate-success / retry-then-success / fail-closed timeout / canceled-context).validators/conformance: 26.8% → 27.0% (no decrease).-racerequires cgo (no compiler in the dev env); CI runs it.Risk Assessment
Rollout notes:
MetricsAPIWarmupTimeoutandREADINESS_TIMEOUT_SECONDSare tunable; the metric-API retries are bounded to stay within the pod-autoscaling catalog timeout.Checklist
make testwith-race) — unit tests pass;-raceruns in CI (no local cgo)make lint) — golangci-lint v2.12.2 (pinned): 0 issues; bash -n + shellcheck cleanTestWaitForMetricsAPIgit commit -S)