fix(uat): gate validate on cluster convergence (readiness wait + failFast)#1426
Conversation
…Fast) helmfile apply returns before operator-driven convergence (ClusterPolicy, DRA rollout, prometheus-adapter APIServices), so --phase all (NVIDIA#1416) ran conformance against a not-yet-ready cluster — expected-resources timed out and accelerator/ai-service-metrics/pod-autoscaling fast-failed. Add a post-install readiness gate mirroring deploy.sh's kubectl-waits, and set failFast so a failed deployment phase doesn't yield misleading conformance failures. Confirmed: re-running conformance on the converged cluster passes.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughBoth the AWS and GCP UAT runner scripts gain a configurable Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/uat/aws/run`:
- Around line 221-226: The wait_for calls on lines 221, 223, and 225 each use
the full READINESS_TIMEOUT_SECONDS, causing the total barrier time to be
multiplied by the number of checks rather than enforcing a single overall
deadline. Calculate a single deadline timestamp before these wait_for calls by
adding READINESS_TIMEOUT_SECONDS to the current time, then modify each wait_for
invocation (_clusterpolicy_ready, _apiservice_available for both custom.metrics
and external.metrics) to pass the remaining time until that deadline instead of
the full timeout value.
In `@tests/uat/gcp/run`:
- Around line 221-226: The three consecutive wait_for calls for gpu-operator
ClusterPolicy, custom.metrics.k8s.io APIService, and external.metrics.k8s.io
APIService each use the full READINESS_TIMEOUT_SECONDS value, which causes the
total timeout budget to be multiplied rather than shared across all three
checks. Refactor this section to establish a single shared deadline at the
beginning, then pass the remaining seconds (calculated from the initial deadline
minus elapsed time) to each subsequent wait_for call to ensure the combined
readiness checks do not exceed the intended total timeout budget.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 5df7add6-56a3-4b1b-887e-b8b9a525b26b
📒 Files selected for processing (4)
tests/uat/aws/runtests/uat/aws/tests/h100-training-config.yamltests/uat/gcp/runtests/uat/gcp/tests/h100-training-config.yaml
Summary
Gate the UAT
validatestep on cluster convergence: add a post-helmfile applyreadiness wait to the UAT runners and setfailFastso a failed deployment phase doesn't yield misleading conformance failures.Motivation / Context
helmfile applyreturns once each release's own objects are Ready, but not after operator-driven convergence (gpu-operator ClusterPolicy → DCGM exporter; prometheus-adapter'scustom/externalmetrics APIServices). The helmdeploy.shpathkubectl waits on these; the helmfile path does not. When the UAT validate step moved to--phase all(#1416), the deployment phase (expected-resources) timed out and the conformance checks (accelerator-metrics,ai-service-metrics,pod-autoscaling) fast-failed — all because validation ran against a not-yet-converged cluster, not because of any check, recipe, or networking defect.Confirmed empirically: re-running
aicr validate --phase conformanceagainst the converged UAT cluster passes all conformance validators. (An initial hypothesis that the multi-SG cluster topology blocked cross-subnet Prometheus/DCGM traffic was investigated and ruled out — the dials work once the stack is up.)Fixes: N/A
Related: #1425
Type of Change
Component(s) Affected
tests/uat/{aws,gcp})Implementation Notes
tests/uat/{aws,gcp}/run): await_forpoll helper blocks at the end ofphase_installuntilClusterPolicyreachesstate=ready(⇒ GPU stack incl. DCGM exporter rolled out) and bothv1beta1.custom.metrics.k8s.io/v1beta1.external.metrics.k8s.ioAPIServices areAvailable(⇒ prometheus-adapter serving). Predicates are fail-closed: an absent resource or unset status keeps polling, and the gate exits non-zero with a diagnostic on timeout (READINESS_TIMEOUT_SECONDS, default 15m) so a cluster that never converges fails the install loudly rather than handing a not-ready cluster tovalidate.failFast: true(tests/uat/{aws,gcp}/tests/h100-training-config.yaml,spec.validate.execution.failFast): on the rare path where the deployment phase still fails after the gate, conformance/performance are skipped instead of producing misleading failures. Mirrors the existing precedent intests/uat/chainsaw-config.yaml.ClusterPolicy+ the two metrics APIServices as the gate signals because they were the actual failure surface; the deployment-phase validator (now gated byfailFast) remains the comprehensive backstop, and the gate is easy to extend if DRA-rollout / nodewright ever prove to be the long pole.kube-prometheus-stackco-location selector (the Prometheus pod still carriesapp.kubernetes.io/name=prometheuson chart 84.4.0 — verified on the live cluster) and any validator Go code (the single-shot-probe hardening is tracked separately as Conformance validators: bounded retries on single-shot metrics-API probes #1425).Testing
aicr validate --phase conformanceagainst the live, converged UAT EKS cluster → all conformance validators pass, confirming the failures were a pre-convergence timing race.make qualify(not installed in the dev environment used here).Risk Assessment
tests/uat); no product/runtime code paths touched; fail-closed; trivially revertable.Rollout notes: N/A — UAT-only. Gate budget is overridable via
READINESS_TIMEOUT_SECONDS.Checklist
make testwith-race) — N/A, no Go changesmake lint) —bash -n+ YAML parse verified locally; yamllint/shellcheck run in CI (tools not present in this dev env)git commit -S)