fix(uat): run all validate phases to gate on GPU readiness#1416
Conversation
The AWS/GCP UATs ran `aicr validate --phase conformance` only. Unlike the helm deploy.sh (which `kubectl wait`s on driver migration and the DRA plugin), the helmfile bundle returns from `helmfile apply` as soon as each release's own resources are ready — operator-driven work (GPU driver install, nodewright node tuning, DRA plugin rollout) is still in flight. Conformance then validated a not-yet-converged cluster and failed. Switch the validate step to `--phase all`. The deployment phase is the missing readiness barrier: its health-check asserts poll until the GPU stack converges (gpu-operator ClusterPolicy state=ready, DRA kubelet-plugin DaemonSet fully rolled out, nodewright tuning complete) — the same gate deploy.sh provides, deployer-agnostically. Evidence is rendered/attested from the merged multi-phase report, so the signed bundle now covers all phases. The performance phase's nccl-all-reduce-bw check needs >=2 GPU nodes and skips gracefully on the single-GPU-node UAT clusters. Bump the validate step timeout (20m->40m) and job timeout (90m->110m) to absorb cold-start GPU bring-up. Mirror the change across AWS and GCP.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughBoth UAT runner scripts ( Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
LGTM. Right fix — the deployment phase is the deployer-agnostic readiness barrier the helmfile bundle lacks, and putting it ahead of conformance is the correct way to stop validating a not-yet-converged cluster.
Verified the mechanics rather than taking them on faith:
--phase allis a documented flag value (pkg/cli/validate.go:437) and the"all"wildcard resolves tocfg.phases == nil→ the run-all-phases default path (validate.go:308-312), so evidence is rendered from the merged multi-phase report as described.- No
--fail-fast, so all phases run and the binary exits non-zero if any phase fails (validate.go:411) — a deployment-phase failure now correctly fails the step, which is the whole point. - Timeout bumps are consistent (step +20m == job +20m), and AWS/GCP
runfiles stay byte-identical. - Retaining
id: conformancekeeps the downstreamsteps.conformance.outcomegating untouched; the slightphase_conformancemisnomer is documented inline in the updated Phases block — fine trade-off.
The comments are excellent — the rationale for why helmfile needs this where helm's deploy.sh kubectl wait doesn't is captured right where the next person will look.
Summary
Switch the AWS and GCP UAT validate step from
--phase conformanceto--phase all, so the deployment phase runs first and gates on GPU-stack readiness before conformance.Motivation / Context
The UAT validate step ran
aicr validate --phase conformanceonly. Unlike the helmdeploy.sh(whichkubectl waits on GPU-driver migration and the DRA kubelet-plugin rollout), the helmfile bundle returns fromhelmfile applyas soon as each release's own resources are ready — operator-driven work (GPU driver install, nodewright node tuning, DRA plugin rollout) is still in flight. Conformance then validated a not-yet-converged cluster and failed.The deployment phase is the missing readiness barrier: its health-check asserts poll until the GPU stack converges (gpu-operator
ClusterPolicy state=ready, DRA kubelet-plugin DaemonSet fully rolled out, nodewright tuning complete) — the same gatedeploy.shprovides, but deployer-agnostic. Running all phases puts that gate ahead of conformance.Fixes: N/A
Related: failing
UAT - AWSrun (conformance step) on the helmfile bundleType of Change
Component(s) Affected
tests/uat/{aws,gcp}/run) and UAT workflows (.github/workflows/uat-{aws,gcp}.yaml)Implementation Notes
--phase all→cfg.phasesis nil → the documented "run all phases" default path; evidence is rendered/attested from the merged multi-phase report, so the signed bundle now covers all phases.nccl-all-reduce-bw, requires ≥2 GPU nodes for its East-West fabric test and skips gracefully on the single-GPU-node UAT clusters (skip ≠ fail), so it adds no benchmark runtime or new failure mode here; it activates automatically if a cluster ever scales to multiple GPU nodes.WithValidationTimeout(0)), so the Actions steptimeout-minutesis the binding limit. Bumped the validate step 20m→40m and the job 90m→110m to absorb cold-start GPU bring-up.id: conformanceis kept so thetrain/verify/evidence/summary steps that gate onsteps.conformance.outcomeare unchanged.--readiness-hooksremains unsupported for helmfile, see feat: component readiness via chainsaw post-install hook (replaces #610) #904). Tracking the deployer-level gate separately.Testing
No Go sources changed (shell + YAML only), so the Go test/coverage gate is not exercised. The authoritative validation is the UAT pipeline itself.
Risk Assessment
Rollout notes: N/A — CI-only change, no runtime/library impact.
Checklist
make testwith-race) — N/A, no Go changesmake lint) —yamllint+bash -ncleangit commit -S)