Skip to content

fix(uat): run all validate phases to gate on GPU readiness#1416

Merged
mchmarny merged 4 commits into
NVIDIA:mainfrom
njhensley:fix/uat-run-all-validate-phases
Jun 23, 2026
Merged

fix(uat): run all validate phases to gate on GPU readiness#1416
mchmarny merged 4 commits into
NVIDIA:mainfrom
njhensley:fix/uat-run-all-validate-phases

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Switch the AWS and GCP UAT validate step from --phase conformance to --phase all, so the deployment phase runs first and gates on GPU-stack readiness before conformance.

Motivation / Context

The UAT validate step ran aicr validate --phase conformance only. Unlike the helm deploy.sh (which kubectl waits on GPU-driver migration and the DRA kubelet-plugin rollout), the helmfile bundle returns from helmfile apply as soon as each release's own resources are ready — operator-driven work (GPU driver install, nodewright node tuning, DRA plugin rollout) is still in flight. Conformance then validated a not-yet-converged cluster and failed.

The deployment phase is the missing readiness barrier: its health-check asserts poll until the GPU stack converges (gpu-operator ClusterPolicy state=ready, DRA kubelet-plugin DaemonSet fully rolled out, nodewright tuning complete) — the same gate deploy.sh provides, but deployer-agnostic. Running all phases puts that gate ahead of conformance.

Fixes: N/A
Related: failing UAT - AWS run (conformance step) on the helmfile bundle

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling

Component(s) Affected

  • Other: UAT test harness (tests/uat/{aws,gcp}/run) and UAT workflows (.github/workflows/uat-{aws,gcp}.yaml)

Implementation Notes

  • --phase allcfg.phases is nil → the documented "run all phases" default path; evidence is rendered/attested from the merged multi-phase report, so the signed bundle now covers all phases.
  • The performance phase's only check, nccl-all-reduce-bw, requires ≥2 GPU nodes for its East-West fabric test and skips gracefully on the single-GPU-node UAT clusters (skip ≠ fail), so it adds no benchmark runtime or new failure mode here; it activates automatically if a cluster ever scales to multiple GPU nodes.
  • The binary's facade validation deadline is uncapped (WithValidationTimeout(0)), so the Actions step timeout-minutes is the binding limit. Bumped the validate step 20m→40m and the job 90m→110m to absorb cold-start GPU bring-up.
  • Step id: conformance is kept so the train/verify/evidence/summary steps that gate on steps.conformance.outcome are unchanged.
  • This fixes the UAT harness; it does not change the shipped helmfile bundle (which still has no built-in readiness gate — --readiness-hooks remains unsupported for helmfile, see feat: component readiness via chainsaw post-install hook (replaces #610) #904). Tracking the deployer-level gate separately.

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run        # OK
yamllint -c .yamllint.yaml .github/workflows/uat-{aws,gcp}.yaml   # OK

No Go sources changed (shell + YAML only), so the Go test/coverage gate is not exercised. The authoritative validation is the UAT pipeline itself.

Risk Assessment

  • Low — Isolated change to the UAT harness, easy to revert; the only behavioral change is running more validate phases against the test cluster.

Rollout notes: N/A — CI-only change, no runtime/library impact.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (make lint) — yamllint + bash -n clean
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A
  • I updated docs if user-facing behavior changed — N/A (CI harness)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The AWS/GCP UATs ran `aicr validate --phase conformance` only. Unlike
the
helm deploy.sh (which `kubectl wait`s on driver migration and the DRA
plugin), the helmfile bundle returns from `helmfile apply` as soon as
each
release's own resources are ready — operator-driven work (GPU driver
install, nodewright node tuning, DRA plugin rollout) is still in flight.
Conformance then validated a not-yet-converged cluster and failed.

Switch the validate step to `--phase all`. The deployment phase is the
missing readiness barrier: its health-check asserts poll until the GPU
stack converges (gpu-operator ClusterPolicy state=ready, DRA
kubelet-plugin
DaemonSet fully rolled out, nodewright tuning complete) — the same gate
deploy.sh provides, deployer-agnostically. Evidence is rendered/attested
from the merged multi-phase report, so the signed bundle now covers all
phases. The performance phase's nccl-all-reduce-bw check needs >=2 GPU
nodes and skips gracefully on the single-GPU-node UAT clusters.

Bump the validate step timeout (20m->40m) and job timeout (90m->110m) to
absorb cold-start GPU bring-up. Mirror the change across AWS and GCP.
@njhensley njhensley requested review from a team as code owners June 23, 2026 06:30
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 905b4253-69a8-496d-ac7a-665d97a283f1

📥 Commits

Reviewing files that changed from the base of the PR and between a1b6e20 and 2a7d31d.

📒 Files selected for processing (4)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • tests/uat/aws/run
  • tests/uat/gcp/run

📝 Walkthrough

Walkthrough

Both UAT runner scripts (tests/uat/aws/run and tests/uat/gcp/run) are updated so that phase_conformance invokes aicr validate --phase all instead of --phase conformance, causing deployment readiness, conformance, and performance phases to execute together and produce a merged evidence bundle. The surrounding log group label is updated to match. The corresponding CI workflow files (uat-aws.yaml, uat-gcp.yaml) increase the job timeout from 90 to 110 minutes and the validation step timeout from 20 to 40 minutes; the step description and Test Summary row label change from "Conformance" to "Validate (all phases)" while the step id and downstream gating remain unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: switching UAT validation to run all phases instead of only conformance, with the specific purpose of gating on GPU readiness before conformance validation.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, providing clear context about the issue (helmfile returning early before GPU stack convergence), the solution (running all phases to include the deployment readiness gate), and implementation details.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@mchmarny mchmarny enabled auto-merge (squash) June 23, 2026 12:17
@mchmarny mchmarny self-requested a review June 23, 2026 12:19

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Right fix — the deployment phase is the deployer-agnostic readiness barrier the helmfile bundle lacks, and putting it ahead of conformance is the correct way to stop validating a not-yet-converged cluster.

Verified the mechanics rather than taking them on faith:

  • --phase all is a documented flag value (pkg/cli/validate.go:437) and the "all" wildcard resolves to cfg.phases == nil → the run-all-phases default path (validate.go:308-312), so evidence is rendered from the merged multi-phase report as described.
  • No --fail-fast, so all phases run and the binary exits non-zero if any phase fails (validate.go:411) — a deployment-phase failure now correctly fails the step, which is the whole point.
  • Timeout bumps are consistent (step +20m == job +20m), and AWS/GCP run files stay byte-identical.
  • Retaining id: conformance keeps the downstream steps.conformance.outcome gating untouched; the slight phase_conformance misnomer is documented inline in the updated Phases block — fine trade-off.

The comments are excellent — the rationale for why helmfile needs this where helm's deploy.sh kubectl wait doesn't is captured right where the next person will look.

@mchmarny mchmarny merged commit c4f0872 into NVIDIA:main Jun 23, 2026
30 checks passed
@njhensley njhensley deleted the fix/uat-run-all-validate-phases branch June 23, 2026 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/tests size/M theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants