fix(uat): gate validate on cluster convergence (readiness wait + failFast) by njhensley · Pull Request #1426 · NVIDIA/aicr

njhensley · 2026-06-23T18:07:50Z

Summary

Gate the UAT validate step on cluster convergence: add a post-helmfile apply readiness wait to the UAT runners and set failFast so a failed deployment phase doesn't yield misleading conformance failures.

Motivation / Context

helmfile apply returns once each release's own objects are Ready, but not after operator-driven convergence (gpu-operator ClusterPolicy → DCGM exporter; prometheus-adapter's custom/external metrics APIServices). The helm deploy.sh path kubectl waits on these; the helmfile path does not. When the UAT validate step moved to --phase all (#1416), the deployment phase (expected-resources) timed out and the conformance checks (accelerator-metrics, ai-service-metrics, pod-autoscaling) fast-failed — all because validation ran against a not-yet-converged cluster, not because of any check, recipe, or networking defect.

Confirmed empirically: re-running aicr validate --phase conformance against the converged UAT cluster passes all conformance validators. (An initial hypothesis that the multi-SG cluster topology blocked cross-subnet Prometheus/DCGM traffic was investigated and ruled out — the dials work once the stack is up.)

Fixes: N/A
Related: #1425

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Other: UAT test harness (tests/uat/{aws,gcp})

Implementation Notes

Readiness gate (tests/uat/{aws,gcp}/run): a wait_for poll helper blocks at the end of phase_install until ClusterPolicy reaches state=ready (⇒ GPU stack incl. DCGM exporter rolled out) and both v1beta1.custom.metrics.k8s.io / v1beta1.external.metrics.k8s.io APIServices are Available (⇒ prometheus-adapter serving). Predicates are fail-closed: an absent resource or unset status keeps polling, and the gate exits non-zero with a diagnostic on timeout (READINESS_TIMEOUT_SECONDS, default 15m) so a cluster that never converges fails the install loudly rather than handing a not-ready cluster to validate.
failFast: true (tests/uat/{aws,gcp}/tests/h100-training-config.yaml, spec.validate.execution.failFast): on the rare path where the deployment phase still fails after the gate, conformance/performance are skipped instead of producing misleading failures. Mirrors the existing precedent in tests/uat/chainsaw-config.yaml.
Chose ClusterPolicy + the two metrics APIServices as the gate signals because they were the actual failure surface; the deployment-phase validator (now gated by failFast) remains the comprehensive backstop, and the gate is easy to extend if DRA-rollout / nodewright ever prove to be the long pole.
Deliberately not changed: the kube-prometheus-stack co-location selector (the Prometheus pod still carries app.kubernetes.io/name=prometheus on chart 84.4.0 — verified on the live cluster) and any validator Go code (the single-shot-probe hardening is tracked separately as Conformance validators: bounded retries on single-shot metrics-API probes #1425).

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run   # shell syntax OK
# YAML parses; spec.validate.execution.failFast=true present; comment lines within yamllint max (200)

Empirical confirmation: re-ran aicr validate --phase conformance against the live, converged UAT EKS cluster → all conformance validators pass, confirming the failures were a pre-convergence timing race.
yamllint + shellcheck run in CI via make qualify (not installed in the dev environment used here).

Risk Assessment

Low — Isolated to the UAT test harness (tests/uat); no product/runtime code paths touched; fail-closed; trivially revertable.

Rollout notes: N/A — UAT-only. Gate budget is overridable via READINESS_TIMEOUT_SECONDS.

Checklist

Tests pass locally (make test with -race) — N/A, no Go changes
Linter passes (make lint) — bash -n + YAML parse verified locally; yamllint/shellcheck run in CI (tools not present in this dev env)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (shell/YAML harness; validated against the live cluster as above)
I updated docs if user-facing behavior changed — N/A (internal UAT harness, no user-facing behavior)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

…Fast) helmfile apply returns before operator-driven convergence (ClusterPolicy, DRA rollout, prometheus-adapter APIServices), so --phase all (NVIDIA#1416) ran conformance against a not-yet-ready cluster — expected-resources timed out and accelerator/ai-service-metrics/pod-autoscaling fast-failed. Add a post-install readiness gate mirroring deploy.sh's kubectl-waits, and set failFast so a failed deployment phase doesn't yield misleading conformance failures. Confirmed: re-running conformance on the converged cluster passes.

coderabbitai · 2026-06-23T18:12:16Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f1686ab8-aeab-4091-80e8-63460be16366

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0a49c and 9bcdd79.

📒 Files selected for processing (2)

tests/uat/aws/run
tests/uat/gcp/run

📝 Walkthrough

Walkthrough

Both the AWS and GCP UAT runner scripts gain a configurable READINESS_TIMEOUT_SECONDS variable and three new functions: a generic wait_for polling helper that retries a command every 10 seconds until success or timeout, and two predicates (_clusterpolicy_ready, _apiservice_available) that use kubectl JSONPath queries to check GPU-operator ClusterPolicy state and APIService Available=True conditions. At the end of phase_install, after helmfile apply, the script now calls wait_for for all three checks before returning. Alongside this, both the AWS and GCP H100 training YAML configs add spec.validate.execution.failFast: true with documentation comments explaining that a failed deployment/readiness barrier halts the run before conformance and performance phases execute.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/aicr#1416: Modifies the same tests/uat/*/run harness to ensure deployment/readiness checks run before conformance phases, using a --phase all approach rather than explicit polling predicates.

Suggested labels

area/ci, size/M, area/tests

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: adding a readiness gate tied to cluster convergence with both the wait mechanism and failFast setting.
Description check	✅ Passed	The description is well-structured and directly related to the changeset, explaining the problem, solution, implementation details, and validation approach.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/aws/run`:
- Around line 221-226: The wait_for calls on lines 221, 223, and 225 each use
the full READINESS_TIMEOUT_SECONDS, causing the total barrier time to be
multiplied by the number of checks rather than enforcing a single overall
deadline. Calculate a single deadline timestamp before these wait_for calls by
adding READINESS_TIMEOUT_SECONDS to the current time, then modify each wait_for
invocation (_clusterpolicy_ready, _apiservice_available for both custom.metrics
and external.metrics) to pass the remaining time until that deadline instead of
the full timeout value.

In `@tests/uat/gcp/run`:
- Around line 221-226: The three consecutive wait_for calls for gpu-operator
ClusterPolicy, custom.metrics.k8s.io APIService, and external.metrics.k8s.io
APIService each use the full READINESS_TIMEOUT_SECONDS value, which causes the
total timeout budget to be multiplied rather than shared across all three
checks. Refactor this section to establish a single shared deadline at the
beginning, then pass the remaining seconds (calculated from the initial deadline
minus elapsed time) to each subsequent wait_for call to ensure the combined
readiness checks do not exceed the intended total timeout budget.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5df7add6-56a3-4b1b-887e-b8b9a525b26b

📥 Commits

Reviewing files that changed from the base of the PR and between fd8ba27 and 2c0a49c.

📒 Files selected for processing (4)

tests/uat/aws/run
tests/uat/aws/tests/h100-training-config.yaml
tests/uat/gcp/run
tests/uat/gcp/tests/h100-training-config.yaml

njhensley requested a review from a team as a code owner June 23, 2026 18:07

njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026

github-actions Bot added area/tests size/M labels Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

Comment thread tests/uat/aws/run Outdated

Comment thread tests/uat/gcp/run Outdated

fix: shared readiness deadline

9bcdd79

mchmarny approved these changes Jun 23, 2026

View reviewed changes

Merge branch 'main' into fix/uat-cluster-readiness

ec44898

mchmarny enabled auto-merge (squash) June 23, 2026 18:30

mchmarny assigned njhensley Jun 23, 2026

mchmarny merged commit 6663eb5 into NVIDIA:main Jun 23, 2026
30 checks passed

njhensley deleted the fix/uat-cluster-readiness branch June 23, 2026 19:06

This was referenced Jun 23, 2026

fix(uat): wait for nodewright tuning before validate readiness gate #1429

Merged

fix(uat): gate install on the deployment validation phase #1439

Merged

fix(uat): gate on Prometheus readiness; retry conformance metric APIs #1452

Merged

coderabbitai Bot mentioned this pull request Jun 26, 2026

fix(validator): drop single-shot ClusterPolicy check from expected-resources #1495

Merged

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(uat): gate validate on cluster convergence (readiness wait + failFast)#1426

fix(uat): gate validate on cluster convergence (readiness wait + failFast)#1426
mchmarny merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-cluster-readiness

njhensley commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhensley commented Jun 23, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 23, 2026 •

edited

Loading