Skip to content

fix(uat): gate validate on cluster convergence (readiness wait + failFast)#1426

Merged
mchmarny merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-cluster-readiness
Jun 23, 2026
Merged

fix(uat): gate validate on cluster convergence (readiness wait + failFast)#1426
mchmarny merged 3 commits into
NVIDIA:mainfrom
njhensley:fix/uat-cluster-readiness

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Gate the UAT validate step on cluster convergence: add a post-helmfile apply readiness wait to the UAT runners and set failFast so a failed deployment phase doesn't yield misleading conformance failures.

Motivation / Context

helmfile apply returns once each release's own objects are Ready, but not after operator-driven convergence (gpu-operator ClusterPolicy → DCGM exporter; prometheus-adapter's custom/external metrics APIServices). The helm deploy.sh path kubectl waits on these; the helmfile path does not. When the UAT validate step moved to --phase all (#1416), the deployment phase (expected-resources) timed out and the conformance checks (accelerator-metrics, ai-service-metrics, pod-autoscaling) fast-failed — all because validation ran against a not-yet-converged cluster, not because of any check, recipe, or networking defect.

Confirmed empirically: re-running aicr validate --phase conformance against the converged UAT cluster passes all conformance validators. (An initial hypothesis that the multi-SG cluster topology blocked cross-subnet Prometheus/DCGM traffic was investigated and ruled out — the dials work once the stack is up.)

Fixes: N/A
Related: #1425

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Other: UAT test harness (tests/uat/{aws,gcp})

Implementation Notes

  • Readiness gate (tests/uat/{aws,gcp}/run): a wait_for poll helper blocks at the end of phase_install until ClusterPolicy reaches state=ready (⇒ GPU stack incl. DCGM exporter rolled out) and both v1beta1.custom.metrics.k8s.io / v1beta1.external.metrics.k8s.io APIServices are Available (⇒ prometheus-adapter serving). Predicates are fail-closed: an absent resource or unset status keeps polling, and the gate exits non-zero with a diagnostic on timeout (READINESS_TIMEOUT_SECONDS, default 15m) so a cluster that never converges fails the install loudly rather than handing a not-ready cluster to validate.
  • failFast: true (tests/uat/{aws,gcp}/tests/h100-training-config.yaml, spec.validate.execution.failFast): on the rare path where the deployment phase still fails after the gate, conformance/performance are skipped instead of producing misleading failures. Mirrors the existing precedent in tests/uat/chainsaw-config.yaml.
  • Chose ClusterPolicy + the two metrics APIServices as the gate signals because they were the actual failure surface; the deployment-phase validator (now gated by failFast) remains the comprehensive backstop, and the gate is easy to extend if DRA-rollout / nodewright ever prove to be the long pole.
  • Deliberately not changed: the kube-prometheus-stack co-location selector (the Prometheus pod still carries app.kubernetes.io/name=prometheus on chart 84.4.0 — verified on the live cluster) and any validator Go code (the single-shot-probe hardening is tracked separately as Conformance validators: bounded retries on single-shot metrics-API probes #1425).

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run   # shell syntax OK
# YAML parses; spec.validate.execution.failFast=true present; comment lines within yamllint max (200)
  • Empirical confirmation: re-ran aicr validate --phase conformance against the live, converged UAT EKS cluster → all conformance validators pass, confirming the failures were a pre-convergence timing race.
  • yamllint + shellcheck run in CI via make qualify (not installed in the dev environment used here).

Risk Assessment

  • Low — Isolated to the UAT test harness (tests/uat); no product/runtime code paths touched; fail-closed; trivially revertable.

Rollout notes: N/A — UAT-only. Gate budget is overridable via READINESS_TIMEOUT_SECONDS.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (make lint) — bash -n + YAML parse verified locally; yamllint/shellcheck run in CI (tools not present in this dev env)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (shell/YAML harness; validated against the live cluster as above)
  • I updated docs if user-facing behavior changed — N/A (internal UAT harness, no user-facing behavior)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…Fast)

helmfile apply returns before operator-driven convergence
(ClusterPolicy,
DRA rollout, prometheus-adapter APIServices), so --phase all (NVIDIA#1416) ran
conformance against a not-yet-ready cluster — expected-resources timed
out
and accelerator/ai-service-metrics/pod-autoscaling fast-failed. Add a
post-install readiness gate mirroring deploy.sh's kubectl-waits, and set
failFast so a failed deployment phase doesn't yield misleading
conformance
failures. Confirmed: re-running conformance on the converged cluster
passes.
@njhensley njhensley requested a review from a team as a code owner June 23, 2026 18:07
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: f1686ab8-aeab-4091-80e8-63460be16366

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0a49c and 9bcdd79.

📒 Files selected for processing (2)
  • tests/uat/aws/run
  • tests/uat/gcp/run

📝 Walkthrough

Walkthrough

Both the AWS and GCP UAT runner scripts gain a configurable READINESS_TIMEOUT_SECONDS variable and three new functions: a generic wait_for polling helper that retries a command every 10 seconds until success or timeout, and two predicates (_clusterpolicy_ready, _apiservice_available) that use kubectl JSONPath queries to check GPU-operator ClusterPolicy state and APIService Available=True conditions. At the end of phase_install, after helmfile apply, the script now calls wait_for for all three checks before returning. Alongside this, both the AWS and GCP H100 training YAML configs add spec.validate.execution.failFast: true with documentation comments explaining that a failed deployment/readiness barrier halts the run before conformance and performance phases execute.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/aicr#1416: Modifies the same tests/uat/*/run harness to ensure deployment/readiness checks run before conformance phases, using a --phase all approach rather than explicit polling predicates.

Suggested labels

area/ci, size/M, area/tests

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a readiness gate tied to cluster convergence with both the wait mechanism and failFast setting.
Description check ✅ Passed The description is well-structured and directly related to the changeset, explaining the problem, solution, implementation details, and validation approach.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/aws/run`:
- Around line 221-226: The wait_for calls on lines 221, 223, and 225 each use
the full READINESS_TIMEOUT_SECONDS, causing the total barrier time to be
multiplied by the number of checks rather than enforcing a single overall
deadline. Calculate a single deadline timestamp before these wait_for calls by
adding READINESS_TIMEOUT_SECONDS to the current time, then modify each wait_for
invocation (_clusterpolicy_ready, _apiservice_available for both custom.metrics
and external.metrics) to pass the remaining time until that deadline instead of
the full timeout value.

In `@tests/uat/gcp/run`:
- Around line 221-226: The three consecutive wait_for calls for gpu-operator
ClusterPolicy, custom.metrics.k8s.io APIService, and external.metrics.k8s.io
APIService each use the full READINESS_TIMEOUT_SECONDS value, which causes the
total timeout budget to be multiplied rather than shared across all three
checks. Refactor this section to establish a single shared deadline at the
beginning, then pass the remaining seconds (calculated from the initial deadline
minus elapsed time) to each subsequent wait_for call to ensure the combined
readiness checks do not exceed the intended total timeout budget.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5df7add6-56a3-4b1b-887e-b8b9a525b26b

📥 Commits

Reviewing files that changed from the base of the PR and between fd8ba27 and 2c0a49c.

📒 Files selected for processing (4)
  • tests/uat/aws/run
  • tests/uat/aws/tests/h100-training-config.yaml
  • tests/uat/gcp/run
  • tests/uat/gcp/tests/h100-training-config.yaml

Comment thread tests/uat/aws/run Outdated
Comment thread tests/uat/gcp/run Outdated
@mchmarny mchmarny enabled auto-merge (squash) June 23, 2026 18:30
@mchmarny mchmarny merged commit 6663eb5 into NVIDIA:main Jun 23, 2026
30 checks passed
@njhensley njhensley deleted the fix/uat-cluster-readiness branch June 23, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tests size/M theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants