fix(uat): wait for nodewright tuning before validate readiness gate by njhensley · Pull Request #1429 · NVIDIA/aicr

njhensley · 2026-06-23T20:10:20Z

Summary

Wait for nodewright/skyhook node tuning to reach complete before the UAT readiness gate checks the GPU/metrics stack, so aicr validate isn't handed a cluster that's mid-reboot.

Motivation / Context

Follow-up to #1426. On a fresh cluster, nodewright (skyhook) applies kernel tuning that reboots the GPU node — and it does so after gpu-operator first reports ClusterPolicy=ready. The reboot re-initializes the gpu-operator operands (their pods drop back to Init, i.e. phase Pending). The gate added in #1426 checks ClusterPolicy=ready + the metrics APIServices, which were satisfied in that transient pre-reboot window, so it passed; aicr validate then ran against the re-converging cluster and the deployment-phase expected-resources check failed on the Pending pods in gpu-operator.

Observed in a UAT run as synchronized restarts of every node-local DaemonSet (aws-node, kube-proxy, ebs-csi-node, nfd-*) at the same moment — the signature of a single node reboot.

Fixes: N/A
Related: #1426

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Other: UAT test harness (tests/uat/{aws,gcp})

Implementation Notes

_skyhook_complete predicate: true only when every Skyhook CR (skyhook.nvidia.com/v1alpha1, cluster-scoped) reports status.status=complete. Empty list (CR not created yet) → keep waiting; Skyhook CRD absent → the recipe has no nodewright, so skip rather than block. Mirrors the nodewright signal expected-resources itself asserts (validators/deployment/expected_resources.go).
Ordered first in phase_install, ahead of the ClusterPolicy / metrics-APIService waits, because the tuning reboot disrupts the GPU stack after ClusterPolicy first goes ready — gating the GPU/metrics checks before tuning completes is what let the gate pass prematurely.
READINESS_TIMEOUT_SECONDS 900 → 1200 (still a single budget shared across all four checks): it now spans a node reboot plus double GPU-stack convergence. Fail-closed; overridable via env.
No-op cost on clusters that don't reboot: the CR reaches complete immediately (or the CRD is absent → skipped), so the new wait only blocks while tuning is genuinely in flight.

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run   # shell syntax OK
shellcheck -S warning tests/uat/{aws,gcp}/run # clean

Predicate truth table verified: only all-CRs-complete passes; in-progress / mixed / missing-status / empty all keep waiting; set -e-safe.
Not validated end-to-end on a live cluster — the UAT cluster was torn down before re-test. The diagnosis rests on the observed node-reboot signature plus skyhook's documented reboot-for-tuning behavior. yamllint/shellcheck run in CI via make qualify.

Risk Assessment

Low — UAT test harness only (tests/uat); no product/runtime code; fail-closed; no-op when no reboot occurs; trivially revertable.

Rollout notes: N/A — UAT-only. Gate budget overridable via READINESS_TIMEOUT_SECONDS.

Checklist

Tests pass locally (make test with -race) — N/A, no Go changes
Linter passes (make lint) — bash -n + shellcheck verified locally; yamllint/shellcheck also run in CI
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (shell harness; predicate logic validated as above)
I updated docs if user-facing behavior changed — N/A (internal UAT harness)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

On a fresh cluster, skyhook (nodewright) node tuning reboots the GPU node after gpu-operator first reports ClusterPolicy=ready, re-initializing the operands (pods -> Init/Pending). The readiness gate passed on that transient pre-reboot state, so validate raced re-convergence and expected-resources failed on Pending gpu-operator pods. Gate on skyhook tuning status=complete first, then ClusterPolicy + metrics APIServices; raise the shared budget to 20m to span the reboot.

coderabbitai · 2026-06-23T20:15:57Z

📝 Walkthrough

Walkthrough

Both tests/uat/aws/run and tests/uat/gcp/run receive identical three-part modifications. READINESS_TIMEOUT_SECONDS is increased (from 15 to 20 minutes on GCP; similarly on AWS) with updated inline comments. A new _skyhook_complete function is added to each script: it returns success immediately when the Skyhook CRD is absent, otherwise polls all Skyhook CR .status.status values and succeeds only when all are complete. The post-install readiness barrier is reworked to compute a single shared deadline and calls wait_for _skyhook_complete first, before the existing ClusterPolicy and metrics APIService checks consume the remaining budget.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/aicr#1416: Modifies UAT runner workflows to execute aicr validate --phase all, directly exercising the phase_install readiness gating that this PR extends with the _skyhook_complete predicate.
NVIDIA/aicr#1426: Modifies the same tests/uat/{aws,gcp}/run files to add post-helmfile apply readiness polling for ClusterPolicy and metrics APIService — the exact barrier that this PR extends with Skyhook checking.

Suggested labels

size/M, area/tests

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(uat): wait for nodewright tuning before validate readiness gate' clearly and concisely describes the main change—adding a wait for nodewright/skyhook tuning completion before the readiness gate validation checks.
Description check	✅ Passed	The description is thorough and directly related to the changeset, explaining the root cause, the fix, implementation details, testing approach, and risk assessment—all aligned with the code changes.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/aws/run`:
- Line 72: Locate the GitHub Actions workflow file that invokes the AWS install
step (likely in .github/workflows) and increase the `timeout-minutes` parameter
for that step. The timeout must be recalculated to accommodate the total time
needed: HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS (now set to 1200
seconds or 20 minutes on line 72 of the run script) plus additional buffer for
retries and setup operations. Since the current 25-minute limit is insufficient
for this expanded readiness budget, increase the timeout-minutes value to ensure
the script's fail-closed timeout/error handling can execute before GitHub
Actions terminates the workflow step prematurely.
- Around line 120-124: The kubectl command on line 120 currently treats all
failures (including transient API, RBAC, or discovery errors) as if the CRD
doesn't exist, causing the function to return 0 and skip the Skyhook gate check.
Instead of blindly returning 0 on any failure, distinguish between a genuine
"CRD not found" error and other transient failures. Modify the kubectl get crd
command to capture the error output and check specifically for NotFound or
resource not found indicators before returning 0; for all other error
conditions, allow the script to continue waiting rather than skipping the gate,
ensuring fail-closed behavior on transient errors.

In `@tests/uat/gcp/run`:
- Line 72: The GCP install workflow step timeout-minutes is set to 25 minutes
total, but the READINESS_TIMEOUT_SECONDS on line 72 alone is now 1200 seconds
(20 minutes), leaving insufficient buffer for the overall install process. Find
the GitHub Actions workflow file (likely a YAML file defining the GCP install
step) and increase the timeout-minutes value for that step to accommodate the
sum of HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS plus additional
time for retry logic and setup activities. Ensure the workflow timeout is large
enough that GitHub Actions will not prematurely kill the script before its own
fail-closed timeout mechanism can trigger.
- Around line 120-124: The kubectl get crd command on line 120 suppresses all
errors with output redirection, treating transient API/RBAC/discovery errors the
same as a genuinely missing CRD. Instead of returning 0 for any failure, capture
the actual error output and only return 0 when the CRD is genuinely absent
(NotFound error). For other errors like API server issues or RBAC problems,
allow the function to fail closed by not suppressing the error or returning
early, ensuring the gate properly waits or fails rather than skipping the
Skyhook validation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0fe98af7-9064-4ce6-9f80-f93b08631a56

📥 Commits

Reviewing files that changed from the base of the PR and between 6663eb5 and 3dc5676.

📒 Files selected for processing (2)

tests/uat/aws/run
tests/uat/gcp/run

coderabbitai · 2026-06-23T20:16:00Z

-READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min
+# DCGM exporter) and prometheus-adapter APIService aggregation. Generous; the
+# gate fails closed if exceeded.
+READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Align the workflow step timeout with the expanded readiness budget.

Line 72 allows the readiness gate alone to consume 20 minutes, but the AWS install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/aws/run` at line 72, Locate the GitHub Actions workflow file that invokes the AWS install step (likely in .github/workflows) and increase the `timeout-minutes` parameter for that step. The timeout must be recalculated to accommodate the total time needed: HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS (now set to 1200 seconds or 20 minutes on line 72 of the run script) plus additional buffer for retries and setup operations. Since the current 25-minute limit is insufficient for this expanded readiness budget, increase the timeout-minutes value to ensure the script's fail-closed timeout/error handling can execute before GitHub Actions terminates the workflow step prematurely.

coderabbitai · 2026-06-23T20:16:00Z

+  kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+  local states
+  states="$(kubectl get skyhooks.skyhook.nvidia.com \
+    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+  [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail closed on CRD discovery errors other than NotFound.

Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.

Suggested fix

_skyhook_complete() { - kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0 + local crd_err + crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || { + [[ "$crd_err" == *"NotFound"* ]] && return 0 + return 1 + } local states states="$(kubectl get skyhooks.skyhook.nvidia.com \ - -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" + -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1 [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states" }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/aws/run` around lines 120 - 124, The kubectl command on line 120 currently treats all failures (including transient API, RBAC, or discovery errors) as if the CRD doesn't exist, causing the function to return 0 and skip the Skyhook gate check. Instead of blindly returning 0 on any failure, distinguish between a genuine "CRD not found" error and other transient failures. Modify the kubectl get crd command to capture the error output and check specifically for NotFound or resource not found indicators before returning 0; for all other error conditions, allow the script to continue waiting rather than skipping the gate, ensuring fail-closed behavior on transient errors.

coderabbitai · 2026-06-23T20:16:00Z

-READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min
+# and prometheus-adapter APIService aggregation. Generous; the gate fails closed
+# if exceeded.
+READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Align the workflow step timeout with the expanded readiness budget.

Line 72 allows the readiness gate alone to consume 20 minutes, but the GCP install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/gcp/run` at line 72, The GCP install workflow step timeout-minutes is set to 25 minutes total, but the READINESS_TIMEOUT_SECONDS on line 72 alone is now 1200 seconds (20 minutes), leaving insufficient buffer for the overall install process. Find the GitHub Actions workflow file (likely a YAML file defining the GCP install step) and increase the timeout-minutes value for that step to accommodate the sum of HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS plus additional time for retry logic and setup activities. Ensure the workflow timeout is large enough that GitHub Actions will not prematurely kill the script before its own fail-closed timeout mechanism can trigger.

coderabbitai · 2026-06-23T20:16:00Z

+  kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+  local states
+  states="$(kubectl get skyhooks.skyhook.nvidia.com \
+    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+  [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail closed on CRD discovery errors other than NotFound.

Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.

Suggested fix

_skyhook_complete() { - kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0 + local crd_err + crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || { + [[ "$crd_err" == *"NotFound"* ]] && return 0 + return 1 + } local states states="$(kubectl get skyhooks.skyhook.nvidia.com \ - -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" + -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1 [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states" }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/uat/gcp/run` around lines 120 - 124, The kubectl get crd command on line 120 suppresses all errors with output redirection, treating transient API/RBAC/discovery errors the same as a genuinely missing CRD. Instead of returning 0 for any failure, capture the actual error output and only return 0 when the CRD is genuinely absent (NotFound error). For other errors like API server issues or RBAC problems, allow the function to fail closed by not suppressing the error or returning early, ensuring the gate properly waits or fails rather than skipping the Skyhook validation.

njhensley requested a review from a team as a code owner June 23, 2026 20:10

njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026

github-actions Bot added area/tests size/M labels Jun 23, 2026

mchmarny approved these changes Jun 23, 2026

View reviewed changes

mchmarny enabled auto-merge (squash) June 23, 2026 20:11

mchmarny assigned njhensley Jun 23, 2026

coderabbitai Bot reviewed Jun 23, 2026

View reviewed changes

mchmarny merged commit 58721c5 into NVIDIA:main Jun 23, 2026
32 of 33 checks passed

njhensley deleted the fix/uat-readiness-skyhook-tuning branch June 23, 2026 20:21

This was referenced Jun 23, 2026

fix(uat): gate install on the deployment validation phase #1439

Merged

fix(uat): gate on Prometheus readiness; retry conformance metric APIs #1452

Merged

coderabbitai Bot mentioned this pull request Jun 26, 2026

fix(validator): drop single-shot ClusterPolicy check from expected-resources #1495

Merged

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(uat): wait for nodewright tuning before validate readiness gate#1429

fix(uat): wait for nodewright tuning before validate readiness gate#1429
mchmarny merged 1 commit into
NVIDIA:mainfrom
njhensley:fix/uat-readiness-skyhook-tuning

njhensley commented Jun 23, 2026

Uh oh!

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Uh oh!

coderabbitai Bot Jun 23, 2026

Uh oh!

coderabbitai Bot Jun 23, 2026

Uh oh!

coderabbitai Bot Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhensley commented Jun 23, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 23, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants