Skip to content

fix(uat): wait for nodewright tuning before validate readiness gate#1429

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
njhensley:fix/uat-readiness-skyhook-tuning
Jun 23, 2026
Merged

fix(uat): wait for nodewright tuning before validate readiness gate#1429
mchmarny merged 1 commit into
NVIDIA:mainfrom
njhensley:fix/uat-readiness-skyhook-tuning

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Wait for nodewright/skyhook node tuning to reach complete before the UAT readiness gate checks the GPU/metrics stack, so aicr validate isn't handed a cluster that's mid-reboot.

Motivation / Context

Follow-up to #1426. On a fresh cluster, nodewright (skyhook) applies kernel tuning that reboots the GPU node — and it does so after gpu-operator first reports ClusterPolicy=ready. The reboot re-initializes the gpu-operator operands (their pods drop back to Init, i.e. phase Pending). The gate added in #1426 checks ClusterPolicy=ready + the metrics APIServices, which were satisfied in that transient pre-reboot window, so it passed; aicr validate then ran against the re-converging cluster and the deployment-phase expected-resources check failed on the Pending pods in gpu-operator.

Observed in a UAT run as synchronized restarts of every node-local DaemonSet (aws-node, kube-proxy, ebs-csi-node, nfd-*) at the same moment — the signature of a single node reboot.

Fixes: N/A
Related: #1426

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Other: UAT test harness (tests/uat/{aws,gcp})

Implementation Notes

  • _skyhook_complete predicate: true only when every Skyhook CR (skyhook.nvidia.com/v1alpha1, cluster-scoped) reports status.status=complete. Empty list (CR not created yet) → keep waiting; Skyhook CRD absent → the recipe has no nodewright, so skip rather than block. Mirrors the nodewright signal expected-resources itself asserts (validators/deployment/expected_resources.go).
  • Ordered first in phase_install, ahead of the ClusterPolicy / metrics-APIService waits, because the tuning reboot disrupts the GPU stack after ClusterPolicy first goes ready — gating the GPU/metrics checks before tuning completes is what let the gate pass prematurely.
  • READINESS_TIMEOUT_SECONDS 900 → 1200 (still a single budget shared across all four checks): it now spans a node reboot plus double GPU-stack convergence. Fail-closed; overridable via env.
  • No-op cost on clusters that don't reboot: the CR reaches complete immediately (or the CRD is absent → skipped), so the new wait only blocks while tuning is genuinely in flight.

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run   # shell syntax OK
shellcheck -S warning tests/uat/{aws,gcp}/run # clean
  • Predicate truth table verified: only all-CRs-complete passes; in-progress / mixed / missing-status / empty all keep waiting; set -e-safe.
  • Not validated end-to-end on a live cluster — the UAT cluster was torn down before re-test. The diagnosis rests on the observed node-reboot signature plus skyhook's documented reboot-for-tuning behavior. yamllint/shellcheck run in CI via make qualify.

Risk Assessment

  • Low — UAT test harness only (tests/uat); no product/runtime code; fail-closed; no-op when no reboot occurs; trivially revertable.

Rollout notes: N/A — UAT-only. Gate budget overridable via READINESS_TIMEOUT_SECONDS.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (make lint) — bash -n + shellcheck verified locally; yamllint/shellcheck also run in CI
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (shell harness; predicate logic validated as above)
  • I updated docs if user-facing behavior changed — N/A (internal UAT harness)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

On a fresh cluster, skyhook (nodewright) node tuning reboots the GPU
node
after gpu-operator first reports ClusterPolicy=ready, re-initializing
the
operands (pods -> Init/Pending). The readiness gate passed on that
transient
pre-reboot state, so validate raced re-convergence and
expected-resources
failed on Pending gpu-operator pods. Gate on skyhook tuning
status=complete
first, then ClusterPolicy + metrics APIServices; raise the shared budget
to
20m to span the reboot.
@njhensley njhensley requested a review from a team as a code owner June 23, 2026 20:10
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026
@mchmarny mchmarny enabled auto-merge (squash) June 23, 2026 20:11
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Both tests/uat/aws/run and tests/uat/gcp/run receive identical three-part modifications. READINESS_TIMEOUT_SECONDS is increased (from 15 to 20 minutes on GCP; similarly on AWS) with updated inline comments. A new _skyhook_complete function is added to each script: it returns success immediately when the Skyhook CRD is absent, otherwise polls all Skyhook CR .status.status values and succeeds only when all are complete. The post-install readiness barrier is reworked to compute a single shared deadline and calls wait_for _skyhook_complete first, before the existing ClusterPolicy and metrics APIService checks consume the remaining budget.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/aicr#1416: Modifies UAT runner workflows to execute aicr validate --phase all, directly exercising the phase_install readiness gating that this PR extends with the _skyhook_complete predicate.
  • NVIDIA/aicr#1426: Modifies the same tests/uat/{aws,gcp}/run files to add post-helmfile apply readiness polling for ClusterPolicy and metrics APIService — the exact barrier that this PR extends with Skyhook checking.

Suggested labels

size/M, area/tests

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(uat): wait for nodewright tuning before validate readiness gate' clearly and concisely describes the main change—adding a wait for nodewright/skyhook tuning completion before the readiness gate validation checks.
Description check ✅ Passed The description is thorough and directly related to the changeset, explaining the root cause, the fix, implementation details, testing approach, and risk assessment—all aligned with the code changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/uat/aws/run`:
- Line 72: Locate the GitHub Actions workflow file that invokes the AWS install
step (likely in .github/workflows) and increase the `timeout-minutes` parameter
for that step. The timeout must be recalculated to accommodate the total time
needed: HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS (now set to 1200
seconds or 20 minutes on line 72 of the run script) plus additional buffer for
retries and setup operations. Since the current 25-minute limit is insufficient
for this expanded readiness budget, increase the timeout-minutes value to ensure
the script's fail-closed timeout/error handling can execute before GitHub
Actions terminates the workflow step prematurely.
- Around line 120-124: The kubectl command on line 120 currently treats all
failures (including transient API, RBAC, or discovery errors) as if the CRD
doesn't exist, causing the function to return 0 and skip the Skyhook gate check.
Instead of blindly returning 0 on any failure, distinguish between a genuine
"CRD not found" error and other transient failures. Modify the kubectl get crd
command to capture the error output and check specifically for NotFound or
resource not found indicators before returning 0; for all other error
conditions, allow the script to continue waiting rather than skipping the gate,
ensuring fail-closed behavior on transient errors.

In `@tests/uat/gcp/run`:
- Line 72: The GCP install workflow step timeout-minutes is set to 25 minutes
total, but the READINESS_TIMEOUT_SECONDS on line 72 alone is now 1200 seconds
(20 minutes), leaving insufficient buffer for the overall install process. Find
the GitHub Actions workflow file (likely a YAML file defining the GCP install
step) and increase the timeout-minutes value for that step to accommodate the
sum of HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS plus additional
time for retry logic and setup activities. Ensure the workflow timeout is large
enough that GitHub Actions will not prematurely kill the script before its own
fail-closed timeout mechanism can trigger.
- Around line 120-124: The kubectl get crd command on line 120 suppresses all
errors with output redirection, treating transient API/RBAC/discovery errors the
same as a genuinely missing CRD. Instead of returning 0 for any failure, capture
the actual error output and only return 0 when the CRD is genuinely absent
(NotFound error). For other errors like API server issues or RBAC problems,
allow the function to fail closed by not suppressing the error or returning
early, ensuring the gate properly waits or fails rather than skipping the
Skyhook validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 0fe98af7-9064-4ce6-9f80-f93b08631a56

📥 Commits

Reviewing files that changed from the base of the PR and between 6663eb5 and 3dc5676.

📒 Files selected for processing (2)
  • tests/uat/aws/run
  • tests/uat/gcp/run

Comment thread tests/uat/aws/run
READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min
# DCGM exporter) and prometheus-adapter APIService aggregation. Generous; the
# gate fails closed if exceeded.
READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Align the workflow step timeout with the expanded readiness budget.

Line 72 allows the readiness gate alone to consume 20 minutes, but the AWS install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/aws/run` at line 72, Locate the GitHub Actions workflow file that
invokes the AWS install step (likely in .github/workflows) and increase the
`timeout-minutes` parameter for that step. The timeout must be recalculated to
accommodate the total time needed: HELMFILE_TIMEOUT_SECONDS plus
READINESS_TIMEOUT_SECONDS (now set to 1200 seconds or 20 minutes on line 72 of
the run script) plus additional buffer for retries and setup operations. Since
the current 25-minute limit is insufficient for this expanded readiness budget,
increase the timeout-minutes value to ensure the script's fail-closed
timeout/error handling can execute before GitHub Actions terminates the workflow
step prematurely.

Comment thread tests/uat/aws/run
Comment on lines +120 to +124
kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
local states
states="$(kubectl get skyhooks.skyhook.nvidia.com \
-o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
[[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail closed on CRD discovery errors other than NotFound.

Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.

Suggested fix
 _skyhook_complete() {
-  kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+  local crd_err
+  crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || {
+    [[ "$crd_err" == *"NotFound"* ]] && return 0
+    return 1
+  }
   local states
   states="$(kubectl get skyhooks.skyhook.nvidia.com \
-    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1
   [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/aws/run` around lines 120 - 124, The kubectl command on line 120
currently treats all failures (including transient API, RBAC, or discovery
errors) as if the CRD doesn't exist, causing the function to return 0 and skip
the Skyhook gate check. Instead of blindly returning 0 on any failure,
distinguish between a genuine "CRD not found" error and other transient
failures. Modify the kubectl get crd command to capture the error output and
check specifically for NotFound or resource not found indicators before
returning 0; for all other error conditions, allow the script to continue
waiting rather than skipping the gate, ensuring fail-closed behavior on
transient errors.

Comment thread tests/uat/gcp/run
READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min
# and prometheus-adapter APIService aggregation. Generous; the gate fails closed
# if exceeded.
READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Align the workflow step timeout with the expanded readiness budget.

Line 72 allows the readiness gate alone to consume 20 minutes, but the GCP install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/gcp/run` at line 72, The GCP install workflow step timeout-minutes
is set to 25 minutes total, but the READINESS_TIMEOUT_SECONDS on line 72 alone
is now 1200 seconds (20 minutes), leaving insufficient buffer for the overall
install process. Find the GitHub Actions workflow file (likely a YAML file
defining the GCP install step) and increase the timeout-minutes value for that
step to accommodate the sum of HELMFILE_TIMEOUT_SECONDS plus
READINESS_TIMEOUT_SECONDS plus additional time for retry logic and setup
activities. Ensure the workflow timeout is large enough that GitHub Actions will
not prematurely kill the script before its own fail-closed timeout mechanism can
trigger.

Comment thread tests/uat/gcp/run
Comment on lines +120 to +124
kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
local states
states="$(kubectl get skyhooks.skyhook.nvidia.com \
-o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
[[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Fail closed on CRD discovery errors other than NotFound.

Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.

Suggested fix
 _skyhook_complete() {
-  kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+  local crd_err
+  crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || {
+    [[ "$crd_err" == *"NotFound"* ]] && return 0
+    return 1
+  }
   local states
   states="$(kubectl get skyhooks.skyhook.nvidia.com \
-    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+    -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1
   [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/gcp/run` around lines 120 - 124, The kubectl get crd command on
line 120 suppresses all errors with output redirection, treating transient
API/RBAC/discovery errors the same as a genuinely missing CRD. Instead of
returning 0 for any failure, capture the actual error output and only return 0
when the CRD is genuinely absent (NotFound error). For other errors like API
server issues or RBAC problems, allow the function to fail closed by not
suppressing the error or returning early, ensuring the gate properly waits or
fails rather than skipping the Skyhook validation.

@mchmarny mchmarny merged commit 58721c5 into NVIDIA:main Jun 23, 2026
32 of 33 checks passed
@njhensley njhensley deleted the fix/uat-readiness-skyhook-tuning branch June 23, 2026 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/tests size/M theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants