fix(uat): wait for nodewright tuning before validate readiness gate#1429
Conversation
On a fresh cluster, skyhook (nodewright) node tuning reboots the GPU node after gpu-operator first reports ClusterPolicy=ready, re-initializing the operands (pods -> Init/Pending). The readiness gate passed on that transient pre-reboot state, so validate raced re-convergence and expected-resources failed on Pending gpu-operator pods. Gate on skyhook tuning status=complete first, then ClusterPolicy + metrics APIServices; raise the shared budget to 20m to span the reboot.
📝 WalkthroughWalkthroughBoth Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/uat/aws/run`:
- Line 72: Locate the GitHub Actions workflow file that invokes the AWS install
step (likely in .github/workflows) and increase the `timeout-minutes` parameter
for that step. The timeout must be recalculated to accommodate the total time
needed: HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS (now set to 1200
seconds or 20 minutes on line 72 of the run script) plus additional buffer for
retries and setup operations. Since the current 25-minute limit is insufficient
for this expanded readiness budget, increase the timeout-minutes value to ensure
the script's fail-closed timeout/error handling can execute before GitHub
Actions terminates the workflow step prematurely.
- Around line 120-124: The kubectl command on line 120 currently treats all
failures (including transient API, RBAC, or discovery errors) as if the CRD
doesn't exist, causing the function to return 0 and skip the Skyhook gate check.
Instead of blindly returning 0 on any failure, distinguish between a genuine
"CRD not found" error and other transient failures. Modify the kubectl get crd
command to capture the error output and check specifically for NotFound or
resource not found indicators before returning 0; for all other error
conditions, allow the script to continue waiting rather than skipping the gate,
ensuring fail-closed behavior on transient errors.
In `@tests/uat/gcp/run`:
- Line 72: The GCP install workflow step timeout-minutes is set to 25 minutes
total, but the READINESS_TIMEOUT_SECONDS on line 72 alone is now 1200 seconds
(20 minutes), leaving insufficient buffer for the overall install process. Find
the GitHub Actions workflow file (likely a YAML file defining the GCP install
step) and increase the timeout-minutes value for that step to accommodate the
sum of HELMFILE_TIMEOUT_SECONDS plus READINESS_TIMEOUT_SECONDS plus additional
time for retry logic and setup activities. Ensure the workflow timeout is large
enough that GitHub Actions will not prematurely kill the script before its own
fail-closed timeout mechanism can trigger.
- Around line 120-124: The kubectl get crd command on line 120 suppresses all
errors with output redirection, treating transient API/RBAC/discovery errors the
same as a genuinely missing CRD. Instead of returning 0 for any failure, capture
the actual error output and only return 0 when the CRD is genuinely absent
(NotFound error). For other errors like API server issues or RBAC problems,
allow the function to fail closed by not suppressing the error or returning
early, ensuring the gate properly waits or fails rather than skipping the
Skyhook validation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 0fe98af7-9064-4ce6-9f80-f93b08631a56
📒 Files selected for processing (2)
tests/uat/aws/runtests/uat/gcp/run
| READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min | ||
| # DCGM exporter) and prometheus-adapter APIService aggregation. Generous; the | ||
| # gate fails closed if exceeded. | ||
| READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Align the workflow step timeout with the expanded readiness budget.
Line 72 allows the readiness gate alone to consume 20 minutes, but the AWS install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/aws/run` at line 72, Locate the GitHub Actions workflow file that
invokes the AWS install step (likely in .github/workflows) and increase the
`timeout-minutes` parameter for that step. The timeout must be recalculated to
accommodate the total time needed: HELMFILE_TIMEOUT_SECONDS plus
READINESS_TIMEOUT_SECONDS (now set to 1200 seconds or 20 minutes on line 72 of
the run script) plus additional buffer for retries and setup operations. Since
the current 25-minute limit is insufficient for this expanded readiness budget,
increase the timeout-minutes value to ensure the script's fail-closed
timeout/error handling can execute before GitHub Actions terminates the workflow
step prematurely.
| kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0 | ||
| local states | ||
| states="$(kubectl get skyhooks.skyhook.nvidia.com \ | ||
| -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" | ||
| [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fail closed on CRD discovery errors other than NotFound.
Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.
Suggested fix
_skyhook_complete() {
- kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+ local crd_err
+ crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || {
+ [[ "$crd_err" == *"NotFound"* ]] && return 0
+ return 1
+ }
local states
states="$(kubectl get skyhooks.skyhook.nvidia.com \
- -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+ -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1
[[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/aws/run` around lines 120 - 124, The kubectl command on line 120
currently treats all failures (including transient API, RBAC, or discovery
errors) as if the CRD doesn't exist, causing the function to return 0 and skip
the Skyhook gate check. Instead of blindly returning 0 on any failure,
distinguish between a genuine "CRD not found" error and other transient
failures. Modify the kubectl get crd command to capture the error output and
check specifically for NotFound or resource not found indicators before
returning 0; for all other error conditions, allow the script to continue
waiting rather than skipping the gate, ensuring fail-closed behavior on
transient errors.
| READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-900}" # 15 min | ||
| # and prometheus-adapter APIService aggregation. Generous; the gate fails closed | ||
| # if exceeded. | ||
| READINESS_TIMEOUT_SECONDS="${READINESS_TIMEOUT_SECONDS:-1200}" # 20 min |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Align the workflow step timeout with the expanded readiness budget.
Line 72 allows the readiness gate alone to consume 20 minutes, but the GCP install workflow step is still capped at 25 minutes total, including plugin install and helmfile apply attempts. A slow-but-healthy install can now be killed by GitHub Actions before this script’s fail-closed timeout/error path runs; raise the workflow timeout-minutes to cover HELMFILE_TIMEOUT_SECONDS + READINESS_TIMEOUT_SECONDS + retry/setup slack.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/gcp/run` at line 72, The GCP install workflow step timeout-minutes
is set to 25 minutes total, but the READINESS_TIMEOUT_SECONDS on line 72 alone
is now 1200 seconds (20 minutes), leaving insufficient buffer for the overall
install process. Find the GitHub Actions workflow file (likely a YAML file
defining the GCP install step) and increase the timeout-minutes value for that
step to accommodate the sum of HELMFILE_TIMEOUT_SECONDS plus
READINESS_TIMEOUT_SECONDS plus additional time for retry logic and setup
activities. Ensure the workflow timeout is large enough that GitHub Actions will
not prematurely kill the script before its own fail-closed timeout mechanism can
trigger.
| kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0 | ||
| local states | ||
| states="$(kubectl get skyhooks.skyhook.nvidia.com \ | ||
| -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" | ||
| [[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states" |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Fail closed on CRD discovery errors other than NotFound.
Line 120 treats every kubectl get crd failure as “no nodewright”, so a transient API/RBAC/discovery error can skip the Skyhook gate and reintroduce the pre-reboot readiness race. Mirror the validator contract: skip only when the CRD is genuinely absent; otherwise keep waiting/fail closed.
Suggested fix
_skyhook_complete() {
- kubectl get crd skyhooks.skyhook.nvidia.com >/dev/null 2>&1 || return 0
+ local crd_err
+ crd_err="$(kubectl get crd skyhooks.skyhook.nvidia.com 2>&1 >/dev/null)" || {
+ [[ "$crd_err" == *"NotFound"* ]] && return 0
+ return 1
+ }
local states
states="$(kubectl get skyhooks.skyhook.nvidia.com \
- -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)"
+ -o jsonpath='{range .items[*]}{.status.status}{"\n"}{end}' 2>/dev/null)" || return 1
[[ -n "$states" ]] && ! grep -qxv 'complete' <<<"$states"
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/gcp/run` around lines 120 - 124, The kubectl get crd command on
line 120 suppresses all errors with output redirection, treating transient
API/RBAC/discovery errors the same as a genuinely missing CRD. Instead of
returning 0 for any failure, capture the actual error output and only return 0
when the CRD is genuinely absent (NotFound error). For other errors like API
server issues or RBAC problems, allow the function to fail closed by not
suppressing the error or returning early, ensuring the gate properly waits or
fails rather than skipping the Skyhook validation.
Summary
Wait for nodewright/skyhook node tuning to reach
completebefore the UAT readiness gate checks the GPU/metrics stack, soaicr validateisn't handed a cluster that's mid-reboot.Motivation / Context
Follow-up to #1426. On a fresh cluster, nodewright (skyhook) applies kernel tuning that reboots the GPU node — and it does so after gpu-operator first reports
ClusterPolicy=ready. The reboot re-initializes the gpu-operator operands (their pods drop back toInit, i.e. phasePending). The gate added in #1426 checksClusterPolicy=ready+ the metrics APIServices, which were satisfied in that transient pre-reboot window, so it passed;aicr validatethen ran against the re-converging cluster and the deployment-phaseexpected-resourcescheck failed on thePendingpods ingpu-operator.Observed in a UAT run as synchronized restarts of every node-local DaemonSet (
aws-node,kube-proxy,ebs-csi-node,nfd-*) at the same moment — the signature of a single node reboot.Fixes: N/A
Related: #1426
Type of Change
Component(s) Affected
tests/uat/{aws,gcp})Implementation Notes
_skyhook_completepredicate: true only when everySkyhookCR (skyhook.nvidia.com/v1alpha1, cluster-scoped) reportsstatus.status=complete. Empty list (CR not created yet) → keep waiting; Skyhook CRD absent → the recipe has no nodewright, so skip rather than block. Mirrors the nodewright signalexpected-resourcesitself asserts (validators/deployment/expected_resources.go).phase_install, ahead of theClusterPolicy/ metrics-APIService waits, because the tuning reboot disrupts the GPU stack afterClusterPolicyfirst goes ready — gating the GPU/metrics checks before tuning completes is what let the gate pass prematurely.READINESS_TIMEOUT_SECONDS900 → 1200 (still a single budget shared across all four checks): it now spans a node reboot plus double GPU-stack convergence. Fail-closed; overridable via env.completeimmediately (or the CRD is absent → skipped), so the new wait only blocks while tuning is genuinely in flight.Testing
completepasses; in-progress / mixed / missing-status / empty all keep waiting;set -e-safe.make qualify.Risk Assessment
tests/uat); no product/runtime code; fail-closed; no-op when no reboot occurs; trivially revertable.Rollout notes: N/A — UAT-only. Gate budget overridable via
READINESS_TIMEOUT_SECONDS.Checklist
make testwith-race) — N/A, no Go changesmake lint) —bash -n+ shellcheck verified locally; yamllint/shellcheck also run in CIgit commit -S)