fix(uat): gate install on the deployment validation phase#1439
Conversation
…gnals The post-install readiness gate polled proxy signals (Skyhook status.status, ClusterPolicy=ready, metrics APIServices) that read "ready" during the transient window before nodewright (skyhook) reboots the GPU node for tuning and re-inits the gpu-operator operands -- so expected-resources still failed in the later validate step. Replace the proxy predicates with a retry loop running the authoritative deployment phase (aicr validate --phase deployment) until it passes; it includes expected-resources and rides through the reboot via its own poll plus the outer retry. Run against an evidence-stripped config copy so the gate never emits an evidence bundle. Raise install/job timeouts to 45m/130m to fit the gate budget on top of helmfile apply.
📝 WalkthroughWalkthroughThe post-install readiness gate in both UAT AWS and GCP runner scripts ( Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/uat-aws.yaml:
- Around line 43-48: The job timeout-minutes value is set to 130, which equals
the total of all visible UAT phases (prep 15 + install 45 + validate 40 + train
25 + verify 5 = 130 minutes), leaving no headroom for checkout, build,
provisioning, evidence upload, and the Destroy Cluster cleanup step. Increase
the timeout-minutes value at line 48 to add sufficient buffer beyond the
130-minute UAT phase total to ensure cleanup steps can complete without the job
being terminated due to timeout.
In @.github/workflows/uat-gcp.yaml:
- Around line 43-48: The timeout-minutes field is currently set to 130, which
exactly matches the total duration of the UAT phase steps (prep 15 + install 45
+ validate 40 + train 25 + verify 5 minutes), leaving no buffer for non-UAT
operations like checkout, build, provisioning, evidence upload, and the Destroy
Cluster teardown step. Increase the timeout-minutes value to provide sufficient
headroom beyond the 130-minute UAT phase total so that cleanup steps can
complete without risk of job timeout truncation.
In `@tests/uat/aws/run`:
- Around line 197-210: The readiness deadline check is only performed after a
failed validation attempt returns, which allows the until loop on line 200 to
start the AICR_BIN validate command after the deadline has already passed or run
beyond the remaining time budget. Add a deadline check before executing the
validation command in each loop iteration to ensure that no validation attempt
is launched after the ready_deadline has expired, enforcing the timeout
constraint proactively rather than reactively.
In `@tests/uat/gcp/run`:
- Around line 197-210: The readiness deadline is checked only after the validate
command executes in the until loop, allowing validation attempts to start or
complete after the deadline has passed. Add a deadline check before executing
the validate command on line 200 to ensure no validation attempts run past the
READINESS_TIMEOUT_SECONDS window. Move or restructure the deadline validation
(currently on line 201) to occur before the validate command runs in each
iteration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 41cf8ddf-a162-422a-a13f-a812e1807313
📒 Files selected for processing (4)
.github/workflows/uat-aws.yaml.github/workflows/uat-gcp.yamltests/uat/aws/runtests/uat/gcp/run
| # Bumped from 90 to 110 for the validate step running all phases (deployment | ||
| # readiness gate + conformance + NCCL performance), then to 130 so the | ||
| # install step's deployment-phase readiness gate (up to | ||
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | ||
| # helmfile apply without the job cap truncating later steps. | ||
| timeout-minutes: 130 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Leave job-level headroom for teardown and artifact steps.
Line 48 sets the job cap to 130 minutes, but the visible UAT phase caps already total 130 minutes: prep 15 + install 45 + validate 40 + train 25 + verify 5. That leaves no budget for checkout/build/provisioning, evidence upload, or Destroy Cluster, so a near-cap run can hit the job timeout before cleanup.
Proposed timeout headroom
- timeout-minutes: 130
+ timeout-minutes: 170📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Bumped from 90 to 110 for the validate step running all phases (deployment | |
| # readiness gate + conformance + NCCL performance), then to 130 so the | |
| # install step's deployment-phase readiness gate (up to | |
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | |
| # helmfile apply without the job cap truncating later steps. | |
| timeout-minutes: 130 | |
| # Bumped from 90 to 110 for the validate step running all phases (deployment | |
| # readiness gate + conformance + NCCL performance), then to 130 so the | |
| # install step's deployment-phase readiness gate (up to | |
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | |
| # helmfile apply without the job cap truncating later steps. | |
| timeout-minutes: 170 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/uat-aws.yaml around lines 43 - 48, The job timeout-minutes
value is set to 130, which equals the total of all visible UAT phases (prep 15 +
install 45 + validate 40 + train 25 + verify 5 = 130 minutes), leaving no
headroom for checkout, build, provisioning, evidence upload, and the Destroy
Cluster cleanup step. Increase the timeout-minutes value at line 48 to add
sufficient buffer beyond the 130-minute UAT phase total to ensure cleanup steps
can complete without the job being terminated due to timeout.
| # Bumped from 90 to 110 for the validate step running all phases (deployment | ||
| # readiness gate + conformance + NCCL performance), then to 130 so the | ||
| # install step's deployment-phase readiness gate (up to | ||
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | ||
| # helmfile apply without the job cap truncating later steps. | ||
| timeout-minutes: 130 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Leave job-level headroom for teardown and artifact steps.
Line 48 sets the job cap to 130 minutes, but the visible UAT phase caps already total 130 minutes: prep 15 + install 45 + validate 40 + train 25 + verify 5. That leaves no budget for checkout/build/provisioning, evidence upload, or Destroy Cluster, so a near-cap run can hit the job timeout before cleanup.
Proposed timeout headroom
- timeout-minutes: 130
+ timeout-minutes: 170📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Bumped from 90 to 110 for the validate step running all phases (deployment | |
| # readiness gate + conformance + NCCL performance), then to 130 so the | |
| # install step's deployment-phase readiness gate (up to | |
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | |
| # helmfile apply without the job cap truncating later steps. | |
| timeout-minutes: 130 | |
| # Bumped from 90 to 110 for the validate step running all phases (deployment | |
| # readiness gate + conformance + NCCL performance), then to 130 so the | |
| # install step's deployment-phase readiness gate (up to | |
| # READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of | |
| # helmfile apply without the job cap truncating later steps. | |
| timeout-minutes: 170 |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/uat-gcp.yaml around lines 43 - 48, The timeout-minutes
field is currently set to 130, which exactly matches the total duration of the
UAT phase steps (prep 15 + install 45 + validate 40 + train 25 + verify 5
minutes), leaving no buffer for non-UAT operations like checkout, build,
provisioning, evidence upload, and the Destroy Cluster teardown step. Increase
the timeout-minutes value to provide sufficient headroom beyond the 130-minute
UAT phase total so that cleanup steps can complete without risk of job timeout
truncation.
| local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS )) | ||
| wait_for "nodewright (skyhook) tuning complete" "$(( ready_deadline - SECONDS ))" \ | ||
| _skyhook_complete | ||
| wait_for "gpu-operator ClusterPolicy state=ready" "$(( ready_deadline - SECONDS ))" \ | ||
| _clusterpolicy_ready | ||
| wait_for "custom.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \ | ||
| _apiservice_available v1beta1.custom.metrics.k8s.io | ||
| wait_for "external.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \ | ||
| _apiservice_available v1beta1.external.metrics.k8s.io | ||
| local attempt=1 | ||
| echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)" | ||
| until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do | ||
| if (( SECONDS >= ready_deadline )); then | ||
| echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2 | ||
| "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true | ||
| rm -f "${gate_config}" | ||
| echo "::endgroup::" | ||
| exit 1 | ||
| fi | ||
| echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..." | ||
| attempt=$(( attempt + 1 )) | ||
| sleep 15 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Enforce the readiness deadline before and during each validation attempt.
Line 200 can start aicr validate after the deadline has already expired, and each invocation can run past the remaining budget because SECONDS is only checked after a failed attempt returns. That can pass the gate after READINESS_TIMEOUT_SECONDS or let the GitHub step timeout bypass the intended diagnostic path.
Proposed bounded retry loop
local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
local attempt=1
+ local ready=false
echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
- until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
- if (( SECONDS >= ready_deadline )); then
- echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
- "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
- rm -f "${gate_config}"
- echo "::endgroup::"
- exit 1
- fi
+ while (( SECONDS < ready_deadline )); do
+ remaining=$(( ready_deadline - SECONDS ))
+ if timeout "${remaining}" "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; then
+ ready=true
+ break
+ fi
+ if (( SECONDS >= ready_deadline )); then
+ break
+ fi
echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
attempt=$(( attempt + 1 ))
- sleep 15
+ sleep_for=15
+ if (( SECONDS + sleep_for > ready_deadline )); then
+ sleep_for=$(( ready_deadline - SECONDS ))
+ fi
+ (( sleep_for > 0 )) && sleep "${sleep_for}"
done
+ if [[ "${ready}" != "true" ]]; then
+ echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
+ timeout 300 "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
+ rm -f "${gate_config}"
+ echo "::endgroup::"
+ exit 1
+ fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/aws/run` around lines 197 - 210, The readiness deadline check is
only performed after a failed validation attempt returns, which allows the until
loop on line 200 to start the AICR_BIN validate command after the deadline has
already passed or run beyond the remaining time budget. Add a deadline check
before executing the validation command in each loop iteration to ensure that no
validation attempt is launched after the ready_deadline has expired, enforcing
the timeout constraint proactively rather than reactively.
| local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS )) | ||
| wait_for "nodewright (skyhook) tuning complete" "$(( ready_deadline - SECONDS ))" \ | ||
| _skyhook_complete | ||
| wait_for "gpu-operator ClusterPolicy state=ready" "$(( ready_deadline - SECONDS ))" \ | ||
| _clusterpolicy_ready | ||
| wait_for "custom.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \ | ||
| _apiservice_available v1beta1.custom.metrics.k8s.io | ||
| wait_for "external.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \ | ||
| _apiservice_available v1beta1.external.metrics.k8s.io | ||
| local attempt=1 | ||
| echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)" | ||
| until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do | ||
| if (( SECONDS >= ready_deadline )); then | ||
| echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2 | ||
| "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true | ||
| rm -f "${gate_config}" | ||
| echo "::endgroup::" | ||
| exit 1 | ||
| fi | ||
| echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..." | ||
| attempt=$(( attempt + 1 )) | ||
| sleep 15 |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Enforce the readiness deadline before and during each validation attempt.
Line 200 can start aicr validate after the deadline has already expired, and each invocation can run past the remaining budget because SECONDS is only checked after a failed attempt returns. That can pass the gate after READINESS_TIMEOUT_SECONDS or let the GitHub step timeout bypass the intended diagnostic path.
Proposed bounded retry loop
local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
local attempt=1
+ local ready=false
echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
- until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
- if (( SECONDS >= ready_deadline )); then
- echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
- "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
- rm -f "${gate_config}"
- echo "::endgroup::"
- exit 1
- fi
+ while (( SECONDS < ready_deadline )); do
+ remaining=$(( ready_deadline - SECONDS ))
+ if timeout "${remaining}" "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; then
+ ready=true
+ break
+ fi
+ if (( SECONDS >= ready_deadline )); then
+ break
+ fi
echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
attempt=$(( attempt + 1 ))
- sleep 15
+ sleep_for=15
+ if (( SECONDS + sleep_for > ready_deadline )); then
+ sleep_for=$(( ready_deadline - SECONDS ))
+ fi
+ (( sleep_for > 0 )) && sleep "${sleep_for}"
done
+ if [[ "${ready}" != "true" ]]; then
+ echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
+ timeout 300 "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
+ rm -f "${gate_config}"
+ echo "::endgroup::"
+ exit 1
+ fi🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/uat/gcp/run` around lines 197 - 210, The readiness deadline is checked
only after the validate command executes in the until loop, allowing validation
attempts to start or complete after the deadline has passed. Add a deadline
check before executing the validate command on line 200 to ensure no validation
attempts run past the READINESS_TIMEOUT_SECONDS window. Move or restructure the
deadline validation (currently on line 201) to occur before the validate command
runs in each iteration.
Summary
Replace the UAT post-install readiness gate's proxy signals with the authoritative check: run
aicr validate --phase deploymentin a retry loop until it passes, then proceed to--phase all.Motivation / Context
The readiness gate added in #1426/#1429 polled proxy signals —
Skyhook.status.status == complete,ClusterPolicy.state == ready, and the metrics APIServices. On a fresh cluster those read "ready" during a transient window: nodewright (skyhook) reboots the GPU node for tuning after gpu-operator first reports ready, which re-inits the gpu-operator operands (their pods drop back toInit/Pending). The gate passed on that pre-reboot state, and the later--phase allrun's deployment-phaseexpected-resourcescheck then failed on thePendingpods.Observed directly in run
28054511667: the post-install snapshot showed the GPU nodeReady,SchedulingDisabled(skyhook cordon) with the tuning pod atInit:1/3, yet the gate reportednodewright (skyhook) tuning completeandClusterPolicy state=readyimmediately after. Investigatingdeploy.shconfirmed it has no extra gpu-operator wait we were missing — its gpu-operator readiness is the sameClusterPolicy.state=readyassertion — so the fix is to stop proxying and gate on the real deployment phase.Related: #1426, #1429 (this supersedes their proxy-signal gate)
Fixes: N/A
Type of Change
Component(s) Affected
tests/uat/{aws,gcp})Implementation Notes
wait_for,_skyhook_complete,_clusterpolicy_ready,_apiservice_available) from both runners.aicr validate --phase deploymentuntil it passes, bounded byREADINESS_TIMEOUT_SECONDS(20m). That phase isexpected-resources(+ ClusterPolicy/DRA/nodewright), so it can't pass before the cluster genuinely converges, and it rides through the skyhook reboot via expected-resources' own poll plus the outer retry. Fail-closed: a final diagnostic run +exit 1if it never passes in budget (failOnErrordefaults true, sovalidateexits non-zero on phase failure).yq 'del(.spec.validate.evidence)'of the test config so the gate never emits/pushes an evidence bundle (emit fires whenevercfg.evidence != nil,pkg/cli/validate.go:389). The full bundle is still emitted once by the conformance phase (--phase all).25→45, job110→130— the gate runs inside the install step for up toREADINESS_TIMEOUT_SECONDSon top of one helmfile attempt, and the job cap was the sum of step timeouts.Testing
set -e-safe (the failinguntilcondition doesn't exit the script), success and timeout paths both correct.del(.spec.validate.evidence)preserves the agent block.28054511667(above). yamllint/shellcheck also run in CI viamake qualify.Risk Assessment
tests/uat); no product/runtime code; fail-closed; revertable.Rollout notes: The gate now deploys validator Jobs/RBAC during the install step (cleaned up after), and
--phase allredeploys them in the conformance step — minor extra time; the deployment phase passes instantly there since the cluster is already converged. Budget overridable viaREADINESS_TIMEOUT_SECONDS.Checklist
make testwith-race) — N/A, no Go changesmake lint) —bash -n+ shellcheck verified locally; yamllint/shellcheck also run in CIgit commit -S)