Skip to content

fix(uat): gate install on the deployment validation phase#1439

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
njhensley:fix/uat-deployment-phase-gate
Jun 23, 2026
Merged

fix(uat): gate install on the deployment validation phase#1439
mchmarny merged 1 commit into
NVIDIA:mainfrom
njhensley:fix/uat-deployment-phase-gate

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Replace the UAT post-install readiness gate's proxy signals with the authoritative check: run aicr validate --phase deployment in a retry loop until it passes, then proceed to --phase all.

Motivation / Context

The readiness gate added in #1426/#1429 polled proxy signals — Skyhook.status.status == complete, ClusterPolicy.state == ready, and the metrics APIServices. On a fresh cluster those read "ready" during a transient window: nodewright (skyhook) reboots the GPU node for tuning after gpu-operator first reports ready, which re-inits the gpu-operator operands (their pods drop back to Init/Pending). The gate passed on that pre-reboot state, and the later --phase all run's deployment-phase expected-resources check then failed on the Pending pods.

Observed directly in run 28054511667: the post-install snapshot showed the GPU node Ready,SchedulingDisabled (skyhook cordon) with the tuning pod at Init:1/3, yet the gate reported nodewright (skyhook) tuning complete and ClusterPolicy state=ready immediately after. Investigating deploy.sh confirmed it has no extra gpu-operator wait we were missing — its gpu-operator readiness is the same ClusterPolicy.state=ready assertion — so the fix is to stop proxying and gate on the real deployment phase.

Related: #1426, #1429 (this supersedes their proxy-signal gate)
Fixes: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Other: UAT test harness (tests/uat/{aws,gcp})

Implementation Notes

  • Removed the proxy predicates (wait_for, _skyhook_complete, _clusterpolicy_ready, _apiservice_available) from both runners.
  • Added a retry loop running aicr validate --phase deployment until it passes, bounded by READINESS_TIMEOUT_SECONDS (20m). That phase is expected-resources (+ ClusterPolicy/DRA/nodewright), so it can't pass before the cluster genuinely converges, and it rides through the skyhook reboot via expected-resources' own poll plus the outer retry. Fail-closed: a final diagnostic run + exit 1 if it never passes in budget (failOnError defaults true, so validate exits non-zero on phase failure).
  • Evidence-stripped gate config: the loop runs against yq 'del(.spec.validate.evidence)' of the test config so the gate never emits/pushes an evidence bundle (emit fires whenever cfg.evidence != nil, pkg/cli/validate.go:389). The full bundle is still emitted once by the conformance phase (--phase all).
  • Timeouts: install step 25→45, job 110→130 — the gate runs inside the install step for up to READINESS_TIMEOUT_SECONDS on top of one helmfile attempt, and the job cap was the sum of step timeouts.

Testing

bash -n tests/uat/aws/run tests/uat/gcp/run        # OK
shellcheck -S warning tests/uat/{aws,gcp}/run      # clean
  • Gate-loop logic verified in isolation (stub failing twice then succeeding): retries, set -e-safe (the failing until condition doesn't exit the script), success and timeout paths both correct.
  • Workflow YAML parses; del(.spec.validate.evidence) preserves the agent block.
  • Not validated end-to-end on a live cluster — the UAT cluster was torn down before re-test. The proxy-gate's premature pass that this fixes was captured in run 28054511667 (above). yamllint/shellcheck also run in CI via make qualify.

Risk Assessment

  • Low — UAT test harness only (tests/uat); no product/runtime code; fail-closed; revertable.

Rollout notes: The gate now deploys validator Jobs/RBAC during the install step (cleaned up after), and --phase all redeploys them in the conformance step — minor extra time; the deployment phase passes instantly there since the cluster is already converged. Budget overridable via READINESS_TIMEOUT_SECONDS.

Checklist

  • Tests pass locally (make test with -race) — N/A, no Go changes
  • Linter passes (make lint) — bash -n + shellcheck verified locally; yamllint/shellcheck also run in CI
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (shell harness; gate logic validated as above)
  • I updated docs if user-facing behavior changed — N/A (internal UAT harness)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…gnals

The post-install readiness gate polled proxy signals (Skyhook
status.status,
ClusterPolicy=ready, metrics APIServices) that read "ready" during the
transient
window before nodewright (skyhook) reboots the GPU node for tuning and
re-inits
the gpu-operator operands -- so expected-resources still failed in the
later
validate step. Replace the proxy predicates with a retry loop running
the
authoritative deployment phase (aicr validate --phase deployment) until
it
passes; it includes expected-resources and rides through the reboot via
its own
poll plus the outer retry. Run against an evidence-stripped config copy
so the
gate never emits an evidence bundle. Raise install/job timeouts to
45m/130m to
fit the gate budget on top of helmfile apply.
@njhensley njhensley requested a review from a team as a code owner June 23, 2026 23:17
@njhensley njhensley added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 23, 2026
@njhensley njhensley requested a review from a team as a code owner June 23, 2026 23:17
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The post-install readiness gate in both UAT AWS and GCP runner scripts (tests/uat/aws/run, tests/uat/gcp/run) is replaced. The previous implementation used wait_for polling with per-resource predicate functions (nodewright/skyhook completion, ClusterPolicy readiness, metrics APIService availability). The new implementation retries aicr validate --phase deployment in a deadline loop against a temporary config copy with spec.validate.evidence deleted to suppress evidence bundle emission. On timeout, a final diagnostic validation call is made before the install fails. The associated CI workflow files raise the uat-aws and uat-gcp job-level timeout from 110 to 130 minutes and the install step timeout from 25 to 45 minutes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1416: Adjusts UAT validation gating by switching conformance to aicr validate --phase all, directly intersecting with this PR's change to gate on aicr validate --phase deployment in phase_install.
  • NVIDIA/aicr#1426: Adds the post-install readiness gating in phase_install that this PR refactors, with direct overlap in tests/uat/{aws,gcp}/run.
  • NVIDIA/aicr#1429: Modifies the same UAT post-helmfile-apply readiness gate logic, adding _skyhook_complete predicate and reordering checks that this PR removes.

Suggested labels

area/ci, size/M, area/tests

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly describes the main change: gating the UAT install phase on deployment validation, which is the central fix replacing proxy signal checks with authoritative validation.
Description check ✅ Passed The description comprehensively explains the problem (proxy signals passing prematurely before skyhook reboot), the solution (retry loop with aicr validate --phase deployment), and all implementation details including removed predicates, timeout adjustments, and testing approach.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/uat-aws.yaml:
- Around line 43-48: The job timeout-minutes value is set to 130, which equals
the total of all visible UAT phases (prep 15 + install 45 + validate 40 + train
25 + verify 5 = 130 minutes), leaving no headroom for checkout, build,
provisioning, evidence upload, and the Destroy Cluster cleanup step. Increase
the timeout-minutes value at line 48 to add sufficient buffer beyond the
130-minute UAT phase total to ensure cleanup steps can complete without the job
being terminated due to timeout.

In @.github/workflows/uat-gcp.yaml:
- Around line 43-48: The timeout-minutes field is currently set to 130, which
exactly matches the total duration of the UAT phase steps (prep 15 + install 45
+ validate 40 + train 25 + verify 5 minutes), leaving no buffer for non-UAT
operations like checkout, build, provisioning, evidence upload, and the Destroy
Cluster teardown step. Increase the timeout-minutes value to provide sufficient
headroom beyond the 130-minute UAT phase total so that cleanup steps can
complete without risk of job timeout truncation.

In `@tests/uat/aws/run`:
- Around line 197-210: The readiness deadline check is only performed after a
failed validation attempt returns, which allows the until loop on line 200 to
start the AICR_BIN validate command after the deadline has already passed or run
beyond the remaining time budget. Add a deadline check before executing the
validation command in each loop iteration to ensure that no validation attempt
is launched after the ready_deadline has expired, enforcing the timeout
constraint proactively rather than reactively.

In `@tests/uat/gcp/run`:
- Around line 197-210: The readiness deadline is checked only after the validate
command executes in the until loop, allowing validation attempts to start or
complete after the deadline has passed. Add a deadline check before executing
the validate command on line 200 to ensure no validation attempts run past the
READINESS_TIMEOUT_SECONDS window. Move or restructure the deadline validation
(currently on line 201) to occur before the validate command runs in each
iteration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 41cf8ddf-a162-422a-a13f-a812e1807313

📥 Commits

Reviewing files that changed from the base of the PR and between 58721c5 and ea3c1d5.

📒 Files selected for processing (4)
  • .github/workflows/uat-aws.yaml
  • .github/workflows/uat-gcp.yaml
  • tests/uat/aws/run
  • tests/uat/gcp/run

Comment on lines +43 to +48
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 130

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Leave job-level headroom for teardown and artifact steps.

Line 48 sets the job cap to 130 minutes, but the visible UAT phase caps already total 130 minutes: prep 15 + install 45 + validate 40 + train 25 + verify 5. That leaves no budget for checkout/build/provisioning, evidence upload, or Destroy Cluster, so a near-cap run can hit the job timeout before cleanup.

Proposed timeout headroom
-    timeout-minutes: 130
+    timeout-minutes: 170
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 130
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-aws.yaml around lines 43 - 48, The job timeout-minutes
value is set to 130, which equals the total of all visible UAT phases (prep 15 +
install 45 + validate 40 + train 25 + verify 5 = 130 minutes), leaving no
headroom for checkout, build, provisioning, evidence upload, and the Destroy
Cluster cleanup step. Increase the timeout-minutes value at line 48 to add
sufficient buffer beyond the 130-minute UAT phase total to ensure cleanup steps
can complete without the job being terminated due to timeout.

Comment on lines +43 to +48
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 130

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Leave job-level headroom for teardown and artifact steps.

Line 48 sets the job cap to 130 minutes, but the visible UAT phase caps already total 130 minutes: prep 15 + install 45 + validate 40 + train 25 + verify 5. That leaves no budget for checkout/build/provisioning, evidence upload, or Destroy Cluster, so a near-cap run can hit the job timeout before cleanup.

Proposed timeout headroom
-    timeout-minutes: 130
+    timeout-minutes: 170
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 130
# Bumped from 90 to 110 for the validate step running all phases (deployment
# readiness gate + conformance + NCCL performance), then to 130 so the
# install step's deployment-phase readiness gate (up to
# READINESS_TIMEOUT_SECONDS, spanning a nodewright reboot) fits on top of
# helmfile apply without the job cap truncating later steps.
timeout-minutes: 170
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/uat-gcp.yaml around lines 43 - 48, The timeout-minutes
field is currently set to 130, which exactly matches the total duration of the
UAT phase steps (prep 15 + install 45 + validate 40 + train 25 + verify 5
minutes), leaving no buffer for non-UAT operations like checkout, build,
provisioning, evidence upload, and the Destroy Cluster teardown step. Increase
the timeout-minutes value to provide sufficient headroom beyond the 130-minute
UAT phase total so that cleanup steps can complete without risk of job timeout
truncation.

Comment thread tests/uat/aws/run
Comment on lines 197 to +210
local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
wait_for "nodewright (skyhook) tuning complete" "$(( ready_deadline - SECONDS ))" \
_skyhook_complete
wait_for "gpu-operator ClusterPolicy state=ready" "$(( ready_deadline - SECONDS ))" \
_clusterpolicy_ready
wait_for "custom.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \
_apiservice_available v1beta1.custom.metrics.k8s.io
wait_for "external.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \
_apiservice_available v1beta1.external.metrics.k8s.io
local attempt=1
echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
if (( SECONDS >= ready_deadline )); then
echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
"${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
rm -f "${gate_config}"
echo "::endgroup::"
exit 1
fi
echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
attempt=$(( attempt + 1 ))
sleep 15

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Enforce the readiness deadline before and during each validation attempt.

Line 200 can start aicr validate after the deadline has already expired, and each invocation can run past the remaining budget because SECONDS is only checked after a failed attempt returns. That can pass the gate after READINESS_TIMEOUT_SECONDS or let the GitHub step timeout bypass the intended diagnostic path.

Proposed bounded retry loop
   local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
   local attempt=1
+  local ready=false
   echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
-  until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
-    if (( SECONDS >= ready_deadline )); then
-      echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
-      "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
-      rm -f "${gate_config}"
-      echo "::endgroup::"
-      exit 1
-    fi
+  while (( SECONDS < ready_deadline )); do
+    remaining=$(( ready_deadline - SECONDS ))
+    if timeout "${remaining}" "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; then
+      ready=true
+      break
+    fi
+    if (( SECONDS >= ready_deadline )); then
+      break
+    fi
     echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
     attempt=$(( attempt + 1 ))
-    sleep 15
+    sleep_for=15
+    if (( SECONDS + sleep_for > ready_deadline )); then
+      sleep_for=$(( ready_deadline - SECONDS ))
+    fi
+    (( sleep_for > 0 )) && sleep "${sleep_for}"
   done
+  if [[ "${ready}" != "true" ]]; then
+    echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
+    timeout 300 "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
+    rm -f "${gate_config}"
+    echo "::endgroup::"
+    exit 1
+  fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/aws/run` around lines 197 - 210, The readiness deadline check is
only performed after a failed validation attempt returns, which allows the until
loop on line 200 to start the AICR_BIN validate command after the deadline has
already passed or run beyond the remaining time budget. Add a deadline check
before executing the validation command in each loop iteration to ensure that no
validation attempt is launched after the ready_deadline has expired, enforcing
the timeout constraint proactively rather than reactively.

Comment thread tests/uat/gcp/run
Comment on lines 197 to +210
local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
wait_for "nodewright (skyhook) tuning complete" "$(( ready_deadline - SECONDS ))" \
_skyhook_complete
wait_for "gpu-operator ClusterPolicy state=ready" "$(( ready_deadline - SECONDS ))" \
_clusterpolicy_ready
wait_for "custom.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \
_apiservice_available v1beta1.custom.metrics.k8s.io
wait_for "external.metrics.k8s.io APIService Available" "$(( ready_deadline - SECONDS ))" \
_apiservice_available v1beta1.external.metrics.k8s.io
local attempt=1
echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
if (( SECONDS >= ready_deadline )); then
echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
"${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
rm -f "${gate_config}"
echo "::endgroup::"
exit 1
fi
echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
attempt=$(( attempt + 1 ))
sleep 15

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Enforce the readiness deadline before and during each validation attempt.

Line 200 can start aicr validate after the deadline has already expired, and each invocation can run past the remaining budget because SECONDS is only checked after a failed attempt returns. That can pass the gate after READINESS_TIMEOUT_SECONDS or let the GitHub step timeout bypass the intended diagnostic path.

Proposed bounded retry loop
   local ready_deadline=$(( SECONDS + READINESS_TIMEOUT_SECONDS ))
   local attempt=1
+  local ready=false
   echo "::group::Readiness gate: validate --phase deployment (timeout ${READINESS_TIMEOUT_SECONDS}s)"
-  until "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; do
-    if (( SECONDS >= ready_deadline )); then
-      echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
-      "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
-      rm -f "${gate_config}"
-      echo "::endgroup::"
-      exit 1
-    fi
+  while (( SECONDS < ready_deadline )); do
+    remaining=$(( ready_deadline - SECONDS ))
+    if timeout "${remaining}" "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null >/dev/null 2>&1; then
+      ready=true
+      break
+    fi
+    if (( SECONDS >= ready_deadline )); then
+      break
+    fi
     echo "deployment phase not ready (attempt ${attempt}); retrying in 15s..."
     attempt=$(( attempt + 1 ))
-    sleep 15
+    sleep_for=15
+    if (( SECONDS + sleep_for > ready_deadline )); then
+      sleep_for=$(( ready_deadline - SECONDS ))
+    fi
+    (( sleep_for > 0 )) && sleep "${sleep_for}"
   done
+  if [[ "${ready}" != "true" ]]; then
+    echo "::error::deployment readiness gate did not pass within ${READINESS_TIMEOUT_SECONDS}s; final attempt:" >&2
+    timeout 300 "${AICR_BIN}" validate --config "${gate_config}" --phase deployment --output /dev/null 2>&1 || true
+    rm -f "${gate_config}"
+    echo "::endgroup::"
+    exit 1
+  fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/uat/gcp/run` around lines 197 - 210, The readiness deadline is checked
only after the validate command executes in the until loop, allowing validation
attempts to start or complete after the deadline has passed. Add a deadline
check before executing the validate command on line 200 to ensure no validation
attempts run past the READINESS_TIMEOUT_SECONDS window. Move or restructure the
deadline validation (currently on line 201) to occur before the validate command
runs in each iteration.

@mchmarny mchmarny enabled auto-merge (squash) June 23, 2026 23:23
@mchmarny mchmarny merged commit 026804e into NVIDIA:main Jun 23, 2026
32 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ci area/tests size/L theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants