script(bundler): clarify deploy.sh completion != readiness by yuanchen8911 · Pull Request #609 · NVIDIA/aicr

yuanchen8911 · 2026-04-17T19:08:25Z

Summary

deploy.sh now clearly distinguishes install/apply status from workload readiness in its final output and docs. No flag or behavior changes to the install flow.

Motivation / Context

deploy.sh currently ends with a success-like message that users can reasonably read as "ready for GPU workloads." For GPU bundles that is not the same thing: node tuning (Skyhook), GPU operator operand rollout, and DRA kubelet plugin registration continue asynchronously after the script exits.

This PR keeps the scope intentionally narrow to the messaging/docs portion of #607: make the final output accurate, distinguish install/apply completion from workload readiness, and avoid overstating success in --best-effort partial-failure runs.

This is a deploy-side symmetry gap after the recent undeploy pre/post-flight hardening in #602.

Fixes: #607 (messaging/docs portion only)
Issue: #607
Related: #602, #516

Type of Change

Bug fix (non-breaking change that fixes an issue)
Documentation update

Component(s) Affected

Bundlers (pkg/bundler, pkg/component/*)
Docs/examples (docs/, examples/)

Implementation Notes

Scope is intentionally narrow (per the revised issue #607):

deploy.sh.tmpl: keep the full-success status line (Deployment complete.), add a distinct partial-failure status line for --best-effort runs (Deployment completed with non-fatal errors (--best-effort).), and use a shared neutral note below both paths explaining that install/apply status is not workload readiness.
docs/user/cli-reference.md: tighten the --no-wait row to name what it actually skips (Helm chart-level wait where AICR uses it) and add a note paragraph explicitly distinguishing chart-level wait, --best-effort partial failures, and bundle-level convergence.
README.md.tmpl: per-bundle README gets the same readiness note after the --no-wait example so bundle consumers see it in context.
helm_test.go: render test covers both final status lines plus the shared readiness note wording.

Broader readiness design (day-0 bridge Jobs under #516, day-N re-verification possibly in the validator) is tracked as follow-ups on #607. This PR is intentionally not choosing between those longer-term designs.

Testing

go test -race ./pkg/bundler/deployer/helm/...
# ok  github.com/NVIDIA/aicr/pkg/bundler/deployer/helm

# Earlier full `make qualify` passed on this PR before the final
# best-effort wording refinement; this follow-up reran the package test
# after updating the script/docs/test text.

Sample of rendered deploy.sh tail after this change:

if [[ -n "${FAILED_COMPONENTS}" ]]; then
  echo "WARNING: the following components failed:${FAILED_COMPONENTS}"
  echo "Deployment completed with non-fatal errors (--best-effort)."
else
  echo "Deployment complete."
fi

echo "NOTE: The above status reflects Helm install and manifest apply results,"
echo "not whether the cluster is ready for GPU workloads. On fresh"
echo "GPU nodes, cluster convergence may continue asynchronously after this"

Risk Assessment

Low — Messaging and docs only. No flag changes, no exit-code changes, no install-path changes. The full-success path still prints Deployment complete.; --best-effort partial-failure runs now print a more accurate status line.

Rollout notes: None — backward-compatible. Operators parsing deploy output should treat the final status line as install/apply status only, not workload readiness.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-04-17T19:09:11Z

🌿 Preview your docs: https://nvidia-preview-fix-deploy-sh-readiness-messaging.docs.buildwithfern.com/aicr

coderabbitai · 2026-04-17T19:11:17Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Clarified CLI and Helm deploy template docs and the generated deploy.sh to state that --no-wait skips Helm chart-level waiting only where AICR would pass --wait, while hook --timeout behavior remains. The deploy script now conditionally prints either a success line or a “completed with non-fatal errors (--best-effort)” line and includes a multi-line NOTE distinguishing Helm install/manifest-apply completion from full GPU workload readiness, listing asynchronous convergence items (e.g., Skyhook tuning, GPU operator operand rollout, NVIDIA DRA kubelet plugin registration) and linking to the CLI reference. Added a unit test verifying the final messages and ordering in deploy.sh.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Linked Issues check	❓ Inconclusive	This PR addresses issue `#607`'s messaging/docs portion only. The PR explicitly states it keeps scope narrow to messaging and documentation, fixing the partial objective of distinguishing install completion from workload readiness. However, the linked issue `#607` requires comprehensive validation checks (Skyhook CR completion, DRA kubelet-plugin readiness, GPU Operator ClusterPolicy state, and deployer-neutral API implementation), which are out of scope for this PR—they are tracked as follow-ups.	Clarify in the PR description or linked issue whether this messaging PR is intended as a complete fix for `#607` or if implementation of the validation checks is deferred. If deferred, ensure follow-up issues track the validation implementation work.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description is well-related to the changeset, clearly explaining the motivation (distinguishing install/apply status from workload readiness), scope (messaging/docs only), and changes across multiple files (deploy.sh.tmpl, CLI reference, README, and tests).
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the stated scope: documentation updates (cli-reference.md, README.md.tmpl, deploy.sh.tmpl) and test updates for the clarified messaging. No behavioral, flag, or exit-code changes introduce out-of-scope modifications.
Title check	✅ Passed	The title accurately and concisely summarizes the main change: clarifying that deploy.sh completion does not equal readiness for GPU workloads, which is the core purpose across all modified files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The success note currently prints an unconditional
"Deployment complete" message that is inaccurate when the script was run with
the --best-effort option; update the deploy.sh.tmpl message logic to detect the
best-effort mode (the CLI flag "--best-effort" or the template variable used to
represent it) and change the text so it does not assert all installs/applies
succeeded — e.g., conditionally append a qualifier like "may have had non-fatal
failures in --best-effort mode" or switch to a neutral phrasing when best-effort
is enabled; locate the existing echo that outputs "NOTE: \"Deployment complete\"
means Helm installs and manifest applies" and modify it accordingly.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the "Deployment complete" note to explicitly state
that it only means Helm install and manifest apply succeeded, and to add a
sentence about the --best-effort flag: when run with --best-effort the script
may ignore component failures so some components can be left failed or
incomplete and convergence may continue asynchronously; mention that operators
should check component statuses and logs rather than relying solely on the
"Deployment complete" message (refer to the existing note text and the
--best-effort flag in README.md.tmpl to locate where to insert this
clarification).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 8b79f8e6-666d-4ade-8bff-07803e54972d

📥 Commits

Reviewing files that changed from the base of the PR and between eef12fd and 0e36d6e.

📒 Files selected for processing (4)

docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

coderabbitai

♻️ Duplicate comments (2)

pkg/bundler/deployer/helm/templates/README.md.tmpl (1)

59-60: ⚠️ Potential issue | 🟠 Major

README note still guarantees success too strongly.

The note states installs/applies “succeeded,” but --best-effort can finish with component failures. The wording should be neutral and direct operators to warning output.

Suggested wording update

-> **Note:** "Deployment complete" means Helm installs and manifest applies succeeded, not that the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details.
+> **Note:** "Deployment complete" means the deploy script finished its install/apply pass. If `--best-effort` was used, one or more components may have failed (check warning lines/status). This does not mean the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details.

As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 59 - 60,
Update the README note so it does not guarantee success: change the "Deployment
complete" wording to a neutral statement that Helm install and manifest apply
steps finished but may include component failures when using the --best-effort
option; explicitly tell operators to inspect warning output and logs (and refer
to the AICR CLI Reference link already present) for any remaining component
errors and asynchronous GPU convergence tasks like Skyhook tuning, GPU operator
rollout, and DRA kubelet registration.

pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)

454-456: ⚠️ Potential issue | 🟠 Major

Final NOTE overstates success in --best-effort runs.

When Line 449 records failed components, Lines 454-455 still claim installs/applies “succeeded.” That is inaccurate and can mislead operators during partial-failure deployments.

Suggested fix

 echo "Deployment complete."
 echo
-echo "NOTE: \"Deployment complete\" means Helm installs and manifest applies"
-echo "succeeded, not that the cluster is ready for GPU workloads. On fresh"
+echo "NOTE: \"Deployment complete\" means the deploy script finished its install/apply pass,"
+if [[ -n "${FAILED_COMPONENTS}" ]]; then
+  echo "including non-fatal component failures under --best-effort (see WARNING lines above),"
+fi
+echo "not that the cluster is ready for GPU workloads. On fresh"

As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 454 - 456,
The current NOTE echo (echo "NOTE: \"Deployment complete\" means Helm installs
and manifest applies" ...) incorrectly claims success even when the script
recorded failures; update the deploy.sh.tmpl so the message is conditional on
the recorded failed-components variable (e.g., the variable used where failures
are recorded on Line 449, e.g., failed_components/FAILED_COMPONENTS) and either
(a) only print the existing "succeeded" wording when no failures were recorded,
or (b) print an amended message when failures exist such as "Helm installs and
manifest applies were attempted; some components failed (see failed components
list) — cluster may be partially deployed and not ready for GPU workloads."
Ensure the echo text replaces the existing two-line NOTE block and clearly
distinguishes full success vs. partial/failure in --best-effort runs.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The current NOTE echo (echo "NOTE: \"Deployment
complete\" means Helm installs and manifest applies" ...) incorrectly claims
success even when the script recorded failures; update the deploy.sh.tmpl so the
message is conditional on the recorded failed-components variable (e.g., the
variable used where failures are recorded on Line 449, e.g.,
failed_components/FAILED_COMPONENTS) and either (a) only print the existing
"succeeded" wording when no failures were recorded, or (b) print an amended
message when failures exist such as "Helm installs and manifest applies were
attempted; some components failed (see failed components list) — cluster may be
partially deployed and not ready for GPU workloads." Ensure the echo text
replaces the existing two-line NOTE block and clearly distinguishes full success
vs. partial/failure in --best-effort runs.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the README note so it does not guarantee success:
change the "Deployment complete" wording to a neutral statement that Helm
install and manifest apply steps finished but may include component failures
when using the --best-effort option; explicitly tell operators to inspect
warning output and logs (and refer to the AICR CLI Reference link already
present) for any remaining component errors and asynchronous GPU convergence
tasks like Skyhook tuning, GPU operator rollout, and DRA kubelet registration.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 43cdcec3-fcc0-466d-ad3e-6c7ce4ea21fd

📥 Commits

Reviewing files that changed from the base of the PR and between 0e36d6e and 0fa7d00.

📒 Files selected for processing (4)

docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1278: Replace the misspelled flag example string "--best-effrot" with the
correct "--best-effort" in the CLI reference text that explains unknown flag
rejection so readers copying the example won't be confused; search for the
literal "--best-effrot" and update it to "--best-effort".

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ceb90bb7-9ee4-478d-807a-7ee4b632a94d

📥 Commits

Reviewing files that changed from the base of the PR and between 17a3bcd and dfbc920.

📒 Files selected for processing (4)

docs/user/cli-reference.md
pkg/bundler/deployer/helm/helm_test.go
pkg/bundler/deployer/helm/templates/README.md.tmpl
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

…in deploy.sh deploy.sh prints "Deployment complete." once Helm installs and manifest applies succeed, which users can reasonably read as "ready for GPU workloads." For GPU bundles, that is not the same: node tuning (Skyhook), GPU operator operand rollout, and DRA kubelet plugin registration continue asynchronously after the script exits. On fresh GPU nodes that gap can be ~15–20 minutes. Scope of this change (messaging + docs only, per NVIDIA#607): - deploy.sh now keeps "Deployment complete." as the first line (preserves existing CI log matchers) and appends a clear note that install success is not the same as workload readiness, naming the main sources of lag. - cli-reference.md: the --no-wait row is tightened to name what it actually skips (Helm chart-level wait), plus a note paragraph that explicitly distinguishes chart-level wait from bundle-level convergence. - README.md.tmpl: per-bundle README gets the same note after the --no-wait example so bundle consumers see it in context. - helm_test.go: new render test asserts the readiness note is emitted after "Deployment complete." and covers the key phrases. No behavior changes, no new flags, no exit-code changes. Broader readiness design (day-0 bridge Jobs, day-N validator) tracked as follow-ups under NVIDIA#607. Refs: NVIDIA#607

yuanchen8911 · 2026-04-21T02:11:47Z

@njhensley fixed a typo. can you re-approve? thanks

yuanchen8911 requested a review from a team as a code owner April 17, 2026 19:08

yuanchen8911 added enhancement area/bundler labels Apr 17, 2026

github-actions Bot added area/docs size/M labels Apr 17, 2026

yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from eef12fd to 0e36d6e Compare April 17, 2026 19:22

yuanchen8911 mentioned this pull request Apr 17, 2026

Enhance deployment-phase validation to reflect post-install workload readiness #607

Closed

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl Outdated

Comment thread pkg/bundler/deployer/helm/templates/README.md.tmpl Outdated

yuanchen8911 requested review from ayuskauskas, lockwobr and mchmarny April 17, 2026 19:28

yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 0e36d6e to 0fa7d00 Compare April 17, 2026 19:50

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 0fa7d00 to 17a3bcd Compare April 17, 2026 20:07

yuanchen8911 mentioned this pull request Apr 20, 2026

enhance(validator): add targeted post-deployment GPU readiness checks #611

Merged

25 tasks

yuanchen8911 changed the title ~~fix(bundler): distinguish install completion from workload readiness in deploy.sh~~ docs(bundler): clarify deploy.sh install vs readiness status Apr 21, 2026

yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 17a3bcd to dfbc920 Compare April 21, 2026 00:38

yuanchen8911 changed the title ~~docs(bundler): clarify deploy.sh install vs readiness status~~ script(bundler): clarify deploy.sh completion vs readiness Apr 21, 2026

yuanchen8911 changed the title ~~script(bundler): clarify deploy.sh completion vs readiness~~ script(bundler): clarify deploy.sh completion != readiness Apr 21, 2026

yuanchen8911 requested a review from dims April 21, 2026 00:42

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread docs/user/cli-reference.md Outdated

njhensley previously approved these changes Apr 21, 2026

View reviewed changes

yuanchen8911 dismissed njhensley’s stale review via 6077d8b April 21, 2026 02:10

yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from dfbc920 to 6077d8b Compare April 21, 2026 02:10

yuanchen8911 requested a review from njhensley April 21, 2026 02:10

yuanchen8911 enabled auto-merge (squash) April 21, 2026 02:10

ayuskauskas approved these changes Apr 21, 2026

View reviewed changes

yuanchen8911 merged commit ceffa8c into NVIDIA:main Apr 21, 2026
33 checks passed

yuanchen8911 mentioned this pull request Apr 23, 2026

[EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

script(bundler): clarify deploy.sh completion != readiness#609

script(bundler): clarify deploy.sh completion != readiness#609
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/deploy-sh-readiness-messaging

yuanchen8911 commented Apr 17, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 17, 2026

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Reviews paused

Walkthrough

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

yuanchen8911 commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

yuanchen8911 commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Apr 17, 2026

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yuanchen8911 commented Apr 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuanchen8911 commented Apr 17, 2026 •

edited

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading