Skip to content

script(bundler): clarify deploy.sh completion != readiness#609

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/deploy-sh-readiness-messaging
Apr 21, 2026
Merged

script(bundler): clarify deploy.sh completion != readiness#609
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/deploy-sh-readiness-messaging

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

deploy.sh now clearly distinguishes install/apply status from workload readiness in its final output and docs. No flag or behavior changes to the install flow.

Motivation / Context

deploy.sh currently ends with a success-like message that users can reasonably read as "ready for GPU workloads." For GPU bundles that is not the same thing: node tuning (Skyhook), GPU operator operand rollout, and DRA kubelet plugin registration continue asynchronously after the script exits.

This PR keeps the scope intentionally narrow to the messaging/docs portion of #607: make the final output accurate, distinguish install/apply completion from workload readiness, and avoid overstating success in --best-effort partial-failure runs.

This is a deploy-side symmetry gap after the recent undeploy pre/post-flight hardening in #602.

Fixes: #607 (messaging/docs portion only)
Issue: #607
Related: #602, #516

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)
  • Docs/examples (docs/, examples/)

Implementation Notes

Scope is intentionally narrow (per the revised issue #607):

  • deploy.sh.tmpl: keep the full-success status line (Deployment complete.), add a distinct partial-failure status line for --best-effort runs (Deployment completed with non-fatal errors (--best-effort).), and use a shared neutral note below both paths explaining that install/apply status is not workload readiness.
  • docs/user/cli-reference.md: tighten the --no-wait row to name what it actually skips (Helm chart-level wait where AICR uses it) and add a note paragraph explicitly distinguishing chart-level wait, --best-effort partial failures, and bundle-level convergence.
  • README.md.tmpl: per-bundle README gets the same readiness note after the --no-wait example so bundle consumers see it in context.
  • helm_test.go: render test covers both final status lines plus the shared readiness note wording.

Broader readiness design (day-0 bridge Jobs under #516, day-N re-verification possibly in the validator) is tracked as follow-ups on #607. This PR is intentionally not choosing between those longer-term designs.

Testing

go test -race ./pkg/bundler/deployer/helm/...
# ok  github.com/NVIDIA/aicr/pkg/bundler/deployer/helm

# Earlier full `make qualify` passed on this PR before the final
# best-effort wording refinement; this follow-up reran the package test
# after updating the script/docs/test text.

Sample of rendered deploy.sh tail after this change:

if [[ -n "${FAILED_COMPONENTS}" ]]; then
  echo "WARNING: the following components failed:${FAILED_COMPONENTS}"
  echo "Deployment completed with non-fatal errors (--best-effort)."
else
  echo "Deployment complete."
fi

echo "NOTE: The above status reflects Helm install and manifest apply results,"
echo "not whether the cluster is ready for GPU workloads. On fresh"
echo "GPU nodes, cluster convergence may continue asynchronously after this"

Risk Assessment

  • Low — Messaging and docs only. No flag changes, no exit-code changes, no install-path changes. The full-success path still prints Deployment complete.; --best-effort partial-failure runs now print a more accurate status line.

Rollout notes: None — backward-compatible. Operators parsing deploy output should treat the final status line as install/apply status only, not workload readiness.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Clarified CLI and Helm deploy template docs and the generated deploy.sh to state that --no-wait skips Helm chart-level waiting only where AICR would pass --wait, while hook --timeout behavior remains. The deploy script now conditionally prints either a success line or a “completed with non-fatal errors (--best-effort)” line and includes a multi-line NOTE distinguishing Helm install/manifest-apply completion from full GPU workload readiness, listing asynchronous convergence items (e.g., Skyhook tuning, GPU operator operand rollout, NVIDIA DRA kubelet plugin registration) and linking to the CLI reference. Added a unit test verifying the final messages and ordering in deploy.sh.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ❓ Inconclusive This PR addresses issue #607's messaging/docs portion only. The PR explicitly states it keeps scope narrow to messaging and documentation, fixing the partial objective of distinguishing install completion from workload readiness. However, the linked issue #607 requires comprehensive validation checks (Skyhook CR completion, DRA kubelet-plugin readiness, GPU Operator ClusterPolicy state, and deployer-neutral API implementation), which are out of scope for this PR—they are tracked as follow-ups. Clarify in the PR description or linked issue whether this messaging PR is intended as a complete fix for #607 or if implementation of the validation checks is deferred. If deferred, ensure follow-up issues track the validation implementation work.
✅ Passed checks (3 passed)
Check name Status Explanation
Description check ✅ Passed The description is well-related to the changeset, clearly explaining the motivation (distinguishing install/apply status from workload readiness), scope (messaging/docs only), and changes across multiple files (deploy.sh.tmpl, CLI reference, README, and tests).
Out of Scope Changes check ✅ Passed All changes are directly aligned with the stated scope: documentation updates (cli-reference.md, README.md.tmpl, deploy.sh.tmpl) and test updates for the clarified messaging. No behavioral, flag, or exit-code changes introduce out-of-scope modifications.
Title check ✅ Passed The title accurately and concisely summarizes the main change: clarifying that deploy.sh completion does not equal readiness for GPU workloads, which is the core purpose across all modified files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The success note currently prints an unconditional
"Deployment complete" message that is inaccurate when the script was run with
the --best-effort option; update the deploy.sh.tmpl message logic to detect the
best-effort mode (the CLI flag "--best-effort" or the template variable used to
represent it) and change the text so it does not assert all installs/applies
succeeded — e.g., conditionally append a qualifier like "may have had non-fatal
failures in --best-effort mode" or switch to a neutral phrasing when best-effort
is enabled; locate the existing echo that outputs "NOTE: \"Deployment complete\"
means Helm installs and manifest applies" and modify it accordingly.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the "Deployment complete" note to explicitly state
that it only means Helm install and manifest apply succeeded, and to add a
sentence about the --best-effort flag: when run with --best-effort the script
may ignore component failures so some components can be left failed or
incomplete and convergence may continue asynchronously; mention that operators
should check component statuses and logs rather than relying solely on the
"Deployment complete" message (refer to the existing note text and the
--best-effort flag in README.md.tmpl to locate where to insert this
clarification).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 8b79f8e6-666d-4ade-8bff-07803e54972d

📥 Commits

Reviewing files that changed from the base of the PR and between eef12fd and 0e36d6e.

📒 Files selected for processing (4)
  • docs/user/cli-reference.md
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/templates/deploy.sh.tmpl Outdated
Comment thread pkg/bundler/deployer/helm/templates/README.md.tmpl Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 0e36d6e to 0fa7d00 Compare April 17, 2026 19:50

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/templates/README.md.tmpl (1)

59-60: ⚠️ Potential issue | 🟠 Major

README note still guarantees success too strongly.

The note states installs/applies “succeeded,” but --best-effort can finish with component failures. The wording should be neutral and direct operators to warning output.

Suggested wording update
-> **Note:** "Deployment complete" means Helm installs and manifest applies succeeded, not that the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details.
+> **Note:** "Deployment complete" means the deploy script finished its install/apply pass. If `--best-effort` was used, one or more components may have failed (check warning lines/status). This does not mean the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details.

As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 59 - 60,
Update the README note so it does not guarantee success: change the "Deployment
complete" wording to a neutral statement that Helm install and manifest apply
steps finished but may include component failures when using the --best-effort
option; explicitly tell operators to inspect warning output and logs (and refer
to the AICR CLI Reference link already present) for any remaining component
errors and asynchronous GPU convergence tasks like Skyhook tuning, GPU operator
rollout, and DRA kubelet registration.
pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)

454-456: ⚠️ Potential issue | 🟠 Major

Final NOTE overstates success in --best-effort runs.

When Line 449 records failed components, Lines 454-455 still claim installs/applies “succeeded.” That is inaccurate and can mislead operators during partial-failure deployments.

Suggested fix
 echo "Deployment complete."
 echo
-echo "NOTE: \"Deployment complete\" means Helm installs and manifest applies"
-echo "succeeded, not that the cluster is ready for GPU workloads. On fresh"
+echo "NOTE: \"Deployment complete\" means the deploy script finished its install/apply pass,"
+if [[ -n "${FAILED_COMPONENTS}" ]]; then
+  echo "including non-fatal component failures under --best-effort (see WARNING lines above),"
+fi
+echo "not that the cluster is ready for GPU workloads. On fresh"

As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 454 - 456,
The current NOTE echo (echo "NOTE: \"Deployment complete\" means Helm installs
and manifest applies" ...) incorrectly claims success even when the script
recorded failures; update the deploy.sh.tmpl so the message is conditional on
the recorded failed-components variable (e.g., the variable used where failures
are recorded on Line 449, e.g., failed_components/FAILED_COMPONENTS) and either
(a) only print the existing "succeeded" wording when no failures were recorded,
or (b) print an amended message when failures exist such as "Helm installs and
manifest applies were attempted; some components failed (see failed components
list) — cluster may be partially deployed and not ready for GPU workloads."
Ensure the echo text replaces the existing two-line NOTE block and clearly
distinguishes full success vs. partial/failure in --best-effort runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The current NOTE echo (echo "NOTE: \"Deployment
complete\" means Helm installs and manifest applies" ...) incorrectly claims
success even when the script recorded failures; update the deploy.sh.tmpl so the
message is conditional on the recorded failed-components variable (e.g., the
variable used where failures are recorded on Line 449, e.g.,
failed_components/FAILED_COMPONENTS) and either (a) only print the existing
"succeeded" wording when no failures were recorded, or (b) print an amended
message when failures exist such as "Helm installs and manifest applies were
attempted; some components failed (see failed components list) — cluster may be
partially deployed and not ready for GPU workloads." Ensure the echo text
replaces the existing two-line NOTE block and clearly distinguishes full success
vs. partial/failure in --best-effort runs.

In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the README note so it does not guarantee success:
change the "Deployment complete" wording to a neutral statement that Helm
install and manifest apply steps finished but may include component failures
when using the --best-effort option; explicitly tell operators to inspect
warning output and logs (and refer to the AICR CLI Reference link already
present) for any remaining component errors and asynchronous GPU convergence
tasks like Skyhook tuning, GPU operator rollout, and DRA kubelet registration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 43cdcec3-fcc0-466d-ad3e-6c7ce4ea21fd

📥 Commits

Reviewing files that changed from the base of the PR and between 0e36d6e and 0fa7d00.

📒 Files selected for processing (4)
  • docs/user/cli-reference.md
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 0fa7d00 to 17a3bcd Compare April 17, 2026 20:07
@yuanchen8911 yuanchen8911 changed the title fix(bundler): distinguish install completion from workload readiness in deploy.sh docs(bundler): clarify deploy.sh install vs readiness status Apr 21, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from 17a3bcd to dfbc920 Compare April 21, 2026 00:38
@yuanchen8911 yuanchen8911 changed the title docs(bundler): clarify deploy.sh install vs readiness status script(bundler): clarify deploy.sh completion vs readiness Apr 21, 2026
@yuanchen8911 yuanchen8911 changed the title script(bundler): clarify deploy.sh completion vs readiness script(bundler): clarify deploy.sh completion != readiness Apr 21, 2026
@yuanchen8911 yuanchen8911 requested a review from dims April 21, 2026 00:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1278: Replace the misspelled flag example string "--best-effrot" with the
correct "--best-effort" in the CLI reference text that explains unknown flag
rejection so readers copying the example won't be confused; search for the
literal "--best-effrot" and update it to "--best-effort".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ceb90bb7-9ee4-478d-807a-7ee4b632a94d

📥 Commits

Reviewing files that changed from the base of the PR and between 17a3bcd and dfbc920.

📒 Files selected for processing (4)
  • docs/user/cli-reference.md
  • pkg/bundler/deployer/helm/helm_test.go
  • pkg/bundler/deployer/helm/templates/README.md.tmpl
  • pkg/bundler/deployer/helm/templates/deploy.sh.tmpl

Comment thread docs/user/cli-reference.md Outdated
njhensley
njhensley previously approved these changes Apr 21, 2026
…in deploy.sh

deploy.sh prints "Deployment complete." once Helm installs and manifest
applies succeed, which users can reasonably read as "ready for GPU
workloads." For GPU bundles, that is not the same: node tuning (Skyhook),
GPU operator operand rollout, and DRA kubelet plugin registration continue
asynchronously after the script exits. On fresh GPU nodes that gap can be
~15–20 minutes.

Scope of this change (messaging + docs only, per NVIDIA#607):
- deploy.sh now keeps "Deployment complete." as the first line (preserves
  existing CI log matchers) and appends a clear note that install success
  is not the same as workload readiness, naming the main sources of lag.
- cli-reference.md: the --no-wait row is tightened to name what it
  actually skips (Helm chart-level wait), plus a note paragraph that
  explicitly distinguishes chart-level wait from bundle-level convergence.
- README.md.tmpl: per-bundle README gets the same note after the
  --no-wait example so bundle consumers see it in context.
- helm_test.go: new render test asserts the readiness note is emitted
  after "Deployment complete." and covers the key phrases.

No behavior changes, no new flags, no exit-code changes. Broader
readiness design (day-0 bridge Jobs, day-N validator) tracked as
follow-ups under NVIDIA#607.

Refs: NVIDIA#607
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-sh-readiness-messaging branch from dfbc920 to 6077d8b Compare April 21, 2026 02:10
@yuanchen8911 yuanchen8911 requested a review from njhensley April 21, 2026 02:10
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 21, 2026 02:10
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@njhensley fixed a typo. can you re-approve? thanks

@yuanchen8911 yuanchen8911 merged commit ceffa8c into NVIDIA:main Apr 21, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance deployment-phase validation to reflect post-install workload readiness

3 participants