script(bundler): clarify deploy.sh completion != readiness#609
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughClarified CLI and Helm deploy template docs and the generated Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
eef12fd to
0e36d6e
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The success note currently prints an unconditional
"Deployment complete" message that is inaccurate when the script was run with
the --best-effort option; update the deploy.sh.tmpl message logic to detect the
best-effort mode (the CLI flag "--best-effort" or the template variable used to
represent it) and change the text so it does not assert all installs/applies
succeeded — e.g., conditionally append a qualifier like "may have had non-fatal
failures in --best-effort mode" or switch to a neutral phrasing when best-effort
is enabled; locate the existing echo that outputs "NOTE: \"Deployment complete\"
means Helm installs and manifest applies" and modify it accordingly.
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the "Deployment complete" note to explicitly state
that it only means Helm install and manifest apply succeeded, and to add a
sentence about the --best-effort flag: when run with --best-effort the script
may ignore component failures so some components can be left failed or
incomplete and convergence may continue asynchronously; mention that operators
should check component statuses and logs rather than relying solely on the
"Deployment complete" message (refer to the existing note text and the
--best-effort flag in README.md.tmpl to locate where to insert this
clarification).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 8b79f8e6-666d-4ade-8bff-07803e54972d
📒 Files selected for processing (4)
docs/user/cli-reference.mdpkg/bundler/deployer/helm/helm_test.gopkg/bundler/deployer/helm/templates/README.md.tmplpkg/bundler/deployer/helm/templates/deploy.sh.tmpl
0e36d6e to
0fa7d00
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pkg/bundler/deployer/helm/templates/README.md.tmpl (1)
59-60:⚠️ Potential issue | 🟠 MajorREADME note still guarantees success too strongly.
The note states installs/applies “succeeded,” but
--best-effortcan finish with component failures. The wording should be neutral and direct operators to warning output.Suggested wording update
-> **Note:** "Deployment complete" means Helm installs and manifest applies succeeded, not that the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details. +> **Note:** "Deployment complete" means the deploy script finished its install/apply pass. If `--best-effort` was used, one or more components may have failed (check warning lines/status). This does not mean the cluster is ready for GPU workloads. On fresh GPU nodes, cluster convergence (Skyhook node tuning, GPU operator operand rollout, DRA kubelet plugin registration) continues asynchronously after the script exits. See the [AICR CLI Reference](https://github.com/NVIDIA/aicr/blob/main/docs/user/cli-reference.md#deploy-script-behavior-deploysh) for details.As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/README.md.tmpl` around lines 59 - 60, Update the README note so it does not guarantee success: change the "Deployment complete" wording to a neutral statement that Helm install and manifest apply steps finished but may include component failures when using the --best-effort option; explicitly tell operators to inspect warning output and logs (and refer to the AICR CLI Reference link already present) for any remaining component errors and asynchronous GPU convergence tasks like Skyhook tuning, GPU operator rollout, and DRA kubelet registration.pkg/bundler/deployer/helm/templates/deploy.sh.tmpl (1)
454-456:⚠️ Potential issue | 🟠 MajorFinal NOTE overstates success in
--best-effortruns.When Line 449 records failed components, Lines 454-455 still claim installs/applies “succeeded.” That is inaccurate and can mislead operators during partial-failure deployments.
Suggested fix
echo "Deployment complete." echo -echo "NOTE: \"Deployment complete\" means Helm installs and manifest applies" -echo "succeeded, not that the cluster is ready for GPU workloads. On fresh" +echo "NOTE: \"Deployment complete\" means the deploy script finished its install/apply pass," +if [[ -n "${FAILED_COMPONENTS}" ]]; then + echo "including non-fatal component failures under --best-effort (see WARNING lines above)," +fi +echo "not that the cluster is ready for GPU workloads. On fresh"As per coding guidelines, "Documentation must not invent features, guarantees, timelines, or roadmap commitments. Clearly distinguish current behavior from future intent."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl` around lines 454 - 456, The current NOTE echo (echo "NOTE: \"Deployment complete\" means Helm installs and manifest applies" ...) incorrectly claims success even when the script recorded failures; update the deploy.sh.tmpl so the message is conditional on the recorded failed-components variable (e.g., the variable used where failures are recorded on Line 449, e.g., failed_components/FAILED_COMPONENTS) and either (a) only print the existing "succeeded" wording when no failures were recorded, or (b) print an amended message when failures exist such as "Helm installs and manifest applies were attempted; some components failed (see failed components list) — cluster may be partially deployed and not ready for GPU workloads." Ensure the echo text replaces the existing two-line NOTE block and clearly distinguishes full success vs. partial/failure in --best-effort runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/bundler/deployer/helm/templates/deploy.sh.tmpl`:
- Around line 454-456: The current NOTE echo (echo "NOTE: \"Deployment
complete\" means Helm installs and manifest applies" ...) incorrectly claims
success even when the script recorded failures; update the deploy.sh.tmpl so the
message is conditional on the recorded failed-components variable (e.g., the
variable used where failures are recorded on Line 449, e.g.,
failed_components/FAILED_COMPONENTS) and either (a) only print the existing
"succeeded" wording when no failures were recorded, or (b) print an amended
message when failures exist such as "Helm installs and manifest applies were
attempted; some components failed (see failed components list) — cluster may be
partially deployed and not ready for GPU workloads." Ensure the echo text
replaces the existing two-line NOTE block and clearly distinguishes full success
vs. partial/failure in --best-effort runs.
In `@pkg/bundler/deployer/helm/templates/README.md.tmpl`:
- Around line 59-60: Update the README note so it does not guarantee success:
change the "Deployment complete" wording to a neutral statement that Helm
install and manifest apply steps finished but may include component failures
when using the --best-effort option; explicitly tell operators to inspect
warning output and logs (and refer to the AICR CLI Reference link already
present) for any remaining component errors and asynchronous GPU convergence
tasks like Skyhook tuning, GPU operator rollout, and DRA kubelet registration.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 43cdcec3-fcc0-466d-ad3e-6c7ce4ea21fd
📒 Files selected for processing (4)
docs/user/cli-reference.mdpkg/bundler/deployer/helm/helm_test.gopkg/bundler/deployer/helm/templates/README.md.tmplpkg/bundler/deployer/helm/templates/deploy.sh.tmpl
0fa7d00 to
17a3bcd
Compare
17a3bcd to
dfbc920
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/cli-reference.md`:
- Line 1278: Replace the misspelled flag example string "--best-effrot" with the
correct "--best-effort" in the CLI reference text that explains unknown flag
rejection so readers copying the example won't be confused; search for the
literal "--best-effrot" and update it to "--best-effort".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: ceb90bb7-9ee4-478d-807a-7ee4b632a94d
📒 Files selected for processing (4)
docs/user/cli-reference.mdpkg/bundler/deployer/helm/helm_test.gopkg/bundler/deployer/helm/templates/README.md.tmplpkg/bundler/deployer/helm/templates/deploy.sh.tmpl
…in deploy.sh deploy.sh prints "Deployment complete." once Helm installs and manifest applies succeed, which users can reasonably read as "ready for GPU workloads." For GPU bundles, that is not the same: node tuning (Skyhook), GPU operator operand rollout, and DRA kubelet plugin registration continue asynchronously after the script exits. On fresh GPU nodes that gap can be ~15–20 minutes. Scope of this change (messaging + docs only, per NVIDIA#607): - deploy.sh now keeps "Deployment complete." as the first line (preserves existing CI log matchers) and appends a clear note that install success is not the same as workload readiness, naming the main sources of lag. - cli-reference.md: the --no-wait row is tightened to name what it actually skips (Helm chart-level wait), plus a note paragraph that explicitly distinguishes chart-level wait from bundle-level convergence. - README.md.tmpl: per-bundle README gets the same note after the --no-wait example so bundle consumers see it in context. - helm_test.go: new render test asserts the readiness note is emitted after "Deployment complete." and covers the key phrases. No behavior changes, no new flags, no exit-code changes. Broader readiness design (day-0 bridge Jobs, day-N validator) tracked as follow-ups under NVIDIA#607. Refs: NVIDIA#607
dfbc920 to
6077d8b
Compare
|
@njhensley fixed a typo. can you re-approve? thanks |
Summary
deploy.shnow clearly distinguishes install/apply status from workload readiness in its final output and docs. No flag or behavior changes to the install flow.Motivation / Context
deploy.shcurrently ends with a success-like message that users can reasonably read as "ready for GPU workloads." For GPU bundles that is not the same thing: node tuning (Skyhook), GPU operator operand rollout, and DRA kubelet plugin registration continue asynchronously after the script exits.This PR keeps the scope intentionally narrow to the messaging/docs portion of #607: make the final output accurate, distinguish install/apply completion from workload readiness, and avoid overstating success in
--best-effortpartial-failure runs.This is a deploy-side symmetry gap after the recent undeploy pre/post-flight hardening in #602.
Fixes: #607 (messaging/docs portion only)
Issue: #607
Related: #602, #516
Type of Change
Component(s) Affected
pkg/bundler,pkg/component/*)docs/,examples/)Implementation Notes
Scope is intentionally narrow (per the revised issue #607):
deploy.sh.tmpl: keep the full-success status line (Deployment complete.), add a distinct partial-failure status line for--best-effortruns (Deployment completed with non-fatal errors (--best-effort).), and use a shared neutral note below both paths explaining that install/apply status is not workload readiness.docs/user/cli-reference.md: tighten the--no-waitrow to name what it actually skips (Helm chart-level wait where AICR uses it) and add a note paragraph explicitly distinguishing chart-level wait,--best-effortpartial failures, and bundle-level convergence.README.md.tmpl: per-bundle README gets the same readiness note after the--no-waitexample so bundle consumers see it in context.helm_test.go: render test covers both final status lines plus the shared readiness note wording.Broader readiness design (day-0 bridge Jobs under #516, day-N re-verification possibly in the validator) is tracked as follow-ups on #607. This PR is intentionally not choosing between those longer-term designs.
Testing
Sample of rendered
deploy.shtail after this change:Risk Assessment
Deployment complete.;--best-effortpartial-failure runs now print a more accurate status line.Rollout notes: None — backward-compatible. Operators parsing deploy output should treat the final status line as install/apply status only, not workload readiness.
Checklist
make testwith-race)make lint)git commit -S)