Skip to content

fix(bundler): fix undeploy PVC ordering, harden deploy scripts, add deployment docs#282

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/deploy-undeploy-scripts
Mar 5, 2026
Merged

fix(bundler): fix undeploy PVC ordering, harden deploy scripts, add deployment docs#282
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/deploy-undeploy-scripts

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Mar 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Fix undeploy script PVC deletion ordering that causes namespace termination hangs, make PVC deletion opt-in, harden deploy script error handling, and add comprehensive deployment script reference documentation.

Motivation / Context

During redeployment of a dynamo inference bundle on EKS, the undeploy script hung indefinitely. Root cause: when multiple components share a namespace (e.g., monitoring with prometheus-adapter, k8s-ephemeral-storage-metrics, kube-prometheus-stack), PVC deletion ran after each component uninstall. The StatefulSet owner (kube-prometheus-stack) was uninstalled last, so earlier PVC deletions hung on kubernetes.io/pvc-protection finalizers.

Additionally, the deploy script silently masked all pre-install manifest errors, accepted unknown CLI flags without warning, and the deployment script behaviors were undocumented — making it difficult for users to succeed without reading the script internals.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Undeploy (undeploy.sh.tmpl + helm.go)

  • PVC deletion and namespace cleanup moved from per-component loop to a post-loop phase iterating over unique namespaces. This ensures all components in a shared namespace are fully uninstalled before any PVC/namespace cleanup.
  • Added --delete-pvcs flag (default: off). PVCs are preserved by default to retain historical data (Prometheus metrics, etc.). EBS AZ-binding issues are situational and better handled reactively.
  • PVC deletion skips protected namespaces (kube-system, kube-public, kube-node-lease, default).
  • Namespace cleanup scoped to Helm/Kustomize-managed components only — manifest-only components no longer trigger namespace deletion.
  • Added Namespaces field to undeployTemplateData and uniqueNamespaces() helper.

Deploy (deploy.sh.tmpl)

  • Pre-install manifest apply: classifies error-like output lines (error:, forbidden, denied, timed out, etc.) and only suppresses known CRD-race patterns (no matches for kind, ensure CRDs are installed first). Informational lines (created, configured) no longer cause false failures.
  • Unknown CLI flags now error instead of being silently ignored.

Documentation

  • Added "Deploy Script Behavior" and "Undeploy Script Behavior" reference sections to docs/user/cli-reference.md and site/content/docs/user/cli-reference.md, covering:
    • Pre-install manifest CRD ordering and error filtering
    • --best-effort and async component behavior
    • PVC preservation semantics and --delete-pvcs flag
    • Shared namespace ordering rationale
    • Stuck release and orphaned webhook handling
  • Added doc pointers in generated deploy.sh and undeploy.sh script headers linking to the CLI Reference, so users who run the scripts directly can find detailed behavior docs.
  • Added node labels/taints reference section to demos/cuj2-demo.md.
  • Updated bundle command examples and demos/workloads/inference/vllm-agg.yaml to use dedicated=gpu-workload / dedicated=system-workload naming convention.

Testing

go test -race ./pkg/bundler/deployer/helm/...
  • All bundler tests pass including new tests for --delete-pvcs flag behavior, protected namespace exclusion, and uniqueNamespaces().
  • Verified generated undeploy.sh no longer has PVC deletion in per-component loop.
  • Live tested on EKS cluster: undeploy completed without hangs on shared monitoring namespace.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: The --delete-pvcs flag is a new opt-in behavior. Existing users who relied on automatic PVC deletion during undeploy will need to add --delete-pvcs to their undeploy commands. This is intentional — preserving data by default is the safer choice.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint) — golangci-lint version mismatch (pre-existing)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

mchmarny
mchmarny previously approved these changes Mar 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-scripts branch from e8873d2 to d1fa0b9 Compare March 4, 2026 23:36
@github-actions github-actions Bot added size/L and removed size/M labels Mar 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-scripts branch 2 times, most recently from fa7ec26 to 238d6f5 Compare March 4, 2026 23:45
@yuanchen8911 yuanchen8911 changed the title fix(bundler): fix undeploy PVC ordering, add --delete-pvcs flag, harden deploy scripts fix(bundler): fix undeploy PVC ordering, harden deploy scripts, add deployment docs Mar 4, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-scripts branch 2 times, most recently from d3f9f58 to 2a6f100 Compare March 5, 2026 00:40
…en deploy error handling

Undeploy fixes:
- Move PVC deletion and namespace cleanup to a post-loop phase so all
  components in a shared namespace are fully uninstalled before cleanup.
  This prevents hangs caused by kubernetes.io/pvc-protection finalizers
  when StatefulSet owners are still running.
- Add --delete-pvcs flag (default off) to make PVC deletion opt-in,
  preserving historical data (e.g., Prometheus metrics) across redeploys.
- Skip PVC deletion in protected namespaces (kube-system, etc.).
- Only Helm/Kustomize-managed components trigger namespace cleanup,
  excluding manifest-only components from namespace deletion scope.

Deploy fixes:
- Surface real kubectl failures in pre-install manifest apply instead of
  masking all errors with || true. CRD-race warnings are still suppressed.
- Reject unknown CLI flags with an error instead of silently ignoring them.

Docs:
- Add Deploy/Undeploy Script Behavior reference sections to CLI docs
  covering CRD ordering, async components, PVC preservation, shared
  namespace ordering, stuck release handling, and webhook cleanup.
- Add doc pointers in generated script headers linking to CLI Reference.
- Add node labels/taints reference to cuj2-demo.md.
- Update bundle command examples and workload manifests to use
  dedicated=gpu-workload / dedicated=system-workload naming convention.

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-scripts branch from 2a6f100 to 234894d Compare March 5, 2026 00:46
@yuanchen8911 yuanchen8911 requested a review from mchmarny March 5, 2026 01:10
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 3, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants