fix(bundler): improve deploy/undeploy script reliability #253

Merged
mchmarny merged 1 commit into NVIDIA:main from yuanchen8911:fix/deploy-undeploy-script-improvements on Mar 2, 2026

@yuanchen8911 yuanchen8911 commented Feb 28, 2026

Summary

Fix deploy and undeploy script templates to prevent deadlocks, handle interrupted deploys, clean up orphaned cluster-scoped resources, and respect --best-effort semantics.

Motivation / Context

Multiple issues discovered during dynamo inference deployment on EKS:

  1. deploy.sh deadlock: helm install --wait blocks on pods that depend on resources the script has not created yet (e.g., the gpu-operator dcgm-exporter ConfigMap, which was applied only after helm completed)
  2. undeploy.sh hangs: Slow helm cleanup hooks (gpu-operator, skyhook-operator) block indefinitely; stale webhooks and CRDs survive namespace deletion and block subsequent deployments
  3. Interrupted deploys: Helm releases left in pending-install state can't be cleaned up by normal helm uninstall
  4. Orphaned webhooks: Operator-created webhooks (kai-scheduler admission) without Helm labels block pod creation after their service namespace is deleted
  5. --best-effort not honored: Namespace creation and manifest apply failures exit immediately instead of continuing

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

deploy.sh.tmpl:

  • Apply component manifests both before and after helm install. Pre-install handles ConfigMaps (e.g., gpu-operator dcgm-exporter), post-install handles CRD-dependent resources (e.g., kai-scheduler Config CR patch). Pre-install filters CRD-missing errors specifically.
  • All per-component steps (namespace create, manifest apply) respect --best-effort via || helm_failed
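
The two deploy-side behaviors above can be sketched roughly as follows. This is a minimal illustration in portable shell, not the template's actual code: only the name helm_failed comes from the description, while its signature, the BEST_EFFORT and FAILED_COMPONENTS variables, and apply_manifests_pre_install are assumptions made for the example. The filter relies on the "no matches for kind" wording that kubectl prints when a CRD is not installed.

```shell
#!/bin/sh
# Sketch only: variable names and function signatures are assumptions.
BEST_EFFORT="${BEST_EFFORT:-false}"
FAILED_COMPONENTS=""

# Record a failed step; abort unless best-effort mode is on.
helm_failed() {
  if [ "${BEST_EFFORT}" = "true" ]; then
    echo "WARN: $1 failed, continuing (--best-effort)" >&2
    FAILED_COMPONENTS="${FAILED_COMPONENTS} $1"
    return 0
  fi
  echo "ERROR: $1 failed" >&2
  exit 1
}

# Pre-install apply: tolerate only errors caused by CRDs that do not exist yet.
apply_manifests_pre_install() {
  local dir="$1" out remaining
  [ -d "${dir}" ] || return 0
  if out="$(kubectl apply -f "${dir}" 2>&1)"; then
    return 0
  fi
  # Drop the lines kubectl prints for unknown kinds; anything that still
  # starts with an error prefix is treated as a real failure.
  remaining="$(printf '%s\n' "${out}" |
    grep -v -E 'no matches for kind|ensure CRDs are installed first' || true)"
  if printf '%s\n' "${remaining}" | grep -q -E '^(error:|Error from server)'; then
    printf '%s\n' "${out}" >&2
    return 1
  fi
  return 0
}

# Intended call pattern inside the per-component loop:
#   kubectl create namespace "${ns}" --dry-run=client -o yaml \
#     | kubectl apply -f - || helm_failed "${component}"
#   apply_manifests_pre_install "${dir}" || helm_failed "${component}"
```

Resources skipped by the pre-install pass because their CRD is missing are expected to be picked up by the post-install apply, once helm has installed the CRDs.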

undeploy.sh.tmpl:

  • helm_force_uninstall(): tries normal uninstall first, retries with --no-hooks if it fails (handles pending-install state without fragile status parsing)
  • --timeout flag (default 120s) with input validation
  • delete_release_cluster_resources(): deletes Helm-labeled webhooks and CRDs after each component uninstall
  • delete_orphaned_webhooks_for_ns(): finds webhooks referencing services in the component namespace, deletes only if namespace or service explicitly returns NotFound (skips transient API errors). Scoped to component namespace to avoid touching unrelated platform webhooks.
  • All jq usage guarded behind availability check with graceful fallback
  • --wait=false on namespace deletion; webhook/finalizer listing pipeline wrapped in { ... || true; } for resilience
  • force_clear_namespace_finalizers(): full api-resources sweep (not just kubectl get all) to catch CRs that block namespace termination
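
The uninstall-with-fallback and timeout validation described above might look something like the sketch below. The function name helm_force_uninstall and the 120s default come from the description; the argument order, the HELM_TIMEOUT variable, and validate_timeout are assumptions for illustration. It uses only the --timeout and --no-hooks flags of helm uninstall.

```shell
#!/bin/sh
# Sketch only: argument order and variable names are assumptions.
HELM_TIMEOUT="${HELM_TIMEOUT:-120s}"  # default taken from the PR description

# Accept only duration strings such as 90s, 5m or 1h30m.
validate_timeout() {
  printf '%s\n' "$1" | grep -q -E '^([0-9]+[smh])+$'
}

# Try a normal uninstall first; if it fails (for example a release stuck in
# pending-install, or a hook that never finishes), retry without hooks.
helm_force_uninstall() {
  local release="$1" namespace="$2"
  if helm uninstall "${release}" -n "${namespace}" --timeout "${HELM_TIMEOUT}"; then
    return 0
  fi
  echo "WARN: uninstall of ${release} failed, retrying with --no-hooks" >&2
  helm uninstall "${release}" -n "${namespace}" --no-hooks --timeout "${HELM_TIMEOUT}"
}

# Example usage:
#   validate_timeout "${HELM_TIMEOUT}" || { echo "invalid --timeout" >&2; exit 1; }
#   helm_force_uninstall gpu-operator gpu-operator
```

Retrying on any failure, rather than inspecting the release status first, is what lets the script avoid parsing helm status output.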

Testing

go test -race ./pkg/bundler/deployer/helm/...

All helm deployer tests pass.

Validated end-to-end on two EKS clusters:

  • eidos-validation-2-11: Full undeploy/deploy cycle, all 16 components, no hangs
  • ktsetfavua-dgxc-k8s-aws-use1-non-prod: Recovered stuck pending-install release, orphaned kai-scheduler webhooks cleaned up, CNCF conformance 8/8 pass
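
For readers unfamiliar with the orphaned-webhook case exercised on the second cluster, the check can be sketched as below. This is an illustrative reconstruction, not the merged template: delete_orphaned_webhooks_for_ns is the name used in the description, while is_not_found and the jq filter are assumptions. The key point is that a webhook is deleted only when the API server explicitly answers NotFound for its namespace or backing service; any other failure is treated as transient and leaves the webhook in place.

```shell
#!/bin/sh
# Sketch only: is_not_found and the jq filter are assumptions.

# Succeeds only when the API server explicitly answers NotFound.
is_not_found() {
  local out
  if out="$(kubectl get "$@" -o name 2>&1)"; then
    return 1  # the object exists
  fi
  case "${out}" in
    *NotFound*) return 0 ;;
    *) return 1 ;;  # timeout, auth error, etc.: treat as transient
  esac
}

# Delete webhook configurations whose backing service lived in namespace $1
# and is now gone. Webhooks pointing at other namespaces are never touched.
delete_orphaned_webhooks_for_ns() {
  local ns="$1" kind name svc
  command -v jq >/dev/null 2>&1 || return 0  # no jq: skip quietly
  for kind in validatingwebhookconfigurations mutatingwebhookconfigurations; do
    { kubectl get "${kind}" -o json 2>/dev/null || true; } |
      jq -r --arg ns "${ns}" '
        .items[]?
        | .metadata.name as $n
        | .webhooks[]?.clientConfig.service
        | select(. != null and .namespace == $ns)
        | "\($n) \(.name)"' |
      sort -u |
      while read -r name svc; do
        if is_not_found namespace "${ns}" ||
           is_not_found service "${svc}" -n "${ns}"; then
          kubectl delete "${kind}" "${name}" --ignore-not-found
        fi
      done
  done
}
```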

Tested edge cases:

  • Undeploy without jq: completes cleanly, skips cluster-resource cleanup
  • Stuck namespace with CR finalizers: full api-resources sweep patches finalizers, namespace terminates
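
Both edge cases can be illustrated with a short sketch. The name force_clear_namespace_finalizers comes from the description; have_jq and cleanup_cluster_resources are hypothetical helpers invented here, and the real template may structure the guard differently. The sweep enumerates every listable namespaced resource type via kubectl api-resources, so custom resources that kubectl get all omits are also covered.

```shell
#!/bin/sh
# Sketch only: have_jq and cleanup_cluster_resources are hypothetical helpers.

have_jq() { command -v jq >/dev/null 2>&1; }

# Run a jq-dependent cleanup command only when jq is installed;
# otherwise warn and carry on so the undeploy still completes.
cleanup_cluster_resources() {
  if ! have_jq; then
    echo "WARN: jq not found, skipping cluster-scoped resource cleanup" >&2
    return 0
  fi
  "$@"
}

# Remove finalizers from every object left in a stuck namespace.
force_clear_namespace_finalizers() {
  local ns="$1" kind obj
  # Every listable namespaced type, including custom resources.
  { kubectl api-resources --verbs=list --namespaced -o name 2>/dev/null || true; } |
    while read -r kind; do
      { kubectl get "${kind}" -n "${ns}" -o name 2>/dev/null || true; } |
        while read -r obj; do
          kubectl patch "${obj}" -n "${ns}" --type=merge \
            -p '{"metadata":{"finalizers":null}}' >/dev/null 2>&1 || true
        done
    done
}
```

Clearing finalizers bypasses whatever cleanup the owning controller intended, so a sweep like this is best reserved for namespaces that are already stuck in Terminating.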

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Only affects newly generated bundles. Existing bundles retain their current scripts.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner February 28, 2026 17:11
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-script-improvements branch 4 times, most recently from 69b66e1 to 8e7d85f on February 28, 2026 17:56
deploy.sh:
- Apply component manifests both before and after helm install.
  Pre-install handles ConfigMaps (e.g., gpu-operator dcgm-exporter),
  post-install handles CRD-dependent resources (e.g., kai-scheduler
  Config CR patch). Pre-install filters CRD-missing errors specifically.
- Namespace creation and manifest apply respect --best-effort flag
  via || helm_failed instead of exiting on error.

undeploy.sh:
- Add helm_force_uninstall() that tries normal uninstall first, then
  retries with --no-hooks if it fails (handles pending-install state
  from interrupted deploys without fragile status parsing)
- Add --timeout flag for helm uninstall (default 120s) with validation
- Delete cluster-scoped webhooks and CRDs owned by each Helm release
  using Helm label/annotation matching
- Delete orphaned webhooks scoped to component namespace whose backing
  service returns explicit NotFound (skips transient API errors)
- Guard all jq usage behind availability check with graceful fallback
- Use --wait=false on namespace deletion to avoid blocking the script
- Force-clear finalizers on stuck Terminating namespaces using full
  api-resources sweep to catch CRs
- Tolerate missing CRDs when deleting manifests

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/deploy-undeploy-script-improvements branch from 8e7d85f to b16765f on February 28, 2026 21:44

@mchmarny mchmarny left a comment

/lgtm

github-actions Bot commented Jun 1, 2026

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 1, 2026