fix(bundler): improve deploy/undeploy script reliability#253
Merged
mchmarny merged 1 commit intoMar 2, 2026
Merged
Conversation
69b66e1 to
8e7d85f
Compare
deploy.sh: - Apply component manifests both before and after helm install. Pre-install handles ConfigMaps (e.g., gpu-operator dcgm-exporter), post-install handles CRD-dependent resources (e.g., kai-scheduler Config CR patch). Pre-install filters CRD-missing errors specifically. - Namespace creation and manifest apply respect --best-effort flag via || helm_failed instead of exiting on error. undeploy.sh: - Add helm_force_uninstall() that tries normal uninstall first, then retries with --no-hooks if it fails (handles pending-install state from interrupted deploys without fragile status parsing) - Add --timeout flag for helm uninstall (default 120s) with validation - Delete cluster-scoped webhooks and CRDs owned by each Helm release using Helm label/annotation matching - Delete orphaned webhooks scoped to component namespace whose backing service returns explicit NotFound (skips transient API errors) - Guard all jq usage behind availability check with graceful fallback - Use --wait=false on namespace deletion to avoid blocking the script - Force-clear finalizers on stuck Terminating namespaces using full api-resources sweep to catch CRs - Tolerate missing CRDs when deleting manifests Signed-off-by: Yuan Chen <[email protected]>
8e7d85f to
b16765f
Compare
This was referenced Apr 17, 2026
6 tasks
25 tasks
12 tasks
Contributor
|
This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix deploy and undeploy script templates to prevent deadlocks, handle interrupted deploys, clean up orphaned cluster-scoped resources, and respect --best-effort semantics.
Motivation / Context
Multiple issues discovered during dynamo inference deployment on EKS:
helm install --waitblocks on pods that need pre-existing resources (e.g., gpu-operator dcgm-exporter ConfigMap applied after helm completes)pending-installstate can't be cleaned up by normalhelm uninstallFixes: N/A
Related: N/A
Type of Change
Component(s) Affected
pkg/bundler,pkg/component/*)Implementation Notes
deploy.sh.tmpl:
--best-effortvia|| helm_failedundeploy.sh.tmpl:
helm_force_uninstall(): tries normal uninstall first, retries with--no-hooksif it fails (handles pending-install state without fragile status parsing)--timeoutflag (default 120s) with input validationdelete_release_cluster_resources(): deletes Helm-labeled webhooks and CRDs after each component uninstalldelete_orphaned_webhooks_for_ns(): finds webhooks referencing services in the component namespace, deletes only if namespace or service explicitly returns NotFound (skips transient API errors). Scoped to component namespace to avoid touching unrelated platform webhooks.--wait=falseon namespace deletion; webhook/finalizer listing pipeline wrapped in{ ... || true; }for resilienceforce_clear_namespace_finalizers(): fullapi-resourcessweep (not justkubectl get all) to catch CRs that block namespace terminationTesting
go test -race ./pkg/bundler/deployer/helm/...All helm deployer tests pass.
Validated end-to-end on two EKS clusters:
eidos-validation-2-11: Full undeploy/deploy cycle, all 16 components, no hangsktsetfavua-dgxc-k8s-aws-use1-non-prod: Recovered stuck pending-install release, orphaned kai-scheduler webhooks cleaned up, CNCF conformance 8/8 passTested edge cases:
Risk Assessment
Rollout notes: Only affects newly generated bundles. Existing bundles retain their current scripts.
Checklist
make testwith-race)make lint)git commit -S)