fix(bundler): fix undeploy PVC ordering, harden deploy scripts, add deployment docs#282
Merged
yuanchen8911 merged 1 commit intoMar 5, 2026
Conversation
e9440ae to
e8873d2
Compare
mchmarny
previously approved these changes
Mar 4, 2026
e8873d2 to
d1fa0b9
Compare
fa7ec26 to
238d6f5
Compare
d3f9f58 to
2a6f100
Compare
…en deploy error handling Undeploy fixes: - Move PVC deletion and namespace cleanup to a post-loop phase so all components in a shared namespace are fully uninstalled before cleanup. This prevents hangs caused by kubernetes.io/pvc-protection finalizers when StatefulSet owners are still running. - Add --delete-pvcs flag (default off) to make PVC deletion opt-in, preserving historical data (e.g., Prometheus metrics) across redeploys. - Skip PVC deletion in protected namespaces (kube-system, etc.). - Only Helm/Kustomize-managed components trigger namespace cleanup, excluding manifest-only components from namespace deletion scope. Deploy fixes: - Surface real kubectl failures in pre-install manifest apply instead of masking all errors with || true. CRD-race warnings are still suppressed. - Reject unknown CLI flags with an error instead of silently ignoring them. Docs: - Add Deploy/Undeploy Script Behavior reference sections to CLI docs covering CRD ordering, async components, PVC preservation, shared namespace ordering, stuck release handling, and webhook cleanup. - Add doc pointers in generated script headers linking to CLI Reference. - Add node labels/taints reference to cuj2-demo.md. - Update bundle command examples and workload manifests to use dedicated=gpu-workload / dedicated=system-workload naming convention. Signed-off-by: Yuan Chen <[email protected]>
2a6f100 to
234894d
Compare
mchmarny
approved these changes
Mar 5, 2026
This was referenced Apr 17, 2026
6 tasks
25 tasks
12 tasks
Contributor
|
This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix undeploy script PVC deletion ordering that causes namespace termination hangs, make PVC deletion opt-in, harden deploy script error handling, and add comprehensive deployment script reference documentation.
Motivation / Context
During redeployment of a dynamo inference bundle on EKS, the undeploy script hung indefinitely. Root cause: when multiple components share a namespace (e.g.,
monitoringwithprometheus-adapter,k8s-ephemeral-storage-metrics,kube-prometheus-stack), PVC deletion ran after each component uninstall. The StatefulSet owner (kube-prometheus-stack) was uninstalled last, so earlier PVC deletions hung onkubernetes.io/pvc-protectionfinalizers.Additionally, the deploy script silently masked all pre-install manifest errors, accepted unknown CLI flags without warning, and the deployment script behaviors were undocumented — making it difficult for users to succeed without reading the script internals.
Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
Undeploy (
undeploy.sh.tmpl+helm.go)--delete-pvcsflag (default: off). PVCs are preserved by default to retain historical data (Prometheus metrics, etc.). EBS AZ-binding issues are situational and better handled reactively.kube-system,kube-public,kube-node-lease,default).Namespacesfield toundeployTemplateDataanduniqueNamespaces()helper.Deploy (
deploy.sh.tmpl)error:,forbidden,denied,timed out, etc.) and only suppresses known CRD-race patterns (no matches for kind,ensure CRDs are installed first). Informational lines (created,configured) no longer cause false failures.Documentation
docs/user/cli-reference.mdandsite/content/docs/user/cli-reference.md, covering:--best-effortand async component behavior--delete-pvcsflagdeploy.shandundeploy.shscript headers linking to the CLI Reference, so users who run the scripts directly can find detailed behavior docs.demos/cuj2-demo.md.demos/workloads/inference/vllm-agg.yamlto usededicated=gpu-workload/dedicated=system-workloadnaming convention.Testing
go test -race ./pkg/bundler/deployer/helm/...--delete-pvcsflag behavior, protected namespace exclusion, anduniqueNamespaces().undeploy.shno longer has PVC deletion in per-component loop.Risk Assessment
Rollout notes: The
--delete-pvcsflag is a new opt-in behavior. Existing users who relied on automatic PVC deletion during undeploy will need to add--delete-pvcsto their undeploy commands. This is intentional — preserving data by default is the safer choice.Checklist
make testwith-race)make lint) — golangci-lint version mismatch (pre-existing)git commit -S)