fix(bundler): scope cleanup to bundle components and remove stale skyhook taints by yuanchen8911 · Pull Request #477 · NVIDIA/aicr

yuanchen8911 · 2026-04-01T21:11:19Z

Summary

Expand undeploy.sh and deploy.sh pre-flight to handle runtime-created CRDs, webhooks, and node taints that survive Helm uninstall
Fully scope cleanup to the specific bundle being deployed/undeployed to avoid affecting other releases on shared clusters
Fix stale skyhook node taints that block scheduling on redeploy

Changes

undeploy.sh:

Delete resource-policy: keep CRDs matching both Helm release-name AND release-namespace
Scope orphaned CRD groups to operators present in this bundle via Go template conditionals
Force-clear finalizers on CRDs stuck in deleting state
Delete stale webhooks whose backing service no longer exists
Remove skyhook node taints after operator uninstall (supports custom taint keys via runtimeRequiredTaint)

deploy.sh pre-flight:

Scope orphaned CRD group check to match bundle components via Go template conditionals
Detect and remove stale skyhook taints only when skyhook-operator is not already running (avoids stripping legitimate scheduling guards on active clusters)

Scoping: Both templates use Go template conditionals so CRD groups and skyhook cleanup are only emitted for components present in the current recipe. No hardcoded CRD groups or taint keys.

Test plan

go test ./pkg/bundler/deployer/helm/... passes
Generated NIM bundle includes only NIM-relevant CRD groups (no grove.io)
Generated Dynamo bundle includes only Dynamo-relevant CRD groups (no apps.nvidia.com)
Ran undeploy.sh --delete-pvcs on EKS cluster — post-flight reports "cluster is clean"
Verified deploy.sh pre-flight only removes skyhook taints when operator is not running
Verified custom taint key is read from skyhook-operator/values.yaml
Full deploy with deploy.sh completes successfully after undeploy

Closes #474

yuanchen8911 · 2026-04-01T21:11:57Z

Addresses #474. See issue for the full list of stale resources observed before this fix.

…y/undeploy Expand deploy.sh pre-flight and undeploy.sh cleanup to handle runtime-created resources, fully scoped to the specific bundle to avoid affecting other releases on shared clusters. undeploy.sh: - Delete resource-policy:keep CRDs matching both release-name AND release-namespace - Scope orphaned CRD groups to operators in this bundle (fully template-conditional) - Force-clear finalizers on CRDs stuck in deleting state - Delete stale webhooks with missing backing services deploy.sh pre-flight: - Scope orphaned CRD group check to match bundle components (template-conditional) Both templates use Go template conditionals so CRD groups are only emitted for components present in the current recipe (e.g., grove.io only when dynamo-platform is included, apps.nvidia.com only when k8s-nim-operator is). Closes NVIDIA#474

The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper functions delete_release_cluster_resources and force_clear_namespace_finalizers all share the shape: kubectl get ... 2>/dev/null \ | jq -r '...' 2>/dev/null \ | while read -r name; do ... done With set -euo pipefail, a single transient kubectl failure (API 502, brief auth refresh, etc.) causes the pipeline to exit non-zero and terminates the script. In a typical run with ~15 components, this means any 100 ms control-plane hiccup can kill undeploy.sh after Helm uninstalls succeeded but before namespace cleanup, leaving the cluster in a partially-cleaned state that the user must finish by hand. Add a trailing || true to each of the three pipelines so transient API errors during cleanup are tolerated. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds extraction already use this pattern.

The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper functions delete_release_cluster_resources and force_clear_namespace_finalizers all share the shape: kubectl get ... 2>/dev/null \ | jq -r '...' 2>/dev/null \ | while read -r name; do ... done With set -euo pipefail, a single transient kubectl failure (API 502, brief auth refresh, etc.) causes the pipeline to exit non-zero and terminates the script. In a typical run with ~15 components, this means any 100 ms control-plane hiccup can kill undeploy.sh after Helm uninstalls succeeded but before namespace cleanup, leaving the cluster in a partially-cleaned state that the user must finish by hand. Append a warn-on-failure handler to each of the three pipelines so transient API errors during cleanup are tolerated AND visible: the script keeps going into the post-flight verification phase, and operators see a clear stderr warning naming the failing context (kind / release / namespace / component) instead of nothing. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds extraction already use the "tolerate-and-continue" pattern; this PR brings the three remaining sites into line and adds context-rich warnings per CodeRabbit review.

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across #599, #601, #600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was #599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was #601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was #600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: #477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

github-actions · 2026-07-02T07:05:12Z

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

yuanchen8911 requested a review from a team as a code owner April 1, 2026 21:11

yuanchen8911 added bug area/recipes labels Apr 1, 2026

github-actions Bot added area/bundler size/M and removed area/recipes labels Apr 1, 2026

yuanchen8911 mentioned this pull request Apr 1, 2026

fix: undeploy.sh misses runtime-created CRDs, webhooks, and workload namespaces #474

Closed

yuanchen8911 requested a review from mchmarny April 1, 2026 21:21

yuanchen8911 force-pushed the fix/undeploy-cleanup branch 2 times, most recently from 322a708 to 58ef0f9 Compare April 1, 2026 21:49

yuanchen8911 changed the title ~~fix(bundler): scope CRD/webhook cleanup to bundle components in deploy/undeploy~~ fix(bundler): scope cleanup to bundle components and remove stale skyhook taints Apr 1, 2026

yuanchen8911 force-pushed the fix/undeploy-cleanup branch from 58ef0f9 to 6e41c33 Compare April 1, 2026 21:52

yuanchen8911 force-pushed the fix/undeploy-cleanup branch from 6e41c33 to af2278e Compare April 1, 2026 21:55

yuanchen8911 requested a review from lockwobr April 1, 2026 22:00

mchmarny approved these changes Apr 2, 2026

View reviewed changes

Merge branch 'main' into fix/undeploy-cleanup

fa4d23e

mchmarny enabled auto-merge (squash) April 2, 2026 11:27

mchmarny disabled auto-merge April 2, 2026 11:28

mchmarny merged commit ca9d96c into NVIDIA:main Apr 2, 2026
11 checks passed

yuanchen8911 mentioned this pull request Apr 16, 2026

fix(bundler): make undeploy.sh resilient to transient kubectl failures #599

Closed

24 tasks

yuanchen8911 mentioned this pull request Apr 16, 2026

feat(bundler): add Helm-annotation post-flight check to undeploy.sh #600

Closed

17 tasks

yuanchen8911 mentioned this pull request Apr 17, 2026

fix(bundler): fix undeploy template pre/post-flight checks #602

Merged

18 tasks

This was referenced Apr 17, 2026

Bridge Jobs for deploy/undeploy lifecycle flow control #610

Closed

[Feature]: Helmfile deployer for declarative, bash-free deployments #632

Closed

yuanchen8911 mentioned this pull request Apr 24, 2026

Move undeploy.sh to tools/ for dev use; direct users to deployer-native uninstall #671

Closed

6 tasks

lockwobr mentioned this pull request May 14, 2026

feat(bundler): add helmfile deployer (#632) #899

Merged

25 tasks

mchmarny mentioned this pull request May 28, 2026

feat(bundler): drop generated undeploy.sh; delegate to helm uninstall #1095

Merged

12 tasks

github-actions Bot locked as resolved and limited conversation to collaborators Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(bundler): scope cleanup to bundle components and remove stale skyhook taints#477

fix(bundler): scope cleanup to bundle components and remove stale skyhook taints#477
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-cleanup

yuanchen8911 commented Apr 1, 2026 •

edited

Loading

Uh oh!

yuanchen8911 commented Apr 1, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

yuanchen8911 commented Apr 1, 2026

Uh oh!

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented Apr 1, 2026 •

edited

Loading