Skip to content

fix(bundler): scope cleanup to bundle components and remove stale skyhook taints#477

Merged
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-cleanup
Apr 2, 2026
Merged

fix(bundler): scope cleanup to bundle components and remove stale skyhook taints#477
mchmarny merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-cleanup

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Expand undeploy.sh and deploy.sh pre-flight to handle runtime-created CRDs, webhooks, and node taints that survive Helm uninstall
  • Fully scope cleanup to the specific bundle being deployed/undeployed to avoid affecting other releases on shared clusters
  • Fix stale skyhook node taints that block scheduling on redeploy

Changes

undeploy.sh:

  • Delete resource-policy: keep CRDs matching both Helm release-name AND release-namespace
  • Scope orphaned CRD groups to operators present in this bundle via Go template conditionals
  • Force-clear finalizers on CRDs stuck in deleting state
  • Delete stale webhooks whose backing service no longer exists
  • Remove skyhook node taints after operator uninstall (supports custom taint keys via runtimeRequiredTaint)

deploy.sh pre-flight:

  • Scope orphaned CRD group check to match bundle components via Go template conditionals
  • Detect and remove stale skyhook taints only when skyhook-operator is not already running (avoids stripping legitimate scheduling guards on active clusters)

Scoping: Both templates use Go template conditionals so CRD groups and skyhook cleanup are only emitted for components present in the current recipe. No hardcoded CRD groups or taint keys.

Test plan

  • go test ./pkg/bundler/deployer/helm/... passes
  • Generated NIM bundle includes only NIM-relevant CRD groups (no grove.io)
  • Generated Dynamo bundle includes only Dynamo-relevant CRD groups (no apps.nvidia.com)
  • Ran undeploy.sh --delete-pvcs on EKS cluster — post-flight reports "cluster is clean"
  • Verified deploy.sh pre-flight only removes skyhook taints when operator is not running
  • Verified custom taint key is read from skyhook-operator/values.yaml
  • Full deploy with deploy.sh completes successfully after undeploy

Closes #474

@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Addresses #474. See issue for the full list of stale resources observed before this fix.

@yuanchen8911 yuanchen8911 requested a review from mchmarny April 1, 2026 21:21
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-cleanup branch 2 times, most recently from 322a708 to 58ef0f9 Compare April 1, 2026 21:49
@yuanchen8911 yuanchen8911 changed the title fix(bundler): scope CRD/webhook cleanup to bundle components in deploy/undeploy fix(bundler): scope cleanup to bundle components and remove stale skyhook taints Apr 1, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-cleanup branch from 58ef0f9 to 6e41c33 Compare April 1, 2026 21:52
…y/undeploy

Expand deploy.sh pre-flight and undeploy.sh cleanup to handle runtime-created
resources, fully scoped to the specific bundle to avoid affecting other
releases on shared clusters.

undeploy.sh:
- Delete resource-policy:keep CRDs matching both release-name AND release-namespace
- Scope orphaned CRD groups to operators in this bundle (fully template-conditional)
- Force-clear finalizers on CRDs stuck in deleting state
- Delete stale webhooks with missing backing services

deploy.sh pre-flight:
- Scope orphaned CRD group check to match bundle components (template-conditional)

Both templates use Go template conditionals so CRD groups are only emitted
for components present in the current recipe (e.g., grove.io only when
dynamo-platform is included, apps.nvidia.com only when k8s-nim-operator is).

Closes NVIDIA#474
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-cleanup branch from 6e41c33 to af2278e Compare April 1, 2026 21:55
@yuanchen8911 yuanchen8911 requested a review from lockwobr April 1, 2026 22:00
@mchmarny mchmarny enabled auto-merge (squash) April 2, 2026 11:27
@mchmarny mchmarny disabled auto-merge April 2, 2026 11:28
@mchmarny mchmarny merged commit ca9d96c into NVIDIA:main Apr 2, 2026
11 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 16, 2026
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper
functions delete_release_cluster_resources and force_clear_namespace_finalizers
all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do ... done

With set -euo pipefail, a single transient kubectl failure (API 502, brief
auth refresh, etc.) causes the pipeline to exit non-zero and terminates the
script. In a typical run with ~15 components, this means any 100 ms control-plane
hiccup can kill undeploy.sh after Helm uninstalls succeeded but before
namespace cleanup, leaving the cluster in a partially-cleaned state that
the user must finish by hand.

Add a trailing || true to each of the three pipelines so transient API
errors during cleanup are tolerated. The adjacent ORPHANED_CRD_GROUPS loop
and stuck_crds extraction already use this pattern.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 16, 2026
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper
functions delete_release_cluster_resources and force_clear_namespace_finalizers
all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do ... done

With set -euo pipefail, a single transient kubectl failure (API 502, brief
auth refresh, etc.) causes the pipeline to exit non-zero and terminates the
script. In a typical run with ~15 components, this means any 100 ms control-plane
hiccup can kill undeploy.sh after Helm uninstalls succeeded but before
namespace cleanup, leaving the cluster in a partially-cleaned state that
the user must finish by hand.

Append a warn-on-failure handler to each of the three pipelines so transient
API errors during cleanup are tolerated AND visible: the script keeps going
into the post-flight verification phase, and operators see a clear stderr
warning naming the failing context (kind / release / namespace / component)
instead of nothing. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds
extraction already use the "tolerate-and-continue" pattern; this PR brings
the three remaining sites into line and adds context-rich warnings per
CodeRabbit review.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
The orphaned-CRD cleanup pipeline introduced in NVIDIA#477 and the two helper
functions delete_release_cluster_resources and force_clear_namespace_finalizers
all share the shape:

    kubectl get ... 2>/dev/null \
      | jq -r '...' 2>/dev/null \
      | while read -r name; do ... done

With set -euo pipefail, a single transient kubectl failure (API 502, brief
auth refresh, etc.) causes the pipeline to exit non-zero and terminates the
script. In a typical run with ~15 components, this means any 100 ms control-plane
hiccup can kill undeploy.sh after Helm uninstalls succeeded but before
namespace cleanup, leaving the cluster in a partially-cleaned state that
the user must finish by hand.

Append a warn-on-failure handler to each of the three pipelines so transient
API errors during cleanup are tolerated AND visible: the script keeps going
into the post-flight verification phase, and operators see a clear stderr
warning naming the failing context (kind / release / namespace / component)
instead of nothing. The adjacent ORPHANED_CRD_GROUPS loop and stuck_crds
extraction already use the "tolerate-and-continue" pattern; this PR brings
the three remaining sites into line and adds context-rich warnings per
CodeRabbit review.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across #599, #601, #600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was #599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was #601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was #600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jul 2, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: undeploy.sh misses runtime-created CRDs, webhooks, and workload namespaces

2 participants