Skip to content

feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600

Closed
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/undeploy-helm-crd-postflight
Closed

feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/undeploy-helm-crd-postflight

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Add a post-flight verification pass that re-enumerates CRDs filtered by the same meta.helm.sh/release-name annotation used during the per-component CRD deletion loop, and reports any leftovers as a WARNING (also flips postflight_issues=true).

Motivation / Context

Today the undeploy.sh post-flight section checks for:

  • terminating namespaces,
  • unavailable API services,
  • operator-runtime CRDs in ORPHANED_CRD_GROUPS (e.g. nvidia.com, nfd.k8s-sigs.io, kai.scheduler).

Helm-managed CRDs that survive the per-release deletion loop are not surfaced anywhere — they go unnoticed until the next deploy.sh run rejects them.

After #599, the per-component CRD cleanup pipelines no longer kill the script on a transient kubectl/jq failure; they emit a Warning: to stderr and continue. That makes a verification check at the end of the script the natural complement: confirm the warnings did not correspond to real leftovers, and warn (with deletion command) if they did.

Suggested by CodeRabbit review on #599 (suggestion #2). I deferred it in #599 to keep that PR focused on the immediate set -e exit fix; this PR is the dedicated follow-up.

Fixes: N/A
Related: #599 (warn-on-failure resilience for the same pipelines), #477 (introduced the per-component cleanup loop)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

Single block added to pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl immediately after the existing ORPHANED_CRD_GROUPS post-flight check, mirroring its style:

helm_orphaned_crds=""
{{- range .ComponentsReversed }}{{ if .HasChart }}
remaining_helm_crds=$(kubectl get crd -o json 2>/dev/null \
  | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \
    '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
if [[ -n "${remaining_helm_crds}" ]]; then
  helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}"
fi
{{- end }}{{ end }}
if [[ -n "${helm_orphaned_crds}" ]]; then
  echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}"
  echo "  Cleanup did not remove all CRDs owned by this bundle's releases."
  echo "  Delete with: kubectl delete crd <name>"
  postflight_issues=true
fi

Notes:

Testing

GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/deployer/helm/...
make build
./dist/aicr_<host>/aicr bundle --recipe recipe.yaml --output /tmp/...
bash -n /tmp/.../undeploy.sh   # syntax check
  • Unit tests in pkg/bundler/deployer/helm/... pass with -race.
  • Regenerated a real bundle (15-component recipe), confirmed one stanza per HelmChart component is emitted into the post-flight section.
  • bash -n on the generated script: clean.

Risk Assessment

  • Low — Isolated change, easy to revert
  • Medium
  • High

Rollout notes: Generated-script change only. No API surface, no migration. Failure mode is "warning printed where there shouldn't have been one" (false positive if a CRD legitimately keeps its Helm annotation post-uninstall — none of the bundle's components do this today). Reverting is a one-commit revert.

Checklist

  • Tests pass locally (make test with -race on pkg/bundler/...)
  • Linter passes — local make lint has pre-existing yamllint failures in untracked .codex-worktrees/; go vet + golangci-lint clean on the changed package
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (the change is template-only and exercised by the existing helm bundler tests; no new behavior to assert in Go)
  • I updated docs if user-facing behavior changed (post-flight warnings surface in the same WARNING: block already documented)
  • Changes follow existing patterns in the codebase (mirrors the ORPHANED_CRD_GROUPS post-flight check directly above)
  • Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

  • New Features
    • Improved undeploy post-flight checks for chart-based components: scans for remaining Custom Resource Definitions, excludes those already terminating, aggregates any leftovers, lists their names in a warning, and flags the post-flight as having issues so stale resources are reported.

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 16, 2026 23:55
@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ca8efb4b-d807-4c1a-ad1e-c7ac102201bc

📥 Commits

Reviewing files that changed from the base of the PR and between 15d4aef and 381731f.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

📝 Walkthrough

Walkthrough

A new post-flight verification is added to the undeploy script template that scans all CRDs, filters those annotated with Helm release metadata (meta.helm.sh/release-name and meta.helm.sh/release-namespace) matching each chart component, and reports any remaining CRDs as warnings.

Changes

Cohort / File(s) Summary
Helm Undeploy Post-Flight Verification
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
Adds a post-flight check that queries all CRDs, filters by Helm ownership annotations (meta.helm.sh/release-name and meta.helm.sh/release-namespace) while excluding CRDs with .metadata.deletionTimestamp set, aggregates matches per component, sets postflight_issues=true, and emits a warning block listing remaining CRD names.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15-30 minutes

Poem

🐰 Hopping through manifests under moonlit skies,
Searching for CRDs that stubbornly rise,
With kubectl and jq I gently comb,
Listing the stragglers before they're gone home,
A tidy depart — hop, skip, and goodbyes.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately summarizes the main change: adding a Helm-annotation post-flight check to the undeploy.sh template. It is concise, specific, and clearly conveys the primary modification.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
… just chart manifest

The pre-flight finalizer check (check_release_for_stuck_crds) discovers
CRDs by parsing `helm get manifest`. That only returns the chart's
templates/ section — not the crds/ directory. Most operator charts in
this bundle (k8s-nim-operator, gpu-operator, kai-scheduler, etc.) ship
their CRDs under crds/ per Helm's recommended layout, so those CRDs
are invisible to pre-flight today.

Effect: workload CRs (e.g., NIMService instances created from
demos/workloads/inference/nim*.yaml) are not scanned for finalizers
before the operator is uninstalled. The operator's finalizer on the CR
becomes orphaned, the CRD's customresourcecleanup finalizer waits
forever, and the namespace stays Terminating. Recovery requires manual
finalizer patching.

Discover CRDs from a second source as well: any CRD annotated with
meta.helm.sh/release-name matching the release. This catches crds/-
installed CRDs and any CRD created by the release at runtime. The two
sources are deduplicated before per-CRD scanning, so existing behavior
is preserved and CRDs visible only via the manifest still get checked.

The same annotation-based filter is already used by the per-component
CRD cleanup loop and (in NVIDIA#600) by the post-flight verification check.
This brings pre-flight into line.
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs
filtered by the same meta.helm.sh/release-name annotation used by the
per-component CRD deletion loop, and surfaces any leftovers as a post-flight
warning.

Today the script's post-flight section only checks for terminating namespaces,
unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs.
Helm-managed CRDs that survive the deletion loop go unnoticed until the next
deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on-
failure instead of script-killing, this check is the missing verification
surface that confirms the warnings did not point to actual leftovers.

The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per-
release kubectl|jq pipeline guarded with `|| true`, results accumulated into
a helm_orphaned_crds variable, single warning emitted if non-empty. When
cleanup succeeded normally the variable stays empty and the check is silent.

CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded
from the rescan: the earlier `kubectl delete crd ... --wait=false` returns
immediately, so a slow finalizer can leave the CRD still listed even though
it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs
should be surfaced as a warning.

Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in
response to review feedback on PR NVIDIA#600.

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the feat/undeploy-helm-crd-postflight branch from 15d4aef to 381731f Compare April 17, 2026 00:11
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Superseded by #602, which consolidates this change with #599 and #601. Closing to keep review focused on the combined PR.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across #599, #601, #600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was #599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was #601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was #600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant