feat(bundler): add Helm-annotation post-flight check to undeploy.sh by yuanchen8911 · Pull Request #600 · NVIDIA/aicr

yuanchen8911 · 2026-04-16T23:55:56Z

Summary

Add a post-flight verification pass that re-enumerates CRDs filtered by the same meta.helm.sh/release-name annotation used during the per-component CRD deletion loop, and reports any leftovers as a WARNING (also flips postflight_issues=true).

Motivation / Context

Today the undeploy.sh post-flight section checks for:

terminating namespaces,
unavailable API services,
operator-runtime CRDs in ORPHANED_CRD_GROUPS (e.g. nvidia.com, nfd.k8s-sigs.io, kai.scheduler).

Helm-managed CRDs that survive the per-release deletion loop are not surfaced anywhere — they go unnoticed until the next deploy.sh run rejects them.

After #599, the per-component CRD cleanup pipelines no longer kill the script on a transient kubectl/jq failure; they emit a Warning: to stderr and continue. That makes a verification check at the end of the script the natural complement: confirm the warnings did not correspond to real leftovers, and warn (with deletion command) if they did.

Suggested by CodeRabbit review on #599 (suggestion #2). I deferred it in #599 to keep that PR focused on the immediate set -e exit fix; this PR is the dedicated follow-up.

Fixes: N/A
Related: #599 (warn-on-failure resilience for the same pipelines), #477 (introduced the per-component cleanup loop)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

Single block added to pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl immediately after the existing ORPHANED_CRD_GROUPS post-flight check, mirroring its style:

helm_orphaned_crds=""
{{- range .ComponentsReversed }}{{ if .HasChart }}
remaining_helm_crds=$(kubectl get crd -o json 2>/dev/null \
  | jq -r --arg rel "{{ .Name }}" --arg ns "{{ .Namespace }}" \
    '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
if [[ -n "${remaining_helm_crds}" ]]; then
  helm_orphaned_crds="${helm_orphaned_crds} ${remaining_helm_crds}"
fi
{{- end }}{{ end }}
if [[ -n "${helm_orphaned_crds}" ]]; then
  echo "WARNING: Helm-annotated CRDs from uninstalled releases still present:${helm_orphaned_crds}"
  echo "  Cleanup did not remove all CRDs owned by this bundle's releases."
  echo "  Delete with: kubectl delete crd <name>"
  postflight_issues=true
fi

Notes:

One stanza per Helm component (matches the existing per-component delete loop's templating).
Each per-release kubectl|jq is || true-guarded so a transient API failure during post-flight doesn't itself trip set -e.
On a clean undeploy (cleanup succeeded), no annotated CRDs are found, the variable stays empty, and the check is silent.
Only fires when one of the cleanup pipelines either silently failed (pre-fix(bundler): make undeploy.sh resilient to transient kubectl failures #599) or warned-and-continued (fix(bundler): make undeploy.sh resilient to transient kubectl failures #599 onward) but the underlying CRDs are still on the cluster.

Testing

GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/deployer/helm/...
make build
./dist/aicr_<host>/aicr bundle --recipe recipe.yaml --output /tmp/...
bash -n /tmp/.../undeploy.sh   # syntax check

Unit tests in pkg/bundler/deployer/helm/... pass with -race.
Regenerated a real bundle (15-component recipe), confirmed one stanza per HelmChart component is emitted into the post-flight section.
bash -n on the generated script: clean.

Risk Assessment

Low — Isolated change, easy to revert
Medium
High

Rollout notes: Generated-script change only. No API surface, no migration. Failure mode is "warning printed where there shouldn't have been one" (false positive if a CRD legitimately keeps its Helm annotation post-uninstall — none of the bundle's components do this today). Reverting is a one-commit revert.

Checklist

Tests pass locally (make test with -race on pkg/bundler/...)
Linter passes — local make lint has pre-existing yamllint failures in untracked .codex-worktrees/; go vet + golangci-lint clean on the changed package
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (the change is template-only and exercised by the existing helm bundler tests; no new behavior to assert in Go)
I updated docs if user-facing behavior changed (post-flight warnings surface in the same WARNING: block already documented)
Changes follow existing patterns in the codebase (mirrors the ORPHANED_CRD_GROUPS post-flight check directly above)
Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

New Features
- Improved undeploy post-flight checks for chart-based components: scans for remaining Custom Resource Definitions, excludes those already terminating, aggregates any leftovers, lists their names in a warning, and flags the post-flight as having issues so stale resources are reported.

coderabbitai · 2026-04-16T23:56:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: ca8efb4b-d807-4c1a-ad1e-c7ac102201bc

📥 Commits

Reviewing files that changed from the base of the PR and between 15d4aef and 381731f.

📒 Files selected for processing (1)

pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

📝 Walkthrough

Walkthrough

A new post-flight verification is added to the undeploy script template that scans all CRDs, filters those annotated with Helm release metadata (meta.helm.sh/release-name and meta.helm.sh/release-namespace) matching each chart component, and reports any remaining CRDs as warnings.

Changes

Cohort / File(s)	Summary
Helm Undeploy Post-Flight Verification `pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`	Adds a post-flight check that queries all CRDs, filters by Helm ownership annotations (`meta.helm.sh/release-name` and `meta.helm.sh/release-namespace`) while excluding CRDs with `.metadata.deletionTimestamp` set, aggregates matches per component, sets `postflight_issues=true`, and emits a warning block listing remaining CRD names.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15-30 minutes

Poem

🐰 Hopping through manifests under moonlit skies,
Searching for CRDs that stubbornly rise,
With kubectl and jq I gently comb,
Listing the stragglers before they're gone home,
A tidy depart — hop, skip, and goodbyes.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately summarizes the main change: adding a Helm-annotation post-flight check to the undeploy.sh template. It is concise, specific, and clearly conveys the primary modification.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

… just chart manifest The pre-flight finalizer check (check_release_for_stuck_crds) discovers CRDs by parsing `helm get manifest`. That only returns the chart's templates/ section — not the crds/ directory. Most operator charts in this bundle (k8s-nim-operator, gpu-operator, kai-scheduler, etc.) ship their CRDs under crds/ per Helm's recommended layout, so those CRDs are invisible to pre-flight today. Effect: workload CRs (e.g., NIMService instances created from demos/workloads/inference/nim*.yaml) are not scanned for finalizers before the operator is uninstalled. The operator's finalizer on the CR becomes orphaned, the CRD's customresourcecleanup finalizer waits forever, and the namespace stays Terminating. Recovery requires manual finalizer patching. Discover CRDs from a second source as well: any CRD annotated with meta.helm.sh/release-name matching the release. This catches crds/- installed CRDs and any CRD created by the release at runtime. The two sources are deduplicated before per-CRD scanning, so existing behavior is preserved and CRDs visible only via the manifest still get checked. The same annotation-based filter is already used by the per-component CRD cleanup loop and (in NVIDIA#600) by the post-flight verification check. This brings pre-flight into line.

Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by the same meta.helm.sh/release-name annotation used by the per-component CRD deletion loop, and surfaces any leftovers as a post-flight warning. Today the script's post-flight section only checks for terminating namespaces, unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs. Helm-managed CRDs that survive the deletion loop go unnoticed until the next deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on- failure instead of script-killing, this check is the missing verification surface that confirms the warnings did not point to actual leftovers. The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per- release kubectl|jq pipeline guarded with `|| true`, results accumulated into a helm_orphaned_crds variable, single warning emitted if non-empty. When cleanup succeeded normally the variable stays empty and the check is silent. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded from the rescan: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD still listed even though it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs should be surfaced as a warning. Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in response to review feedback on PR NVIDIA#600. Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 · 2026-04-17T00:24:59Z

Superseded by #602, which consolidates this change with #599 and #601. Closing to keep review focused on the combined PR.

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across #599, #601, #600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was #599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was #601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was #600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: #477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 requested a review from a team as a code owner April 16, 2026 23:55

yuanchen8911 added the area/bundler label Apr 16, 2026

github-actions Bot added the size/S label Apr 16, 2026

yuanchen8911 mentioned this pull request Apr 17, 2026

fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest #601

Closed

17 tasks

yuanchen8911 force-pushed the feat/undeploy-helm-crd-postflight branch from 15d4aef to 381731f Compare April 17, 2026 00:11

This was referenced Apr 17, 2026

fix(bundler): fix undeploy template pre/post-flight checks #602

Merged

fix(bundler): make undeploy.sh resilient to transient kubectl failures #599

Closed

yuanchen8911 closed this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600

feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:feat/undeploy-helm-crd-postflight

yuanchen8911 commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yuanchen8911 commented Apr 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yuanchen8911 commented Apr 16, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 16, 2026 •

edited

Loading