fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest by yuanchen8911 · Pull Request #601 · NVIDIA/aicr

yuanchen8911 · 2026-04-17T00:11:09Z

Summary

Fix undeploy.sh's pre-flight finalizer check so that CRDs installed from a Helm chart's crds/ directory are discovered and their CR instances scanned. Today only CRDs declared in the chart's templates/ section (and therefore visible to helm get manifest) are checked, so most operator-managed CRDs slip through and the script can be tricked into removing the operator before its CRs are cleaned up.

Motivation / Context

Hit live on aicr-cuj2 today: I left a NIMService (demos/workloads/inference/nimservice-llama-3-2-1b.yaml) on the cluster, ran ./undeploy.sh, the pre-flight check passed, the NIM operator got uninstalled, and the script then "completed" — but the nimservices.apps.nvidia.com CRD's customresourcecleanup finalizer was waiting on the leftover NIMService CR, whose own NIM-operator finalizer could no longer be processed. The CRD stayed Terminating indefinitely, the nim-workload namespace stayed Terminating, and recovery required manual kubectl patch ... finalizers:null on the CR.

The pre-flight check exists exactly to prevent that scenario, but it didn't fire. Investigating: check_release_for_stuck_crds discovers CRDs by parsing the output of helm get manifest — which Helm restricts to templates/ content. The k8s-nim-operator, gpu-operator, kai-scheduler, etc. charts all ship their CRDs under crds/ (the Helm-recommended layout). Those CRDs are never visible to helm get manifest, so their CR instances are never scanned during pre-flight, and the script proceeds to remove the operator that was the only thing capable of clearing those CRs' finalizers.

Add a second discovery source: query CRDs annotated with meta.helm.sh/release-name matching the release (and meta.helm.sh/release-namespace matching the namespace). This is the same annotation the per-component cleanup loop in this file already filters by, and the same one #600 uses for the post-flight verification check. Bringing pre-flight into line means workload CRs get caught before the operator is removed, and the user is told (via the existing pre-flight error message) to delete them first.

Fixes: N/A (no GitHub issue filed)
Related: #599 (cleanup-pipeline resilience for the same code path), #600 (post-flight verification using the same annotation filter)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

Single function rewrite in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl:

check_release_for_stuck_crds() {
  local release="$1"
  local ns="$2"
  local manifest manifest_crds annotated_crds
  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
  manifest_crds=$(echo "${manifest}" \
    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
  annotated_crds=$(kubectl get crd -o json 2>/dev/null \
    | jq -r --arg rel "${release}" --arg ns "${ns}" \
      '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
  printf '%s\n%s\n' "${manifest_crds}" "${annotated_crds}" \
    | sort -u \
    | grep -v '^$' \
    | while read -r crd_name; do
        check_crd_for_stuck_resources "${crd_name}" "${release}"
      done
  return 0
}

Notes:

The two sources are deduplicated by sort -u, so CRDs visible to helm get manifest (e.g., a chart that declares its CRDs in templates/) are still scanned exactly once and existing behavior is preserved.
The annotation-based query is || true-guarded so a transient API failure during pre-flight doesn't false-negative the whole check.
The early return 0 if helm get manifest fails is intentional: it means the release isn't installed and there's nothing to check. Same guarantee as before.

Testing

GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/deployer/helm/...
make build
./dist/aicr_<host>/aicr bundle --recipe recipe.yaml --output /tmp/...
bash -n /tmp/.../undeploy.sh   # syntax check

Unit tests in pkg/bundler/deployer/helm/... pass with -race.
Regenerated a real bundle and inspected the rendered function — both discovery branches present, dedup logic in place.
bash -n on the generated script: clean.
Live re-test on aicr-cuj2 with a deliberately-applied NIMService is the next step (out of scope for this PR; will verify after merge).

Risk Assessment

Low — Isolated change, easy to revert
Medium
High

Rollout notes: Generated-script change only. No API surface, no migration. Worst-case failure mode is "pre-flight reports stuck CR for a CRD it would have ignored before" — which is the correct behavior; user follows the existing kubectl delete <cr> remediation hint or --skip-preflight to override. Reverting is a one-commit revert.

Checklist

Tests pass locally (make test with -race on pkg/bundler/...)
Linter passes — local make lint has pre-existing yamllint failures in untracked .codex-worktrees/; go vet + golangci-lint clean on the changed package
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (template-only change exercised by the existing helm bundler tests; the new discovery branch is bash + kubectl get crd, both runtime-only)
I updated docs if user-facing behavior changed (pre-flight error message wording is unchanged; only the coverage of which CRDs trigger it has been broadened)
Changes follow existing patterns in the codebase (the meta.helm.sh/release-name/-namespace filter is already used by the per-component cleanup loop and by feat(bundler): add Helm-annotation post-flight check to undeploy.sh #600's post-flight check)
Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

Bug Fixes
- Improved undeployment reliability by expanding CRD discovery to multiple sources (release manifests and cluster records). This ensures more complete identification and cleanup of custom resources, reducing the risk of orphaned resources or stuck CRDs remaining after undeploy operations.

coderabbitai · 2026-04-17T00:11:26Z

📝 Walkthrough

Walkthrough

The check_release_for_stuck_crds function now discovers CRD names from two sources: parsing helm get manifest and querying live cluster CRDs annotated with the release name/namespace via kubectl/jq. Results are concatenated, filtered, de-duplicated, and iterated for the same downstream checks.

Changes

Cohort / File(s)	Summary
Helm Undeploy CRD Detection `pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`	`check_release_for_stuck_crds` now builds CRD list from both manifest parsing (`helm get manifest` → `manifest_crds`) and live cluster annotation lookup (`kubectl` + `jq` → `annotated_crds`), concatenates, filters empty entries, de-duplicates with `sort -u`, and iterates the unique set. Adds dependency on `kubectl`/`jq` output for discovery.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the manifest and scurried the tree,
Two trails of CRDs now lead back to me,
I hop, I sort, I squash each twin,
Clean fields ahead — the undeploy grin! 🥕✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: improving CRD discovery in the pre-flight check by adding annotation-based detection alongside manifest parsing.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 211-214: The pipeline that builds CRD names from
"${manifest_crds}" and "${annotated_crds}" (the printf … | sort -u | grep -v
'^$' | while read -r crd_name; do ...) fails under set -euo pipefail when both
variables are empty because grep returns exit code 1; replace the grep filter
with awk 'NF' so blank lines are filtered but empty input yields exit code 0
(i.e., change "| grep -v '^$' \\" to "| awk 'NF' \\").

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 84489e24-38a1-4871-9f9f-a97cec71386b

📥 Commits

Reviewing files that changed from the base of the PR and between b82af6f and 7a77c5a.

📒 Files selected for processing (1)

pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

check_release_for_stuck_crds previously sourced its CRD list only from `helm get manifest`, which returns CRDs in the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). That meant workload CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight and blocked namespace deletion later. Augment the lookup with CRDs that carry the release's meta.helm.sh/release-name + release-namespace annotations. Union both sources, dedupe, and scan each CRD for CRs with active finalizers before uninstalling the operator. Also drop empty lines with `awk 'NF'` instead of `grep -v '^$'`: grep exits 1 when no lines match, which would abort the script under `set -euo pipefail` whenever a release has no CRDs (chart ships none, installed with --skip-crds, or already cleaned up). awk 'NF' is equivalent but always exits 0. Signed-off-by: Yuan Chen <[email protected]>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 205-210: The code currently aborts early because manifest=$(helm
get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the
annotated_crds kubectl/jq fallback from running; change this so a failing helm
get doesn't return from the function—instead set manifest to an empty string on
failure (e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or
use an if/then to only populate manifest when helm succeeds; keep the subsequent
manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command
so annotated CRDs are discovered when the Helm release record is missing or
corrupted.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: baeb7523-0bec-426d-9304-2e789a8b72a1

📥 Commits

Reviewing files that changed from the base of the PR and between 7a77c5a and 4a3597f.

📒 Files selected for processing (1)

pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

coderabbitai · 2026-04-17T00:20:06Z

  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
-  echo "${manifest}" \
-    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \
+  manifest_crds=$(echo "${manifest}" \
+    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
+  annotated_crds=$(kubectl get crd -o json 2>/dev/null \
+    | jq -r --arg rel "${release}" --arg ns "${ns}" \
+      '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)


⚠️ Potential issue | 🟠 Major

Don't return before the annotation fallback runs.

Line 205 exits the function on any helm get manifest failure, so the new kubectl/jq path never executes. That still misses the messy-state case where the Helm release record is gone or corrupted but its CRDs are still present and annotated in-cluster.

Suggested change

- manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 - manifest_crds=$(echo "${manifest}" \ + manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) + manifest_crds=$(printf '%s\n' "${manifest}" \ | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0

echo "${manifest}" \

| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \

manifest_crds=$(echo "${manifest}" \

| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')

annotated_crds=$(kubectl get crd -o json 2>/dev/null \

| jq -r --arg rel "${release}" --arg ns "${ns}" \

'.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)

manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)

manifest_crds=$(printf '%s\n' "${manifest}" \

| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')

annotated_crds=$(kubectl get crd -o json 2>/dev/null \

| jq -r --arg rel "${release}" --arg ns "${ns}" \

'.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 205 - 210, The code currently aborts early because manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the annotated_crds kubectl/jq fallback from running; change this so a failing helm get doesn't return from the function—instead set manifest to an empty string on failure (e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or use an if/then to only populate manifest when helm succeeds; keep the subsequent manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command so annotated CRDs are discovered when the Helm release record is missing or corrupted.

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 · 2026-04-17T00:25:01Z

Superseded by #602, which consolidates this change with #599 and #600. Closing to keep review focused on the combined PR.

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 · 2026-04-17T00:43:18Z

Thanks for the careful review — both points are addressed in the combined PR #602 (commit b6bde3b5).

P1 — annotation-only discovery misses crds/ CRDs. You're right: Helm's crds/-directory install path does not stamp meta.helm.sh/release-name onto those CRDs, and they also never appear in helm get manifest. So sources 1 and 2 in the previous version both missed them, which contradicted the PR's stated goal.

Fix in #602: check_release_for_stuck_crds now takes a third source — API groups owned by the release — and walks cluster CRDs whose group matches. The mapping is extracted into a release_groups() shell helper so pre-flight and the existing ORPHANED_CRD_GROUPS post-flight cleanup share a single source of truth. For example, on a bundle with gpu-operator, pre-flight now also scans CRDs in nvidia.com / nfd.k8s-sigs.io even if they were installed via crds/ with no annotations and aren't in the current chart's helm get manifest.

Updated docstring reflects exactly what each source covers and calls out the crds/ limitation for sources 1 and 2.

P2 — source <(awk ...) fails on macOS bash 3.2. Converted both subtests (delete_release_cluster_resources, force_clear_namespace_finalizers) to the snippet=$(sed -n '...'); eval "$snippet" pattern the third subtest (orphan_crd_inline_loop) already uses. Runs the same on every bash without process substitution, and sidesteps awk-dialect escaping differences. All three subtests pass locally on bash 3.2 and bash 5.x.

Closing this PR is already done — tracking in #602.

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across #599, #601, #600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was #599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was #601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was #600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: #477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>

yuanchen8911 requested a review from a team as a code owner April 17, 2026 00:11

yuanchen8911 added the area/bundler label Apr 17, 2026

github-actions Bot added the size/S label Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

yuanchen8911 force-pushed the fix/undeploy-preflight-annotation-crds branch from 7a77c5a to 4a3597f Compare April 17, 2026 00:16

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

This was referenced Apr 17, 2026

fix(bundler): fix undeploy template pre/post-flight checks #602

Merged

fix(bundler): make undeploy.sh resilient to transient kubectl failures #599

Closed

feat(bundler): add Helm-annotation post-flight check to undeploy.sh #600

Closed

yuanchen8911 closed this Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest#601

fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest#601
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-preflight-annotation-crds

yuanchen8911 commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 17, 2026

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

yuanchen8911 commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 17, 2026

Choose a reason for hiding this comment

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

yuanchen8911 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yuanchen8911 commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading