Skip to content

fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest#601

Closed
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-preflight-annotation-crds
Closed

fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest#601
yuanchen8911 wants to merge 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-preflight-annotation-crds

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 17, 2026

Copy link
Copy Markdown
Contributor

Summary

Fix undeploy.sh's pre-flight finalizer check so that CRDs installed from a Helm chart's crds/ directory are discovered and their CR instances scanned. Today only CRDs declared in the chart's templates/ section (and therefore visible to helm get manifest) are checked, so most operator-managed CRDs slip through and the script can be tricked into removing the operator before its CRs are cleaned up.

Motivation / Context

Hit live on aicr-cuj2 today: I left a NIMService (demos/workloads/inference/nimservice-llama-3-2-1b.yaml) on the cluster, ran ./undeploy.sh, the pre-flight check passed, the NIM operator got uninstalled, and the script then "completed" — but the nimservices.apps.nvidia.com CRD's customresourcecleanup finalizer was waiting on the leftover NIMService CR, whose own NIM-operator finalizer could no longer be processed. The CRD stayed Terminating indefinitely, the nim-workload namespace stayed Terminating, and recovery required manual kubectl patch ... finalizers:null on the CR.

The pre-flight check exists exactly to prevent that scenario, but it didn't fire. Investigating: check_release_for_stuck_crds discovers CRDs by parsing the output of helm get manifest — which Helm restricts to templates/ content. The k8s-nim-operator, gpu-operator, kai-scheduler, etc. charts all ship their CRDs under crds/ (the Helm-recommended layout). Those CRDs are never visible to helm get manifest, so their CR instances are never scanned during pre-flight, and the script proceeds to remove the operator that was the only thing capable of clearing those CRs' finalizers.

Add a second discovery source: query CRDs annotated with meta.helm.sh/release-name matching the release (and meta.helm.sh/release-namespace matching the namespace). This is the same annotation the per-component cleanup loop in this file already filters by, and the same one #600 uses for the post-flight verification check. Bringing pre-flight into line means workload CRs get caught before the operator is removed, and the user is told (via the existing pre-flight error message) to delete them first.

Fixes: N/A (no GitHub issue filed)
Related: #599 (cleanup-pipeline resilience for the same code path), #600 (post-flight verification using the same annotation filter)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

Single function rewrite in pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl:

check_release_for_stuck_crds() {
  local release="$1"
  local ns="$2"
  local manifest manifest_crds annotated_crds
  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
  manifest_crds=$(echo "${manifest}" \
    | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
  annotated_crds=$(kubectl get crd -o json 2>/dev/null \
    | jq -r --arg rel "${release}" --arg ns "${ns}" \
      '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
  printf '%s\n%s\n' "${manifest_crds}" "${annotated_crds}" \
    | sort -u \
    | grep -v '^$' \
    | while read -r crd_name; do
        check_crd_for_stuck_resources "${crd_name}" "${release}"
      done
  return 0
}

Notes:

  • The two sources are deduplicated by sort -u, so CRDs visible to helm get manifest (e.g., a chart that declares its CRDs in templates/) are still scanned exactly once and existing behavior is preserved.
  • The annotation-based query is || true-guarded so a transient API failure during pre-flight doesn't false-negative the whole check.
  • The early return 0 if helm get manifest fails is intentional: it means the release isn't installed and there's nothing to check. Same guarantee as before.

Testing

GOFLAGS=-mod=vendor go test -race -count=1 ./pkg/bundler/deployer/helm/...
make build
./dist/aicr_<host>/aicr bundle --recipe recipe.yaml --output /tmp/...
bash -n /tmp/.../undeploy.sh   # syntax check
  • Unit tests in pkg/bundler/deployer/helm/... pass with -race.
  • Regenerated a real bundle and inspected the rendered function — both discovery branches present, dedup logic in place.
  • bash -n on the generated script: clean.
  • Live re-test on aicr-cuj2 with a deliberately-applied NIMService is the next step (out of scope for this PR; will verify after merge).

Risk Assessment

  • Low — Isolated change, easy to revert
  • Medium
  • High

Rollout notes: Generated-script change only. No API surface, no migration. Worst-case failure mode is "pre-flight reports stuck CR for a CRD it would have ignored before" — which is the correct behavior; user follows the existing kubectl delete <cr> remediation hint or --skip-preflight to override. Reverting is a one-commit revert.

Checklist

  • Tests pass locally (make test with -race on pkg/bundler/...)
  • Linter passes — local make lint has pre-existing yamllint failures in untracked .codex-worktrees/; go vet + golangci-lint clean on the changed package
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (template-only change exercised by the existing helm bundler tests; the new discovery branch is bash + kubectl get crd, both runtime-only)
  • I updated docs if user-facing behavior changed (pre-flight error message wording is unchanged; only the coverage of which CRDs trigger it has been broadened)
  • Changes follow existing patterns in the codebase (the meta.helm.sh/release-name/-namespace filter is already used by the per-component cleanup loop and by feat(bundler): add Helm-annotation post-flight check to undeploy.sh #600's post-flight check)
  • Commits are cryptographically signed (git commit -S)

Summary by CodeRabbit

  • Bug Fixes
    • Improved undeployment reliability by expanding CRD discovery to multiple sources (release manifests and cluster records). This ensures more complete identification and cleanup of custom resources, reducing the risk of orphaned resources or stuck CRDs remaining after undeploy operations.

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 17, 2026 00:11
@coderabbitai

coderabbitai Bot commented Apr 17, 2026

Copy link
Copy Markdown
📝 Walkthrough

Walkthrough

The check_release_for_stuck_crds function now discovers CRD names from two sources: parsing helm get manifest and querying live cluster CRDs annotated with the release name/namespace via kubectl/jq. Results are concatenated, filtered, de-duplicated, and iterated for the same downstream checks.

Changes

Cohort / File(s) Summary
Helm Undeploy CRD Detection
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
check_release_for_stuck_crds now builds CRD list from both manifest parsing (helm get manifestmanifest_crds) and live cluster annotation lookup (kubectl + jqannotated_crds), concatenates, filters empty entries, de-duplicates with sort -u, and iterates the unique set. Adds dependency on kubectl/jq output for discovery.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I sniffed the manifest and scurried the tree,
Two trails of CRDs now lead back to me,
I hop, I sort, I squash each twin,
Clean fields ahead — the undeploy grin! 🥕✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: improving CRD discovery in the pre-flight check by adding annotation-based detection alongside manifest parsing.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 211-214: The pipeline that builds CRD names from
"${manifest_crds}" and "${annotated_crds}" (the printf … | sort -u | grep -v
'^$' | while read -r crd_name; do ...) fails under set -euo pipefail when both
variables are empty because grep returns exit code 1; replace the grep filter
with awk 'NF' so blank lines are filtered but empty input yields exit code 0
(i.e., change "| grep -v '^$' \\" to "| awk 'NF' \\").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 84489e24-38a1-4871-9f9f-a97cec71386b

📥 Commits

Reviewing files that changed from the base of the PR and between b82af6f and 7a77c5a.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
check_release_for_stuck_crds previously sourced its CRD list only from
`helm get manifest`, which returns CRDs in the chart's templates/
section but omits CRDs installed from the chart's crds/ directory (the
Helm-recommended layout used by most operator charts). That meant
workload CRs backed by crds/-installed CRDs (e.g., NIMService) slipped
past the finalizer pre-flight and blocked namespace deletion later.

Augment the lookup with CRDs that carry the release's
meta.helm.sh/release-name + release-namespace annotations. Union both
sources, dedupe, and scan each CRD for CRs with active finalizers
before uninstalling the operator.

Also drop empty lines with `awk 'NF'` instead of `grep -v '^$'`: grep
exits 1 when no lines match, which would abort the script under
`set -euo pipefail` whenever a release has no CRDs (chart ships none,
installed with --skip-crds, or already cleaned up). awk 'NF' is
equivalent but always exits 0.

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-preflight-annotation-crds branch from 7a77c5a to 4a3597f Compare April 17, 2026 00:16

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 205-210: The code currently aborts early because manifest=$(helm
get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the
annotated_crds kubectl/jq fallback from running; change this so a failing helm
get doesn't return from the function—instead set manifest to an empty string on
failure (e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or
use an if/then to only populate manifest when helm succeeds; keep the subsequent
manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command
so annotated CRDs are discovered when the Helm release record is missing or
corrupted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: baeb7523-0bec-426d-9304-2e789a8b72a1

📥 Commits

Reviewing files that changed from the base of the PR and between 7a77c5a and 4a3597f.

📒 Files selected for processing (1)
  • pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl

Comment on lines 205 to +210
manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
echo "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \
manifest_crds=$(echo "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
annotated_crds=$(kubectl get crd -o json 2>/dev/null \
| jq -r --arg rel "${release}" --arg ns "${ns}" \
'.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't return before the annotation fallback runs.

Line 205 exits the function on any helm get manifest failure, so the new kubectl/jq path never executes. That still misses the messy-state case where the Helm release record is gone or corrupted but its CRDs are still present and annotated in-cluster.

Suggested change
-  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
-  manifest_crds=$(echo "${manifest}" \
+  manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
+  manifest_crds=$(printf '%s\n' "${manifest}" \
     | awk '/^kind:/{kind=$2} /^  name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
echo "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \
manifest_crds=$(echo "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
annotated_crds=$(kubectl get crd -o json 2>/dev/null \
| jq -r --arg rel "${release}" --arg ns "${ns}" \
'.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
manifest_crds=$(printf '%s\n' "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')
annotated_crds=$(kubectl get crd -o json 2>/dev/null \
| jq -r --arg rel "${release}" --arg ns "${ns}" \
'.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 205 - 210,
The code currently aborts early because manifest=$(helm get manifest
"${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the annotated_crds
kubectl/jq fallback from running; change this so a failing helm get doesn't
return from the function—instead set manifest to an empty string on failure
(e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or use an
if/then to only populate manifest when helm succeeds; keep the subsequent
manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command
so annotated CRDs are discovered when the Helm release record is missing or
corrupted.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Superseded by #602, which consolidates this change with #599 and #600. Closing to keep review focused on the combined PR.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Thanks for the careful review — both points are addressed in the combined PR #602 (commit b6bde3b5).

P1 — annotation-only discovery misses crds/ CRDs. You're right: Helm's crds/-directory install path does not stamp meta.helm.sh/release-name onto those CRDs, and they also never appear in helm get manifest. So sources 1 and 2 in the previous version both missed them, which contradicted the PR's stated goal.

Fix in #602: check_release_for_stuck_crds now takes a third source — API groups owned by the release — and walks cluster CRDs whose group matches. The mapping is extracted into a release_groups() shell helper so pre-flight and the existing ORPHANED_CRD_GROUPS post-flight cleanup share a single source of truth. For example, on a bundle with gpu-operator, pre-flight now also scans CRDs in nvidia.com / nfd.k8s-sigs.io even if they were installed via crds/ with no annotations and aren't in the current chart's helm get manifest.

Updated docstring reflects exactly what each source covers and calls out the crds/ limitation for sources 1 and 2.

P2 — source <(awk ...) fails on macOS bash 3.2. Converted both subtests (delete_release_cluster_resources, force_clear_namespace_finalizers) to the snippet=$(sed -n '...'); eval "$snippet" pattern the third subtest (orphan_crd_inline_loop) already uses. Runs the same on every bash without process substitution, and sidesteps awk-dialect escaping differences. All three subtests pass locally on bash 3.2 and bash 5.x.

Closing this PR is already done — tracking in #602.

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),
NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR).

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across #599, #601, #600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was #599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was #601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was #600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: #477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 17, 2026
Combines three related hardening changes to the generated undeploy.sh
(previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a
theme -- make undeploy recover from partial or transient failure --
so bundling them avoids three sequential PR rebases over the same
template.

1. Transient-failure resilience (was NVIDIA#599).
   Three `kubectl | jq | while` pipelines in the post-uninstall cleanup
   path ran under `set -euo pipefail` with no fallback. A momentary
   control-plane 502 or auth refresh killed the whole script after
   Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and
   `Active` namespaces. Wrap each pipeline's trailing `done` in
   `|| echo "Warning: ..." >&2` so the failure is visible but the
   script continues. Matches the `|| true` pattern already used in the
   adjacent ORPHANED_CRD_GROUPS loop.

2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601).
   check_release_for_stuck_crds sourced its CRD list only from
   `helm get manifest`, which returns CRDs under the chart's templates/
   section but omits CRDs installed from the chart's crds/ directory
   (the Helm-recommended layout used by most operator charts). CRs
   backed by crds/-installed CRDs (e.g., NIMService) slipped past the
   finalizer pre-flight. Union both sources (manifest + annotated) and
   dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty
   lines without the grep-exits-1-on-no-match behavior, which would
   abort the script under pipefail when a release has no CRDs.

3. Post-flight Helm-annotation CRD check (was NVIDIA#600).
   Helm-managed CRDs that survive the per-release deletion loop went
   unnoticed until the next deploy.sh run rejected them. Add a
   verification pass at the end of undeploy.sh that re-enumerates CRDs
   filtered by meta.helm.sh/release-name + release-namespace
   annotations and surfaces any leftovers as a post-flight warning.
   CRDs already being deleted (non-null .metadata.deletionTimestamp)
   are excluded: the earlier `kubectl delete crd ... --wait=false`
   returns immediately, so a slow finalizer can leave the CRD listed
   though cleanup is normal. Only truly stuck (not-yet-terminating)
   CRDs are reported.

Unit test TestUndeployScript_TransientFailureWarnsAndContinues
generates a bundle, stubs kubectl to exit non-zero, sources each of
the three helpers, and asserts the helper returns 0 and emits the
expected Warning: on stderr.

Fixes: N/A
Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline),

Signed-off-by: Yuan Chen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant