fix(bundler): pre-flight should discover CRDs by Helm annotation, not just chart manifest#601
Conversation
📝 WalkthroughWalkthroughThe Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 211-214: The pipeline that builds CRD names from
"${manifest_crds}" and "${annotated_crds}" (the printf … | sort -u | grep -v
'^$' | while read -r crd_name; do ...) fails under set -euo pipefail when both
variables are empty because grep returns exit code 1; replace the grep filter
with awk 'NF' so blank lines are filtered but empty input yields exit code 0
(i.e., change "| grep -v '^$' \\" to "| awk 'NF' \\").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 84489e24-38a1-4871-9f9f-a97cec71386b
📒 Files selected for processing (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
check_release_for_stuck_crds previously sourced its CRD list only from `helm get manifest`, which returns CRDs in the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). That meant workload CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight and blocked namespace deletion later. Augment the lookup with CRDs that carry the release's meta.helm.sh/release-name + release-namespace annotations. Union both sources, dedupe, and scan each CRD for CRs with active finalizers before uninstalling the operator. Also drop empty lines with `awk 'NF'` instead of `grep -v '^$'`: grep exits 1 when no lines match, which would abort the script under `set -euo pipefail` whenever a release has no CRDs (chart ships none, installed with --skip-crds, or already cleaned up). awk 'NF' is equivalent but always exits 0. Signed-off-by: Yuan Chen <[email protected]>
7a77c5a to
4a3597f
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl`:
- Around line 205-210: The code currently aborts early because manifest=$(helm
get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the
annotated_crds kubectl/jq fallback from running; change this so a failing helm
get doesn't return from the function—instead set manifest to an empty string on
failure (e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or
use an if/then to only populate manifest when helm succeeds; keep the subsequent
manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command
so annotated CRDs are discovered when the Helm release record is missing or
corrupted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: baeb7523-0bec-426d-9304-2e789a8b72a1
📒 Files selected for processing (1)
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl
| manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 | ||
| echo "${manifest}" \ | ||
| | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \ | ||
| manifest_crds=$(echo "${manifest}" \ | ||
| | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') | ||
| annotated_crds=$(kubectl get crd -o json 2>/dev/null \ | ||
| | jq -r --arg rel "${release}" --arg ns "${ns}" \ | ||
| '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) |
There was a problem hiding this comment.
Don't return before the annotation fallback runs.
Line 205 exits the function on any helm get manifest failure, so the new kubectl/jq path never executes. That still misses the messy-state case where the Helm release record is gone or corrupted but its CRDs are still present and annotated in-cluster.
Suggested change
- manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0
- manifest_crds=$(echo "${manifest}" \
+ manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true)
+ manifest_crds=$(printf '%s\n' "${manifest}" \
| awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}')📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null) || return 0 | |
| echo "${manifest}" \ | |
| | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}' \ | |
| manifest_crds=$(echo "${manifest}" \ | |
| | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') | |
| annotated_crds=$(kubectl get crd -o json 2>/dev/null \ | |
| | jq -r --arg rel "${release}" --arg ns "${ns}" \ | |
| '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) | |
| manifest=$(helm get manifest "${release}" -n "${ns}" 2>/dev/null || true) | |
| manifest_crds=$(printf '%s\n' "${manifest}" \ | |
| | awk '/^kind:/{kind=$2} /^ name:/ && kind=="CustomResourceDefinition"{print $2; kind=""}') | |
| annotated_crds=$(kubectl get crd -o json 2>/dev/null \ | |
| | jq -r --arg rel "${release}" --arg ns "${ns}" \ | |
| '.items[] | select(.metadata.annotations["meta.helm.sh/release-name"]==$rel and .metadata.annotations["meta.helm.sh/release-namespace"]==$ns) | .metadata.name' 2>/dev/null || true) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl` around lines 205 - 210,
The code currently aborts early because manifest=$(helm get manifest
"${release}" -n "${ns}" 2>/dev/null) || return 0 prevents the annotated_crds
kubectl/jq fallback from running; change this so a failing helm get doesn't
return from the function—instead set manifest to an empty string on failure
(e.g. manifest=$(helm get manifest ... 2>/dev/null) || manifest="") or use an
if/then to only populate manifest when helm succeeds; keep the subsequent
manifest_crds parsing (awk) and always run the annotated_crds kubectl/jq command
so annotated CRDs are discovered when the Helm release record is missing or
corrupted.
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
|
Thanks for the careful review — both points are addressed in the combined PR #602 (commit P1 — annotation-only discovery misses Fix in #602: Updated docstring reflects exactly what each source covers and calls out the P2 — Closing this PR is already done — tracking in #602. |
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across #599, #601, #600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was #599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was #601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was #600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: #477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Summary
Fix
undeploy.sh's pre-flight finalizer check so that CRDs installed from a Helm chart'scrds/directory are discovered and their CR instances scanned. Today only CRDs declared in the chart'stemplates/section (and therefore visible tohelm get manifest) are checked, so most operator-managed CRDs slip through and the script can be tricked into removing the operator before its CRs are cleaned up.Motivation / Context
Hit live on
aicr-cuj2today: I left aNIMService(demos/workloads/inference/nimservice-llama-3-2-1b.yaml) on the cluster, ran./undeploy.sh, the pre-flight check passed, the NIM operator got uninstalled, and the script then "completed" — but thenimservices.apps.nvidia.comCRD'scustomresourcecleanupfinalizer was waiting on the leftover NIMService CR, whose own NIM-operator finalizer could no longer be processed. The CRD stayedTerminatingindefinitely, thenim-workloadnamespace stayedTerminating, and recovery required manualkubectl patch ... finalizers:nullon the CR.The pre-flight check exists exactly to prevent that scenario, but it didn't fire. Investigating:
check_release_for_stuck_crdsdiscovers CRDs by parsing the output ofhelm get manifest— which Helm restricts totemplates/content. The k8s-nim-operator, gpu-operator, kai-scheduler, etc. charts all ship their CRDs undercrds/(the Helm-recommended layout). Those CRDs are never visible tohelm get manifest, so their CR instances are never scanned during pre-flight, and the script proceeds to remove the operator that was the only thing capable of clearing those CRs' finalizers.Add a second discovery source: query CRDs annotated with
meta.helm.sh/release-namematching the release (andmeta.helm.sh/release-namespacematching the namespace). This is the same annotation the per-component cleanup loop in this file already filters by, and the same one #600 uses for the post-flight verification check. Bringing pre-flight into line means workload CRs get caught before the operator is removed, and the user is told (via the existing pre-flight error message) to delete them first.Fixes: N/A (no GitHub issue filed)
Related: #599 (cleanup-pipeline resilience for the same code path), #600 (post-flight verification using the same annotation filter)
Type of Change
Component(s) Affected
pkg/bundler,pkg/component/*)Implementation Notes
Single function rewrite in
pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl:Notes:
sort -u, so CRDs visible tohelm get manifest(e.g., a chart that declares its CRDs in templates/) are still scanned exactly once and existing behavior is preserved.|| true-guarded so a transient API failure during pre-flight doesn't false-negative the whole check.return 0ifhelm get manifestfails is intentional: it means the release isn't installed and there's nothing to check. Same guarantee as before.Testing
pkg/bundler/deployer/helm/...pass with-race.bash -non the generated script: clean.aicr-cuj2with a deliberately-applied NIMService is the next step (out of scope for this PR; will verify after merge).Risk Assessment
Rollout notes: Generated-script change only. No API surface, no migration. Worst-case failure mode is "pre-flight reports stuck CR for a CRD it would have ignored before" — which is the correct behavior; user follows the existing
kubectl delete <cr>remediation hint or--skip-preflightto override. Reverting is a one-commit revert.Checklist
make testwith-raceonpkg/bundler/...)make linthas pre-existing yamllint failures in untracked.codex-worktrees/;go vet+golangci-lintclean on the changed packagekubectl get crd, both runtime-only)meta.helm.sh/release-name/-namespacefilter is already used by the per-component cleanup loop and by feat(bundler): add Helm-annotation post-flight check to undeploy.sh #600's post-flight check)git commit -S)Summary by CodeRabbit