feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600
feat(bundler): add Helm-annotation post-flight check to undeploy.sh#600yuanchen8911 wants to merge 1 commit into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA new post-flight verification is added to the undeploy script template that scans all CRDs, filters those annotated with Helm release metadata ( Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~15-30 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
… just chart manifest The pre-flight finalizer check (check_release_for_stuck_crds) discovers CRDs by parsing `helm get manifest`. That only returns the chart's templates/ section — not the crds/ directory. Most operator charts in this bundle (k8s-nim-operator, gpu-operator, kai-scheduler, etc.) ship their CRDs under crds/ per Helm's recommended layout, so those CRDs are invisible to pre-flight today. Effect: workload CRs (e.g., NIMService instances created from demos/workloads/inference/nim*.yaml) are not scanned for finalizers before the operator is uninstalled. The operator's finalizer on the CR becomes orphaned, the CRD's customresourcecleanup finalizer waits forever, and the namespace stays Terminating. Recovery requires manual finalizer patching. Discover CRDs from a second source as well: any CRD annotated with meta.helm.sh/release-name matching the release. This catches crds/- installed CRDs and any CRD created by the release at runtime. The two sources are deduplicated before per-CRD scanning, so existing behavior is preserved and CRDs visible only via the manifest still get checked. The same annotation-based filter is already used by the per-component CRD cleanup loop and (in NVIDIA#600) by the post-flight verification check. This brings pre-flight into line.
Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by the same meta.helm.sh/release-name annotation used by the per-component CRD deletion loop, and surfaces any leftovers as a post-flight warning. Today the script's post-flight section only checks for terminating namespaces, unavailable API services, and ORPHANED_CRD_GROUPS-style operator-runtime CRDs. Helm-managed CRDs that survive the deletion loop go unnoticed until the next deploy.sh run rejects them. After PR NVIDIA#599 made the deletion pipelines warn-on- failure instead of script-killing, this check is the missing verification surface that confirms the warnings did not point to actual leftovers. The check mirrors the existing ORPHANED_CRD_GROUPS post-flight pattern: per- release kubectl|jq pipeline guarded with `|| true`, results accumulated into a helm_orphaned_crds variable, single warning emitted if non-empty. When cleanup succeeded normally the variable stays empty and the check is silent. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded from the rescan: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD still listed even though it is being cleaned up normally. Only truly stuck (not-yet-terminating) CRDs should be surfaced as a warning. Suggested by CodeRabbit review on PR NVIDIA#599; terminating-CRD filter added in response to review feedback on PR NVIDIA#600. Signed-off-by: Yuan Chen <[email protected]>
15d4aef to
381731f
Compare
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), NVIDIA#599, NVIDIA#600, NVIDIA#601 (superseded by this PR). Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across #599, #601, #600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was #599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was #601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was #600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: #477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Combines three related hardening changes to the generated undeploy.sh (previously split across NVIDIA#599, NVIDIA#601, NVIDIA#600). They share a file and a theme -- make undeploy recover from partial or transient failure -- so bundling them avoids three sequential PR rebases over the same template. 1. Transient-failure resilience (was NVIDIA#599). Three `kubectl | jq | while` pipelines in the post-uninstall cleanup path ran under `set -euo pipefail` with no fallback. A momentary control-plane 502 or auth refresh killed the whole script after Helm uninstalls succeeded, leaving orphan CRDs, webhooks, and `Active` namespaces. Wrap each pipeline's trailing `done` in `|| echo "Warning: ..." >&2` so the failure is visible but the script continues. Matches the `|| true` pattern already used in the adjacent ORPHANED_CRD_GROUPS loop. 2. Pre-flight CRD discovery via Helm annotation (was NVIDIA#601). check_release_for_stuck_crds sourced its CRD list only from `helm get manifest`, which returns CRDs under the chart's templates/ section but omits CRDs installed from the chart's crds/ directory (the Helm-recommended layout used by most operator charts). CRs backed by crds/-installed CRDs (e.g., NIMService) slipped past the finalizer pre-flight. Union both sources (manifest + annotated) and dedupe. Uses `awk 'NF'` instead of `grep -v '^$'` to drop empty lines without the grep-exits-1-on-no-match behavior, which would abort the script under pipefail when a release has no CRDs. 3. Post-flight Helm-annotation CRD check (was NVIDIA#600). Helm-managed CRDs that survive the per-release deletion loop went unnoticed until the next deploy.sh run rejected them. Add a verification pass at the end of undeploy.sh that re-enumerates CRDs filtered by meta.helm.sh/release-name + release-namespace annotations and surfaces any leftovers as a post-flight warning. CRDs already being deleted (non-null .metadata.deletionTimestamp) are excluded: the earlier `kubectl delete crd ... --wait=false` returns immediately, so a slow finalizer can leave the CRD listed though cleanup is normal. Only truly stuck (not-yet-terminating) CRDs are reported. Unit test TestUndeployScript_TransientFailureWarnsAndContinues generates a bundle, stubs kubectl to exit non-zero, sources each of the three helpers, and asserts the helper returns 0 and emits the expected Warning: on stderr. Fixes: N/A Related: NVIDIA#477 (introduced the per-component CRD cleanup pipeline), Signed-off-by: Yuan Chen <[email protected]>
Summary
Add a post-flight verification pass that re-enumerates CRDs filtered by the same
meta.helm.sh/release-nameannotation used during the per-component CRD deletion loop, and reports any leftovers as aWARNING(also flipspostflight_issues=true).Motivation / Context
Today the
undeploy.shpost-flight section checks for:ORPHANED_CRD_GROUPS(e.g.nvidia.com,nfd.k8s-sigs.io,kai.scheduler).Helm-managed CRDs that survive the per-release deletion loop are not surfaced anywhere — they go unnoticed until the next
deploy.shrun rejects them.After #599, the per-component CRD cleanup pipelines no longer kill the script on a transient kubectl/jq failure; they emit a
Warning:to stderr and continue. That makes a verification check at the end of the script the natural complement: confirm the warnings did not correspond to real leftovers, and warn (with deletion command) if they did.Suggested by CodeRabbit review on #599 (suggestion #2). I deferred it in #599 to keep that PR focused on the immediate
set -eexit fix; this PR is the dedicated follow-up.Fixes: N/A
Related: #599 (warn-on-failure resilience for the same pipelines), #477 (introduced the per-component cleanup loop)
Type of Change
Component(s) Affected
pkg/bundler,pkg/component/*)Implementation Notes
Single block added to
pkg/bundler/deployer/helm/templates/undeploy.sh.tmplimmediately after the existingORPHANED_CRD_GROUPSpost-flight check, mirroring its style:Notes:
kubectl|jqis|| true-guarded so a transient API failure during post-flight doesn't itself tripset -e.Testing
pkg/bundler/deployer/helm/...pass with-race.bash -non the generated script: clean.Risk Assessment
Rollout notes: Generated-script change only. No API surface, no migration. Failure mode is "warning printed where there shouldn't have been one" (false positive if a CRD legitimately keeps its Helm annotation post-uninstall — none of the bundle's components do this today). Reverting is a one-commit revert.
Checklist
make testwith-raceonpkg/bundler/...)make linthas pre-existing yamllint failures in untracked.codex-worktrees/;go vet+golangci-lintclean on the changed packageWARNING:block already documented)ORPHANED_CRD_GROUPSpost-flight check directly above)git commit -S)Summary by CodeRabbit