Skip to content

fix(bundler): clean up orphaned KAI and Kubeflow Trainer CRDs on undeploy#416

Merged
yuanchen8911 merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-orphaned-crds
Mar 16, 2026
Merged

fix(bundler): clean up orphaned KAI and Kubeflow Trainer CRDs on undeploy#416
yuanchen8911 merged 2 commits into
NVIDIA:mainfrom
yuanchen8911:fix/undeploy-orphaned-crds

Conversation

@yuanchen8911

Copy link
Copy Markdown
Contributor

Summary

Fix undeploy.sh to clean up orphaned CRDs from KAI scheduler and Kubeflow Trainer that survive Helm uninstall.

Motivation / Context

After undeploy.sh completes, 6 orphaned CRDs remain on the cluster:

  • configs.kai.scheduler, schedulingshards.kai.scheduler, topologies.kai.scheduler — created by the KAI operator at runtime, not by the Helm chart
  • trainjobs.trainer.kubeflow.org, trainingruntimes.trainer.kubeflow.org, clustertrainingruntimes.trainer.kubeflow.org — installed by the validator during conformance/performance tests

Neither set carries Helm labels, so the existing delete_release_cluster_resources label-based sweep misses them.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Bundlers (pkg/bundler, pkg/component/*)

Implementation Notes

  • Added explicit CRD cleanup in undeploy.sh.tmpl after namespace deletion, targeting kai.scheduler and trainer.kubeflow.org API groups
  • Added orphaned CRD detection in post-flight verification
  • Approach uses API group matching (not hardcoded CRD names) so new CRDs in these groups are automatically covered

Testing

make test  # All bundler tests pass

Verified generated undeploy.sh includes the new cleanup steps.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Existing bundles won't have the fix — only newly generated bundles. Users with stale CRDs can manually delete them with kubectl delete crd <name>.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix — the orphaned CRD problem is real and the API group matching approach is the right design. Two inline comments: one on CRD deletion ordering relative to namespace teardown, one on grep precision. Clean change overall.

Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated
Comment thread pkg/bundler/deployer/helm/templates/undeploy.sh.tmpl Outdated

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CRD deletion ordering is the only real thing

…ploy

KAI scheduler CRDs (configs, schedulingshards, topologies) are created
by the KAI operator at runtime, not by the Helm chart. Kubeflow Trainer
CRDs are installed by the validator during conformance/performance tests.
Neither carries Helm labels, so the existing label-based CRD sweep in
delete_release_cluster_resources misses them.

- Add explicit CRD cleanup before namespace deletion (while controllers
  still run) with --wait=false to avoid finalizer hangs
- Anchor grep patterns with $ to prevent false matches
- Add orphaned CRD detection in undeploy post-flight verification
- Add orphaned CRD warning in deploy pre-flight checks

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/undeploy-orphaned-crds branch from 5a48c98 to 849a6f0 Compare March 16, 2026 21:57
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Thanks Mark — both addressed:

# Issue Fix
1 CRD cleanup runs after namespace deletion — controllers already gone, finalizers can't be processed Moved CRD cleanup before namespace deletion loop; added --wait=false as safety net
2 grep "\.${group}" not anchored — could false-match CRDs like foo.xkai.scheduler.io Anchored with $ in all three locations (undeploy cleanup, undeploy post-flight, deploy pre-flight)

@github-actions

Copy link
Copy Markdown
Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 15, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants