Summary
undeploy.sh leaves stale cluster resources after a full undeploy, requiring manual cleanup before redeployment.
Stale resources observed
After running undeploy.sh --delete-pvcs on an EKS cluster with the h100-eks-ubuntu-inference-dynamo bundle:
Stale CRDs (had helm.sh/resource-policy: keep or were created by operators at runtime):
- monitoring.coreos.com (11 CRDs): kube-prometheus-stack resource policy
- nvidia.com (clusterpolicies, nvidiadrivers): gpu-operator resource policy
- nfd.k8s-sigs.io (3 CRDs): node feature discovery
- grove.io / scheduler.grove.io (4 CRDs): created by dynamo-platform operator
- scheduling.run.ai (3 CRDs): created by kai-scheduler operator
- resource.nvidia.com (2 CRDs): compute domain CRDs
- jobset.x-k8s.io: created by conformance validator
Stale webhooks (created by conformance validator, not Helm-managed):
- jobset-mutating-webhook-configuration
- jobset-validating-webhook-configuration
- validator.trainer.kubeflow.org
Stale namespaces (created manually or by tests):
- dra-test
- dynamo-workload
Stuck CRDs with finalizers: CRDs carrying the customresourcecleanup.apiextensions.k8s.io finalizer that cannot be processed because the namespace holding their custom resources has already been deleted.
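To reproduce the per-group counts above on another cluster, a small helper along these lines may be useful. It is only a sketch: count_crds_by_group is a hypothetical name, not a function in undeploy.sh, and it assumes CRD names arrive on stdin in the <plural>.<group> form that kubectl get crd -o name prints.

```shell
# Hypothetical helper (not part of undeploy.sh).
# stdin:  CRD names, one per line, optionally prefixed as printed by
#         `kubectl get crd -o name`
# stdout: "<count> <api-group>" lines, sorted by group name
count_crds_by_group() {
  # Drop the resource-type prefix, then drop the plural name (everything
  # up to the first dot) so that only the API group remains.
  sed -e 's|^customresourcedefinition\.apiextensions\.k8s\.io/||' \
      -e 's/^[^.]*\.//' \
    | sort | uniq -c | awk '{print $1, $2}'
}

# Possible usage after an undeploy:
#   kubectl get crd -o name | count_crds_by_group
```

Any group that still shows a non-zero count after undeploy.sh finishes is a candidate for the stale list.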
Root causes
- ORPHANED_CRD_GROUPS only covers kai.scheduler and trainer.kubeflow.org. Missing: monitoring.coreos.com, nvidia.com, nfd.k8s-sigs.io, grove.io, scheduler.grove.io, scheduling.run.ai, resource.nvidia.com, jobset.x-k8s.io.
- delete_release_cluster_resources only matches resources with app.kubernetes.io/managed-by=Helm labels. Operator-created CRDs and validator-created webhooks don't carry these labels.
- Workload namespaces created outside the bundle (e.g., dynamo-workload for smoke tests) are not tracked by undeploy.sh.
- CRD deletion ordering: CRDs with customresourcecleanup finalizers need their CRs deleted first. If the CR namespace is deleted before the CRD, the finalizer can't be resolved, leaving the CRD stuck.
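The ordering problem in the last item could be avoided with something like the sketch below, which deletes every instance of a CRD before deleting the CRD itself and would need to run before any namespace deletion. delete_crd_with_instances and the KUBECTL override are hypothetical names introduced here for illustration, and the 60s timeout is an arbitrary choice. It does not handle custom resources that carry their own finalizers.

```shell
# Hypothetical helper (not part of undeploy.sh).
# KUBECTL can be overridden, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# $1: full CRD name in <plural>.<group> form, e.g. clusterpolicies.nvidia.com
delete_crd_with_instances() {
  local crd="$1"
  # Step 1: remove all custom resources of this kind in every namespace,
  # so the customresourcecleanup finalizer has nothing left to wait for.
  $KUBECTL delete "$crd" --all --all-namespaces --ignore-not-found --timeout=60s \
    || return 1
  # Step 2: only then delete the CRD itself.
  $KUBECTL delete crd "$crd" --ignore-not-found --timeout=60s
}
```

Running it with KUBECTL=echo prints the two commands in order without touching the cluster, which is a cheap way to review what would be deleted.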
Proposed fix
- Expand ORPHANED_CRD_GROUPS to include all known operator-created CRD groups
- Add a post-uninstall sweep for webhooks referencing services in deleted namespaces (the delete_orphaned_webhooks_for_ns function exists but runs too early for some cases)
- Consider adding a --clean-all flag that force-removes all non-system CRDs and webhooks
- Document known workload namespaces that users should clean up before running undeploy.sh
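A rough sketch of the first two items follows. It makes several assumptions: ORPHANED_CRD_GROUPS is treated here as a whitespace-separated string matched exactly against the API group (the real variable in undeploy.sh may be a bash array or be matched differently), and filter_orphaned_crds and find_orphaned_webhooks are hypothetical helpers, not existing functions. Both only read stdin and print names, so the actual deletion stays with the caller.

```shell
# Assumed representation: whitespace-separated API groups, exact match.
ORPHANED_CRD_GROUPS="kai.scheduler trainer.kubeflow.org
  monitoring.coreos.com nvidia.com nfd.k8s-sigs.io grove.io
  scheduler.grove.io scheduling.run.ai resource.nvidia.com jobset.x-k8s.io"

# stdin:  CRD names (<plural>.<group>), one per line
# stdout: the names whose API group is listed in ORPHANED_CRD_GROUPS
filter_orphaned_crds() {
  local name group g
  while IFS= read -r name; do
    name="${name#customresourcedefinition.apiextensions.k8s.io/}"
    group="${name#*.}"            # strip the plural, keep the group
    for g in $ORPHANED_CRD_GROUPS; do
      if [ "$group" = "$g" ]; then
        printf '%s\n' "$name"
        break
      fi
    done
  done
}

# stdin:  "<webhook-config-name> <service-namespace>..." one per line
# $1:     whitespace-separated list of namespaces that still exist
# stdout: webhook configurations pointing at a namespace that is gone
find_orphaned_webhooks() {
  local existing=" $1 " name namespaces ns
  while read -r name namespaces; do
    for ns in $namespaces; do     # URL-only webhooks have no namespace
      case "$existing" in
        *" $ns "*) ;;             # namespace still present, keep looking
        *) printf '%s\n' "$name"; break ;;
      esac
    done
  done
}

# Possible usage at the very end of the undeploy, after namespaces are gone:
#   kubectl get crd -o name | filter_orphaned_crds
#   kubectl get validatingwebhookconfigurations \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.webhooks[*].clientConfig.service.namespace}{"\n"}{end}' \
#     | find_orphaned_webhooks "$(kubectl get ns -o jsonpath='{.items[*].metadata.name}')"
```

Matching on the API group rather than on Helm labels is what lets this catch operator-created CRDs, and checking namespace existence at the end of the run sidesteps the timing issue noted for delete_orphaned_webhooks_for_ns.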