fix: undeploy.sh misses runtime-created CRDs, webhooks, and workload namespaces #474

Description

@yuanchen8911

Summary

undeploy.sh leaves stale cluster resources after a full undeploy, requiring manual cleanup before redeployment.

Stale resources observed

After running undeploy.sh --delete-pvcs on an EKS cluster with the h100-eks-ubuntu-inference-dynamo bundle:

Stale CRDs (had helm.sh/resource-policy: keep or were created by operators at runtime):

  • monitoring.coreos.com (11 CRDs) — kube-prometheus-stack resource policy
  • nvidia.com (clusterpolicies, nvidiadrivers) — gpu-operator resource policy
  • nfd.k8s-sigs.io (3 CRDs) — node feature discovery
  • grove.io / scheduler.grove.io (4 CRDs) — created by dynamo-platform operator
  • scheduling.run.ai (3 CRDs) — created by kai-scheduler operator
  • resource.nvidia.com (2 CRDs) — compute domain CRDs
  • jobset.x-k8s.io — created by conformance validator
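
To check a cluster for the same leftovers, something like the following could be used. This is a minimal sketch, not code from undeploy.sh: `STALE_CRD_GROUPS` and `filter_stale_crds` are hypothetical names, and the group list is copied from the bullets above. Matching is exact on `.spec.group`, so `nvidia.com` and `resource.nvidia.com` are treated as separate groups.

```shell
# Hypothetical helper (not part of undeploy.sh): API groups reported as stale
# in this issue, as a space-separated list.
STALE_CRD_GROUPS="monitoring.coreos.com nvidia.com nfd.k8s-sigs.io \
grove.io scheduler.grove.io scheduling.run.ai resource.nvidia.com jobset.x-k8s.io"

# Reads "NAME GROUP" pairs on stdin and prints the NAME of every CRD whose
# GROUP exactly matches one entry in STALE_CRD_GROUPS.
filter_stale_crds() {
  while read -r _name _group; do
    case " $STALE_CRD_GROUPS " in
      *" $_group "*) printf '%s\n' "$_name" ;;
    esac
  done
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get crd --no-headers \
#     -o custom-columns=NAME:.metadata.name,GROUP:.spec.group | filter_stale_crds
```

An empty result after undeploy.sh would indicate none of the groups above were left behind.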

Stale webhooks (created by conformance validator, not Helm-managed):

  • jobset-mutating-webhook-configuration
  • jobset-validating-webhook-configuration
  • validator.trainer.kubeflow.org

Stale namespaces (created manually or by tests):

  • dra-test
  • dynamo-workload

Stuck CRDs with finalizers: CRDs carrying the customresourcecleanup.apiextensions.k8s.io finalizer remain stuck in deletion because the finalizer can't be processed once the namespace that held their custom resources is already deleted.

Root causes

  1. ORPHANED_CRD_GROUPS only covers kai.scheduler and trainer.kubeflow.org. Missing: monitoring.coreos.com, nvidia.com, nfd.k8s-sigs.io, grove.io, scheduler.grove.io, scheduling.run.ai, resource.nvidia.com, jobset.x-k8s.io.

  2. delete_release_cluster_resources only matches resources with app.kubernetes.io/managed-by=Helm labels. Operator-created CRDs and validator-created webhooks don't carry these labels.

  3. Workload namespaces created outside the bundle (e.g., dynamo-workload for smoke tests) are not tracked by undeploy.sh.

  4. CRD deletion ordering — CRDs with customresourcecleanup finalizers need their CRs deleted first. If the CR namespace is deleted before the CRD, the finalizer can't be resolved, leaving the CRD stuck.
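
The ordering in root cause 4 could be sketched roughly as below: delete the custom resources, then the CRD, and only then the namespaces. This is an illustration under assumptions, not the current undeploy.sh logic. `delete_crd_with_instances` and `teardown_in_order` are hypothetical function names, the timeouts are arbitrary, and `KUBECTL` is an overridable variable added here so the sequence can be dry-run with `KUBECTL=echo`.

```shell
# Hypothetical sketch; KUBECTL can be overridden (e.g. KUBECTL=echo) for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Delete every instance of a CRD, then the CRD itself.
# $1 = full CRD name, e.g. clusterpolicies.nvidia.com
delete_crd_with_instances() {
  _crd="$1"
  # Custom resources first, in all namespaces, so the customresourcecleanup
  # finalizer has nothing left to wait on when the CRD is deleted.
  $KUBECTL delete "$_crd" --all --all-namespaces --ignore-not-found --timeout=60s || true
  $KUBECTL delete crd "$_crd" --ignore-not-found --timeout=60s
}

# $1 = space-separated CRD names, $2 = space-separated namespaces.
# Namespaces are removed only after every CRD in $1 has been handled.
teardown_in_order() {
  for _c in $1; do
    delete_crd_with_instances "$_c"
  done
  for _ns in $2; do
    $KUBECTL delete namespace "$_ns" --ignore-not-found --timeout=120s
  done
}

# Dry-run example:
#   KUBECTL=echo teardown_in_order "clusterpolicies.nvidia.com" "dynamo-workload"
```

Custom resources that carry their own finalizers may still block the first delete if the owning operator is already gone; this sketch does not handle that case.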

Proposed fix

  1. Expand ORPHANED_CRD_GROUPS to include all known operator-created CRD groups
  2. Add a post-uninstall sweep for webhooks referencing services in deleted namespaces (the delete_orphaned_webhooks_for_ns function exists but runs too early for some cases)
  3. Consider adding a --clean-all flag that force-removes all non-system CRDs and webhooks
  4. Document known workload namespaces that users should clean up before running undeploy.sh
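
For item 2, a post-uninstall sweep might look something like this. It is a sketch under assumptions and not the existing delete_orphaned_webhooks_for_ns implementation: `list_webhook_service_namespaces` and `find_orphaned_webhooks` are hypothetical names, only the first webhook entry of each configuration is inspected, and each configuration is assumed to have at least one entry. It only reports candidates; deletion is left to the caller.

```shell
# Hypothetical sketch; KUBECTL can be overridden for testing.
KUBECTL="${KUBECTL:-kubectl}"

# Prints "NAME SERVICE_NAMESPACE" for each webhook configuration of kind $1
# (validatingwebhookconfigurations or mutatingwebhookconfigurations).
# Simplification: only .webhooks[0] of each configuration is read.
list_webhook_service_namespaces() {
  $KUBECTL get "$1" -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.webhooks[0].clientConfig.service.namespace}{"\n"}{end}'
}

# Reads "NAME SERVICE_NAMESPACE" on stdin; $1 is a space-separated list of
# namespaces that still exist. Prints the names whose service namespace is
# no longer present. URL-based webhooks (no namespace field) are skipped.
find_orphaned_webhooks() {
  while read -r _name _ns; do
    [ -n "$_ns" ] || continue
    case " $1 " in
      *" $_ns "*) ;;                      # namespace still exists: keep
      *) printf '%s\n' "$_name" ;;        # namespace gone: report as orphaned
    esac
  done
}

# Usage after all releases and namespaces have been removed:
#   existing="$(kubectl get namespaces -o jsonpath='{.items[*].metadata.name}')"
#   list_webhook_service_namespaces validatingwebhookconfigurations \
#     | find_orphaned_webhooks "$existing"
```

Running the sweep once at the very end, after namespace deletion has finished, would avoid the timing problem described in item 2.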
