fix: undeploy.sh misses runtime-created CRDs, webhooks, and workload namespaces

## Summary

`undeploy.sh` leaves stale cluster resources after a full undeploy, requiring manual cleanup before redeployment.

## Stale resources observed

After running `undeploy.sh --delete-pvcs` on an EKS cluster with the `h100-eks-ubuntu-inference-dynamo` bundle:

**Stale CRDs** (had `helm.sh/resource-policy: keep` or were created by operators at runtime):
- `monitoring.coreos.com` (11 CRDs) — kube-prometheus-stack resource policy
- `nvidia.com` (clusterpolicies, nvidiadrivers) — gpu-operator resource policy
- `nfd.k8s-sigs.io` (3 CRDs) — node feature discovery
- `grove.io` / `scheduler.grove.io` (4 CRDs) — created by dynamo-platform operator
- `scheduling.run.ai` (3 CRDs) — created by kai-scheduler operator
- `resource.nvidia.com` (2 CRDs) — compute domain CRDs
- `jobset.x-k8s.io` — created by conformance validator

**Stale webhooks** (created by conformance validator, not Helm-managed):
- `jobset-mutating-webhook-configuration`
- `jobset-validating-webhook-configuration`
- `validator.trainer.kubeflow.org`

**Stale namespaces** (created manually or by tests):
- `dra-test`
- `dynamo-workload`

**Stuck CRDs with finalizers** — CRDs with `customresourcecleanup.apiextensions.k8s.io` finalizer that can't be processed because the owning namespace is already deleted.
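
One way to spot these before redeploying is to list CRDs that already have a deletion timestamp but still carry the cleanup finalizer. The sketch below is a diagnostic only and is not part of `undeploy.sh`. The function names are hypothetical, and it assumes `kubectl` prints `<none>` for unset custom-columns fields.

```shell
#!/usr/bin/env bash
# Diagnostic sketch (hypothetical helper names, not from undeploy.sh).

# Reads "NAME DELETION_TIMESTAMP FINALIZERS" lines on stdin and prints the
# names of CRDs that are marked for deletion but still hold the finalizer.
filter_stuck_crds() {
  awk -v fin="customresourcecleanup.apiextensions.k8s.io" '
    $2 != "<none>" && index($3, fin) > 0 { print $1 }
  '
}

# Queries the cluster and pipes the result through the filter above.
list_stuck_crds() {
  kubectl get crd --no-headers \
    -o 'custom-columns=NAME:.metadata.name,DELETED:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers[*]' \
    | filter_stuck_crds
}
```

Running `list_stuck_crds` after `undeploy.sh` finishes should print nothing on a clean cluster; any name it prints is a CRD still waiting on its finalizer.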

## Root causes

1. **`ORPHANED_CRD_GROUPS`** only covers `kai.scheduler` and `trainer.kubeflow.org`. Missing: `monitoring.coreos.com`, `nvidia.com`, `nfd.k8s-sigs.io`, `grove.io`, `scheduler.grove.io`, `scheduling.run.ai`, `resource.nvidia.com`, `jobset.x-k8s.io`.

2. **`delete_release_cluster_resources`** only matches resources that carry the `app.kubernetes.io/managed-by=Helm` label. Operator-created CRDs and validator-created webhooks don't carry this label, so they are never selected.

3. **Workload namespaces** created outside the bundle (e.g., `dynamo-workload` for smoke tests) are not tracked by `undeploy.sh`.

4. **CRD deletion ordering**: CRDs with the `customresourcecleanup.apiextensions.k8s.io` finalizer need their CRs deleted first. If the namespace holding those CRs is deleted before the CRD, the finalizer can't be resolved, leaving the CRD stuck.
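
The ordering in item 4 can be sketched as: delete the custom resources, then the CRD, and only then the namespaces. This is a minimal illustration, not the current `undeploy.sh` logic. The function names, the `--` argument separator, and the 120s timeout are assumptions chosen for the example.

```shell
#!/usr/bin/env bash
# Ordering sketch (hypothetical helpers, not from undeploy.sh).

# Delete every instance of a CRD, then the CRD itself.
# $1 is the full CRD name, e.g. "<plural>.<group>".
delete_crd_with_instances() {
  local crd="$1"
  # 1. Remove the custom resources while their namespaces still exist.
  kubectl delete "$crd" --all --all-namespaces \
    --ignore-not-found --timeout=120s || return 1
  # 2. With no instances left, the cleanup finalizer has nothing to wait on.
  kubectl delete crd "$crd" --ignore-not-found --timeout=120s
}

# Usage: delete_crds_then_namespaces <crd>... -- <namespace>...
# Arguments before "--" are CRD names; arguments after it are namespaces.
delete_crds_then_namespaces() {
  local phase=crd arg
  for arg in "$@"; do
    if [ "$arg" = "--" ]; then
      phase=ns
      continue
    fi
    if [ "$phase" = "crd" ]; then
      delete_crd_with_instances "$arg" || return 1
    else
      # 3. Namespaces go last, after every CRD has been removed.
      kubectl delete namespace "$arg" --ignore-not-found --timeout=120s || return 1
    fi
  done
}
```

Because all CRD arguments come before `--`, every CRD is fully removed before the first namespace deletion starts.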

## Proposed fix

1. Expand `ORPHANED_CRD_GROUPS` to include all known operator-created CRD groups
2. Add a post-uninstall sweep for webhooks referencing services in deleted namespaces (the `delete_orphaned_webhooks_for_ns` function exists but runs too early for some cases)
3. Consider adding a `--clean-all` flag that force-removes all non-system CRDs and webhooks
4. Document known workload namespaces that users should clean up before running `undeploy.sh`
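
As a rough illustration of item 1, the expanded group list and a matching sweep could look like the sketch below. How `undeploy.sh` actually declares and consumes `ORPHANED_CRD_GROUPS` is not shown in this issue, so the space-separated string, the exact match on `spec.group`, and the helper names are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: the variable's real type and the helper names are assumed.

# Existing entries first, followed by the groups observed as stale above.
ORPHANED_CRD_GROUPS="kai.scheduler trainer.kubeflow.org \
monitoring.coreos.com nvidia.com nfd.k8s-sigs.io \
grove.io scheduler.grove.io scheduling.run.ai \
resource.nvidia.com jobset.x-k8s.io"

# Reads "NAME GROUP" lines on stdin and prints the names of CRDs whose
# group exactly equals one of the listed groups.
filter_orphaned_crds() {
  awk -v groups="$ORPHANED_CRD_GROUPS" '
    BEGIN {
      n = split(groups, g, " ")
      for (i = 1; i <= n; i++) want[g[i]] = 1
    }
    ($2 in want) { print $1 }
  '
}

# Lists CRDs in the cluster that belong to one of the orphaned groups.
list_orphaned_crds() {
  kubectl get crd --no-headers \
    -o custom-columns=NAME:.metadata.name,GROUP:.spec.group \
    | filter_orphaned_crds
}
```

Matching on `spec.group` rather than on a name suffix keeps `nvidia.com` from also selecting unrelated `*.nvidia.com` groups, which is why `resource.nvidia.com` is listed separately.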
