Enhance deployment-phase validation to reflect post-install workload readiness

> Note: this issue has been revised from its original version based on the discussions in the comments. The original scope (deploy.sh messaging/docs fix) is now handled separately in [#609](https://github.com/NVIDIA/aicr/pull/609).
>
> Follow-up note: the broader registry-driven work to make deployment-phase validation symmetric and deep for **all** components is now tracked in [#622](https://github.com/NVIDIA/aicr/issues/622). This issue remains intentionally scoped to strengthening the existing deployment phase for deployment completeness plus the targeted GPU readiness signals below.

## Summary

We already have most of the machinery for a deployer-neutral readiness check in `aicr validate`. The gap is not a missing standalone tool or a missing new phase; the gap is that the current `deployment` phase is still narrower than **post-install workload readiness**.

The proposal is to strengthen the existing `deployment` phase so that:

```bash
aicr validate --recipe recipe.yaml --phase deployment
```

answers the practical post-install question users care about: **is this cluster actually ready for GPU workloads now?**

Concretely, this issue is about:
- keep using the existing deployment phase
- preserve the existing baseline checks:
  - enabled namespaces are `Active`
  - declared `expectedResources` are healthy
- add the targeted GPU-specific readiness checks the phase was missing:
  - `skyhook-customizations`: Skyhook CR completion
  - `nvidia-dra-driver-gpu`: kubelet-plugin readiness
  - `gpu-operator`: `ClusterPolicy.status.state == ready`

So the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.

## Problem

Today, `deployment` validation is closer to:
- components were installed
- some expected resources exist
- basic operator/resource checks passed

That is still weaker than real post-install workload readiness for GPU bundles.

For example, the cluster may still be converging after install while:
- Skyhook tuning is still running
- the DRA kubelet plugin is not fully ready
- GPU components are present but not yet usable end-to-end

So the missing piece is not a new validator binary or a new CLI entrypoint. The missing piece is that the existing `deployment` phase does not yet include the specific readiness checks needed for post-install workload readiness.

## Scope Assumption

The scope of `--phase deployment` here is:
- deployment completeness for all enabled components, and
- the specific post-install GPU readiness signals needed to close the current gap

It is **not** intended to become:
- a generic per-component health-check aggregator
- a full end-to-end workload smoke test
- a replacement for broader conformance validation

## Approach

Keep the existing `deployment` phase. Do **not** add a new phase, a new flag, or a new tool for now.

Instead, strengthen the existing deployment phase so that:

```bash
aicr validate --recipe recipe.yaml --phase deployment
```

serves as the practical post-install workload readiness check.

The implementation target is:

- reuse the current deployment checks
- add deployment-completeness checks for enabled components
  - namespaces exist and are `Active`
  - declared `expectedResources` exist and are healthy
- add the missing GPU readiness checks
  - `skyhook-customizations`
    - require Skyhook CR completion (`status.status == complete`)
    - scope the check to **only the Skyhook CRs this recipe declares** (names extracted from the recipe's own `ComponentRef.ManifestFiles`), so unrelated Skyhooks on the cluster (stale from prior deploys, or from other tenants) do not widen the failure surface
    - skip gracefully when the Skyhook CRD is not registered on the cluster
  - `nvidia-dra-driver-gpu`
    - require the kubelet plugin DaemonSet to be ready, discovered deployer-neutrally by the chart's role-suffix convention (name ending in `-kubelet-plugin`) rather than by a hard-coded object name or release-scoped label
  - `gpu-operator`
    - require stronger operator readiness (`ClusterPolicy.status.state == ready`), not just deployment existence

All checks stay deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.

Because this reuses the existing CLI, day-N re-verification after scale-up or other runtime changes is also covered: run the same command again.

## Options Considered

### 1. Reuse `recipes/checks/*` directly in deployment via Chainsaw

Pros:
- maximum reuse of existing component health checks
- one source of truth for per-component readiness

Cons:
- most of the richer checks are in Chainsaw `Test` format, for example [`skyhook-customizations` health check](https://github.com/NVIDIA/aicr/blob/main/recipes/checks/skyhook-customizations/health-check.yaml)
- the normal deployment validator image does not currently ship the `chainsaw` binary
- making this work would require validator image/build/release pipeline changes, not just validator logic changes
- it broadens `deployment` toward a general component-readiness framework, which is more than we need for this gap

### 2. Convert the needed deployment checks to raw-resource Chainsaw assertions

Pros:
- avoids shipping the external `chainsaw` binary
- still reuses part of the current check model

Cons:
- current checks are not written in that simpler format
- creates a second, reduced version of the same health checks
- still adds another assertion path to deployment for a relatively small targeted gap

### 3. Strengthen `deployment` with a small pure-Go implementation

Pros:
- consistent with the current Go-based deployment validator implementation in [`validators/deployment`](https://github.com/NVIDIA/aicr/blob/main/validators/deployment/main.go)
- keeps the runtime path pure Go / Kubernetes client based
- no validator image/runtime dependency changes
- directly addresses the actual gap in a narrow way

Cons:
- does not make deployment a generic component-readiness aggregator
- broader per-component health checks still remain in conformance / `recipes/checks/*`

## Rationale

We are choosing option 3.

This is the smallest deployer-neutral change that closes the real gap:
- it preserves the current `aicr validate --phase deployment` user flow
- it keeps the implementation consistent with the existing Go-based validator path
- it avoids pulling Chainsaw into the normal deployment validator runtime and build pipeline
- it gives us what we actually need now:
  - deployment completeness for enabled components
  - plus the specific Skyhook, DRA, and GPU operator readiness checks that `deploy.sh` currently misses

It also keeps the user flow simple:

```bash
aicr validate --recipe recipe.yaml --phase deployment
```

instead of introducing a new phase or a new separate tool, and keeps the design deployer-neutral:
- no extra `deploy.sh` readiness framework
- no deployer-specific wait logic in bash
- no new standalone readiness binary

Broader per-component Chainsaw health checks can remain in conformance or be revisited later if we decide `deployment` should grow into a more general readiness framework.

Expected implementation size is small: a narrow change in the existing deployment validator package plus targeted tests, not a new validator subsystem.

## Out of Scope

- Adding a new validation phase name just for readiness
- Reintroducing a `deploy.sh --wait-ready` implementation here
- Pulling the full Chainsaw-based component health-check model into the normal deployment validator path
- Broader bridge-Job / cleanup-Job architecture work discussed alongside [#516](https://github.com/NVIDIA/aicr/issues/516)

## Related

- Messaging/docs fix: [#609](https://github.com/NVIDIA/aicr/pull/609)
- Undeploy hardening / complexity lesson: [#602](https://github.com/NVIDIA/aicr/pull/602)
- Registry-driven all-components readiness follow-up: [#622](https://github.com/NVIDIA/aicr/issues/622)
- Longer-term chart-based bundle architecture: [#516](https://github.com/NVIDIA/aicr/issues/516)

## Acceptance

- `aicr validate --recipe recipe.yaml --phase deployment` covers deployment completeness for enabled components
- deployment validation includes Skyhook completion, DRA plugin readiness, and stronger `gpu-operator` readiness where the recipe enables those components
- on a fresh GPU cluster where install has completed but Skyhook / driver convergence is still in progress, `aicr validate --phase deployment` fails during that window and passes after convergence completes
- `aicr validate --phase deployment` does not fail on clusters where the Skyhook CRD is not registered, or on recipes that do not enable `skyhook-customizations`; the Skyhook check is skipped (or reported as N/A) in those cases
- `aicr validate --phase deployment` scopes Skyhook readiness to the Skyhook CR names this recipe declares (extracted from `ComponentRef.ManifestFiles`). Unrelated Skyhook CRs on the cluster (stale from previous deploys, or owned by another tenant) do not affect the result. An enabled `skyhook-customizations` ref with no extractable Skyhook names fails closed as a recipe misconfiguration.
- `aicr validate --phase deployment` applies the same skip/fail-closed logic to `gpu-operator`: when the `nvidia.com/v1` CRD is not registered, the `ClusterPolicy` check is skipped; when it is registered but the `cluster-policy` CR is absent, validation fails as a real gpu-operator misconfiguration. Non-`NotFound` discovery errors (RBAC denial, API server unreachable, transient 5xx) always fail closed rather than being treated as a missing CRD — a broken cluster must never silently pass this check.
- `aicr validate --phase deployment` locates the DRA kubelet-plugin DaemonSet by the upstream chart's role-suffix convention (`-kubelet-plugin`), not by a hard-coded object name or a release-scoped label. Custom `fullnameOverride` values and non-Helm deployers continue to be validated correctly; ambiguity (>1 match) and absence (0 matches) surface as explicit failures that include the DaemonSet names involved.
- the deployment phase stays strictly deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.
- covered by unit tests for the deployment validator and existing live-cluster verification flows (for example CUJ validation), without introducing a new phase



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enhance deployment-phase validation to reflect post-install workload readiness #607

Summary

Problem

Scope Assumption

Approach

Options Considered

1. Reuse `recipes/checks/*` directly in deployment via Chainsaw

2. Convert the needed deployment checks to raw-resource Chainsaw assertions

3. Strengthen `deployment` with a small pure-Go implementation

Rationale

Out of Scope

Related

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Enhance deployment-phase validation to reflect post-install workload readiness #607

Description

Summary

Problem

Scope Assumption

Approach

Options Considered

1. Reuse recipes/checks/* directly in deployment via Chainsaw

2. Convert the needed deployment checks to raw-resource Chainsaw assertions

3. Strengthen deployment with a small pure-Go implementation

Rationale

Out of Scope

Related

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Reuse `recipes/checks/*` directly in deployment via Chainsaw

3. Strengthen `deployment` with a small pure-Go implementation