Skip to content

Enhance deployment-phase validation to reflect post-install workload readiness #607

Description

@yuanchen8911

Note: this issue has been revised from its original version based on the discussions in the comments. The original scope (deploy.sh messaging/docs fix) is now handled separately in #609.

Follow-up note: the broader registry-driven work to make deployment-phase validation symmetric and deep for all components is now tracked in #622. This issue remains intentionally scoped to strengthening the existing deployment phase for deployment completeness plus the targeted GPU readiness signals below.

Summary

We already have most of the machinery for a deployer-neutral readiness check in aicr validate. The gap is not a missing standalone tool or a missing new phase; the gap is that the current deployment phase is still narrower than post-install workload readiness.

The proposal is to strengthen the existing deployment phase so that:

aicr validate --recipe recipe.yaml --phase deployment

answers the practical post-install question users care about: is this cluster actually ready for GPU workloads now?

Concretely, this issue is about:

  • keep using the existing deployment phase
  • preserve the existing baseline checks:
    • enabled namespaces are Active
    • declared expectedResources are healthy
  • add the targeted GPU-specific readiness checks the phase was missing:
    • skyhook-customizations: Skyhook CR completion
    • nvidia-dra-driver-gpu: kubelet-plugin readiness
    • gpu-operator: ClusterPolicy.status.state == ready

So the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.

Problem

Today, deployment validation is closer to:

  • components were installed
  • some expected resources exist
  • basic operator/resource checks passed

That is still weaker than real post-install workload readiness for GPU bundles.

For example, the cluster may still be converging after install while:

  • Skyhook tuning is still running
  • the DRA kubelet plugin is not fully ready
  • GPU components are present but not yet usable end-to-end

So the missing piece is not a new validator binary or a new CLI entrypoint. The missing piece is that the existing deployment phase does not yet include the specific readiness checks needed for post-install workload readiness.

Scope Assumption

The scope of --phase deployment here is:

  • deployment completeness for all enabled components, and
  • the specific post-install GPU readiness signals needed to close the current gap

It is not intended to become:

  • a generic per-component health-check aggregator
  • a full end-to-end workload smoke test
  • a replacement for broader conformance validation

Approach

Keep the existing deployment phase. Do not add a new phase, a new flag, or a new tool for now.

Instead, strengthen the existing deployment phase so that:

aicr validate --recipe recipe.yaml --phase deployment

serves as the practical post-install workload readiness check.

The implementation target is:

  • reuse the current deployment checks
  • add deployment-completeness checks for enabled components
    • namespaces exist and are Active
    • declared expectedResources exist and are healthy
  • add the missing GPU readiness checks
    • skyhook-customizations
      • require Skyhook CR completion (status.status == complete)
      • scope the check to only the Skyhook CRs this recipe declares (names extracted from the recipe's own ComponentRef.ManifestFiles), so unrelated Skyhooks on the cluster (stale from prior deploys, or from other tenants) do not widen the failure surface
      • skip gracefully when the Skyhook CRD is not registered on the cluster
    • nvidia-dra-driver-gpu
      • require the kubelet plugin DaemonSet to be ready, discovered deployer-neutrally by the chart's role-suffix convention (name ending in -kubelet-plugin) rather than by a hard-coded object name or release-scoped label
    • gpu-operator
      • require stronger operator readiness (ClusterPolicy.status.state == ready), not just deployment existence

All checks stay deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.

Because this reuses the existing CLI, day-N re-verification after scale-up or other runtime changes is also covered: run the same command again.

Options Considered

1. Reuse recipes/checks/* directly in deployment via Chainsaw

Pros:

  • maximum reuse of existing component health checks
  • one source of truth for per-component readiness

Cons:

  • most of the richer checks are in Chainsaw Test format, for example skyhook-customizations health check
  • the normal deployment validator image does not currently ship the chainsaw binary
  • making this work would require validator image/build/release pipeline changes, not just validator logic changes
  • it broadens deployment toward a general component-readiness framework, which is more than we need for this gap

2. Convert the needed deployment checks to raw-resource Chainsaw assertions

Pros:

  • avoids shipping the external chainsaw binary
  • still reuses part of the current check model

Cons:

  • current checks are not written in that simpler format
  • creates a second, reduced version of the same health checks
  • still adds another assertion path to deployment for a relatively small targeted gap

3. Strengthen deployment with a small pure-Go implementation

Pros:

  • consistent with the current Go-based deployment validator implementation in validators/deployment
  • keeps the runtime path pure Go / Kubernetes client based
  • no validator image/runtime dependency changes
  • directly addresses the actual gap in a narrow way

Cons:

  • does not make deployment a generic component-readiness aggregator
  • broader per-component health checks still remain in conformance / recipes/checks/*

Rationale

We are choosing option 3.

This is the smallest deployer-neutral change that closes the real gap:

  • it preserves the current aicr validate --phase deployment user flow
  • it keeps the implementation consistent with the existing Go-based validator path
  • it avoids pulling Chainsaw into the normal deployment validator runtime and build pipeline
  • it gives us what we actually need now:
    • deployment completeness for enabled components
    • plus the specific Skyhook, DRA, and GPU operator readiness checks that deploy.sh currently misses

It also keeps the user flow simple:

aicr validate --recipe recipe.yaml --phase deployment

instead of introducing a new phase or a new separate tool, and keeps the design deployer-neutral:

  • no extra deploy.sh readiness framework
  • no deployer-specific wait logic in bash
  • no new standalone readiness binary

Broader per-component Chainsaw health checks can remain in conformance or be revisited later if we decide deployment should grow into a more general readiness framework.

Expected implementation size is small: a narrow change in the existing deployment validator package plus targeted tests, not a new validator subsystem.

Out of Scope

  • Adding a new validation phase name just for readiness
  • Reintroducing a deploy.sh --wait-ready implementation here
  • Pulling the full Chainsaw-based component health-check model into the normal deployment validator path
  • Broader bridge-Job / cleanup-Job architecture work discussed alongside #516

Related

  • Messaging/docs fix: #609
  • Undeploy hardening / complexity lesson: #602
  • Registry-driven all-components readiness follow-up: #622
  • Longer-term chart-based bundle architecture: #516

Acceptance

  • aicr validate --recipe recipe.yaml --phase deployment covers deployment completeness for enabled components
  • deployment validation includes Skyhook completion, DRA plugin readiness, and stronger gpu-operator readiness where the recipe enables those components
  • on a fresh GPU cluster where install has completed but Skyhook / driver convergence is still in progress, aicr validate --phase deployment fails during that window and passes after convergence completes
  • aicr validate --phase deployment does not fail on clusters where the Skyhook CRD is not registered, or on recipes that do not enable skyhook-customizations; the Skyhook check is skipped (or reported as N/A) in those cases
  • aicr validate --phase deployment scopes Skyhook readiness to the Skyhook CR names this recipe declares (extracted from ComponentRef.ManifestFiles). Unrelated Skyhook CRs on the cluster (stale from previous deploys, or owned by another tenant) do not affect the result. An enabled skyhook-customizations ref with no extractable Skyhook names fails closed as a recipe misconfiguration.
  • aicr validate --phase deployment applies the same skip/fail-closed logic to gpu-operator: when the nvidia.com/v1 CRD is not registered, the ClusterPolicy check is skipped; when it is registered but the cluster-policy CR is absent, validation fails as a real gpu-operator misconfiguration. Non-NotFound discovery errors (RBAC denial, API server unreachable, transient 5xx) always fail closed rather than being treated as a missing CRD — a broken cluster must never silently pass this check.
  • aicr validate --phase deployment locates the DRA kubelet-plugin DaemonSet by the upstream chart's role-suffix convention (-kubelet-plugin), not by a hard-coded object name or a release-scoped label. Custom fullnameOverride values and non-Helm deployers continue to be validated correctly; ambiguity (>1 match) and absence (0 matches) surface as explicit failures that include the DaemonSet names involved.
  • the deployment phase stays strictly deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.
  • covered by unit tests for the deployment validator and existing live-cluster verification flows (for example CUJ validation), without introducing a new phase

Metadata

Metadata

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions