Note: this issue has been revised from its original version based on the discussions in the comments. The original scope (deploy.sh messaging/docs fix) is now handled separately in #609.
Follow-up note: the broader registry-driven work to make deployment-phase validation symmetric and deep for all components is now tracked in #622. This issue remains intentionally scoped to strengthening the existing deployment phase for deployment completeness plus the targeted GPU readiness signals below.
Summary
We already have most of the machinery for a deployer-neutral readiness check in aicr validate. The gap is not a missing standalone tool or a missing new phase; the gap is that the current deployment phase is still narrower than post-install workload readiness.
The proposal is to strengthen the existing deployment phase so that:
aicr validate --recipe recipe.yaml --phase deployment
answers the practical post-install question users care about: is this cluster actually ready for GPU workloads now?
Concretely, this issue is about:
- keep using the existing deployment phase
- preserve the existing baseline checks:
- enabled namespaces are
Active
- declared
expectedResources are healthy
- add the targeted GPU-specific readiness checks the phase was missing:
skyhook-customizations: Skyhook CR completion
nvidia-dra-driver-gpu: kubelet-plugin readiness
gpu-operator: ClusterPolicy.status.state == ready
So the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.
Problem
Today, deployment validation is closer to:
- components were installed
- some expected resources exist
- basic operator/resource checks passed
That is still weaker than real post-install workload readiness for GPU bundles.
For example, the cluster may still be converging after install while:
- Skyhook tuning is still running
- the DRA kubelet plugin is not fully ready
- GPU components are present but not yet usable end-to-end
So the missing piece is not a new validator binary or a new CLI entrypoint. The missing piece is that the existing deployment phase does not yet include the specific readiness checks needed for post-install workload readiness.
Scope Assumption
The scope of --phase deployment here is:
- deployment completeness for all enabled components, and
- the specific post-install GPU readiness signals needed to close the current gap
It is not intended to become:
- a generic per-component health-check aggregator
- a full end-to-end workload smoke test
- a replacement for broader conformance validation
Approach
Keep the existing deployment phase. Do not add a new phase, a new flag, or a new tool for now.
Instead, strengthen the existing deployment phase so that:
aicr validate --recipe recipe.yaml --phase deployment
serves as the practical post-install workload readiness check.
The implementation target is:
- reuse the current deployment checks
- add deployment-completeness checks for enabled components
- namespaces exist and are
Active
- declared
expectedResources exist and are healthy
- add the missing GPU readiness checks
skyhook-customizations
- require Skyhook CR completion (
status.status == complete)
- scope the check to only the Skyhook CRs this recipe declares (names extracted from the recipe's own
ComponentRef.ManifestFiles), so unrelated Skyhooks on the cluster (stale from prior deploys, or from other tenants) do not widen the failure surface
- skip gracefully when the Skyhook CRD is not registered on the cluster
nvidia-dra-driver-gpu
- require the kubelet plugin DaemonSet to be ready, discovered deployer-neutrally by the chart's role-suffix convention (name ending in
-kubelet-plugin) rather than by a hard-coded object name or release-scoped label
gpu-operator
- require stronger operator readiness (
ClusterPolicy.status.state == ready), not just deployment existence
All checks stay deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.
Because this reuses the existing CLI, day-N re-verification after scale-up or other runtime changes is also covered: run the same command again.
Options Considered
1. Reuse recipes/checks/* directly in deployment via Chainsaw
Pros:
- maximum reuse of existing component health checks
- one source of truth for per-component readiness
Cons:
- most of the richer checks are in Chainsaw
Test format, for example skyhook-customizations health check
- the normal deployment validator image does not currently ship the
chainsaw binary
- making this work would require validator image/build/release pipeline changes, not just validator logic changes
- it broadens
deployment toward a general component-readiness framework, which is more than we need for this gap
2. Convert the needed deployment checks to raw-resource Chainsaw assertions
Pros:
- avoids shipping the external
chainsaw binary
- still reuses part of the current check model
Cons:
- current checks are not written in that simpler format
- creates a second, reduced version of the same health checks
- still adds another assertion path to deployment for a relatively small targeted gap
3. Strengthen deployment with a small pure-Go implementation
Pros:
- consistent with the current Go-based deployment validator implementation in
validators/deployment
- keeps the runtime path pure Go / Kubernetes client based
- no validator image/runtime dependency changes
- directly addresses the actual gap in a narrow way
Cons:
- does not make deployment a generic component-readiness aggregator
- broader per-component health checks still remain in conformance /
recipes/checks/*
Rationale
We are choosing option 3.
This is the smallest deployer-neutral change that closes the real gap:
- it preserves the current
aicr validate --phase deployment user flow
- it keeps the implementation consistent with the existing Go-based validator path
- it avoids pulling Chainsaw into the normal deployment validator runtime and build pipeline
- it gives us what we actually need now:
- deployment completeness for enabled components
- plus the specific Skyhook, DRA, and GPU operator readiness checks that
deploy.sh currently misses
It also keeps the user flow simple:
aicr validate --recipe recipe.yaml --phase deployment
instead of introducing a new phase or a new separate tool, and keeps the design deployer-neutral:
- no extra
deploy.sh readiness framework
- no deployer-specific wait logic in bash
- no new standalone readiness binary
Broader per-component Chainsaw health checks can remain in conformance or be revisited later if we decide deployment should grow into a more general readiness framework.
Expected implementation size is small: a narrow change in the existing deployment validator package plus targeted tests, not a new validator subsystem.
Out of Scope
- Adding a new validation phase name just for readiness
- Reintroducing a
deploy.sh --wait-ready implementation here
- Pulling the full Chainsaw-based component health-check model into the normal deployment validator path
- Broader bridge-Job / cleanup-Job architecture work discussed alongside #516
Related
- Messaging/docs fix: #609
- Undeploy hardening / complexity lesson: #602
- Registry-driven all-components readiness follow-up: #622
- Longer-term chart-based bundle architecture: #516
Acceptance
aicr validate --recipe recipe.yaml --phase deployment covers deployment completeness for enabled components
- deployment validation includes Skyhook completion, DRA plugin readiness, and stronger
gpu-operator readiness where the recipe enables those components
- on a fresh GPU cluster where install has completed but Skyhook / driver convergence is still in progress,
aicr validate --phase deployment fails during that window and passes after convergence completes
aicr validate --phase deployment does not fail on clusters where the Skyhook CRD is not registered, or on recipes that do not enable skyhook-customizations; the Skyhook check is skipped (or reported as N/A) in those cases
aicr validate --phase deployment scopes Skyhook readiness to the Skyhook CR names this recipe declares (extracted from ComponentRef.ManifestFiles). Unrelated Skyhook CRs on the cluster (stale from previous deploys, or owned by another tenant) do not affect the result. An enabled skyhook-customizations ref with no extractable Skyhook names fails closed as a recipe misconfiguration.
aicr validate --phase deployment applies the same skip/fail-closed logic to gpu-operator: when the nvidia.com/v1 CRD is not registered, the ClusterPolicy check is skipped; when it is registered but the cluster-policy CR is absent, validation fails as a real gpu-operator misconfiguration. Non-NotFound discovery errors (RBAC denial, API server unreachable, transient 5xx) always fail closed rather than being treated as a missing CRD — a broken cluster must never silently pass this check.
aicr validate --phase deployment locates the DRA kubelet-plugin DaemonSet by the upstream chart's role-suffix convention (-kubelet-plugin), not by a hard-coded object name or a release-scoped label. Custom fullnameOverride values and non-Helm deployers continue to be validated correctly; ambiguity (>1 match) and absence (0 matches) surface as explicit failures that include the DaemonSet names involved.
- the deployment phase stays strictly deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.
- covered by unit tests for the deployment validator and existing live-cluster verification flows (for example CUJ validation), without introducing a new phase
Summary
We already have most of the machinery for a deployer-neutral readiness check in
aicr validate. The gap is not a missing standalone tool or a missing new phase; the gap is that the currentdeploymentphase is still narrower than post-install workload readiness.The proposal is to strengthen the existing
deploymentphase so that:answers the practical post-install question users care about: is this cluster actually ready for GPU workloads now?
Concretely, this issue is about:
ActiveexpectedResourcesare healthyskyhook-customizations: Skyhook CR completionnvidia-dra-driver-gpu: kubelet-plugin readinessgpu-operator:ClusterPolicy.status.state == readySo the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.
Problem
Today,
deploymentvalidation is closer to:That is still weaker than real post-install workload readiness for GPU bundles.
For example, the cluster may still be converging after install while:
So the missing piece is not a new validator binary or a new CLI entrypoint. The missing piece is that the existing
deploymentphase does not yet include the specific readiness checks needed for post-install workload readiness.Scope Assumption
The scope of
--phase deploymenthere is:It is not intended to become:
Approach
Keep the existing
deploymentphase. Do not add a new phase, a new flag, or a new tool for now.Instead, strengthen the existing deployment phase so that:
serves as the practical post-install workload readiness check.
The implementation target is:
ActiveexpectedResourcesexist and are healthyskyhook-customizationsstatus.status == complete)ComponentRef.ManifestFiles), so unrelated Skyhooks on the cluster (stale from prior deploys, or from other tenants) do not widen the failure surfacenvidia-dra-driver-gpu-kubelet-plugin) rather than by a hard-coded object name or release-scoped labelgpu-operatorClusterPolicy.status.state == ready), not just deployment existenceAll checks stay deployer-neutral: no Helm API calls, no reads of release metadata, no dependence on release-scoped labels. Every lookup is a plain Kubernetes API call against live-cluster object shape.
Because this reuses the existing CLI, day-N re-verification after scale-up or other runtime changes is also covered: run the same command again.
Options Considered
1. Reuse
recipes/checks/*directly in deployment via ChainsawPros:
Cons:
Testformat, for exampleskyhook-customizationshealth checkchainsawbinarydeploymenttoward a general component-readiness framework, which is more than we need for this gap2. Convert the needed deployment checks to raw-resource Chainsaw assertions
Pros:
chainsawbinaryCons:
3. Strengthen
deploymentwith a small pure-Go implementationPros:
validators/deploymentCons:
recipes/checks/*Rationale
We are choosing option 3.
This is the smallest deployer-neutral change that closes the real gap:
aicr validate --phase deploymentuser flowdeploy.shcurrently missesIt also keeps the user flow simple:
instead of introducing a new phase or a new separate tool, and keeps the design deployer-neutral:
deploy.shreadiness frameworkBroader per-component Chainsaw health checks can remain in conformance or be revisited later if we decide
deploymentshould grow into a more general readiness framework.Expected implementation size is small: a narrow change in the existing deployment validator package plus targeted tests, not a new validator subsystem.
Out of Scope
deploy.sh --wait-readyimplementation hereRelated
Acceptance
aicr validate --recipe recipe.yaml --phase deploymentcovers deployment completeness for enabled componentsgpu-operatorreadiness where the recipe enables those componentsaicr validate --phase deploymentfails during that window and passes after convergence completesaicr validate --phase deploymentdoes not fail on clusters where the Skyhook CRD is not registered, or on recipes that do not enableskyhook-customizations; the Skyhook check is skipped (or reported as N/A) in those casesaicr validate --phase deploymentscopes Skyhook readiness to the Skyhook CR names this recipe declares (extracted fromComponentRef.ManifestFiles). Unrelated Skyhook CRs on the cluster (stale from previous deploys, or owned by another tenant) do not affect the result. An enabledskyhook-customizationsref with no extractable Skyhook names fails closed as a recipe misconfiguration.aicr validate --phase deploymentapplies the same skip/fail-closed logic togpu-operator: when thenvidia.com/v1CRD is not registered, theClusterPolicycheck is skipped; when it is registered but thecluster-policyCR is absent, validation fails as a real gpu-operator misconfiguration. Non-NotFounddiscovery errors (RBAC denial, API server unreachable, transient 5xx) always fail closed rather than being treated as a missing CRD — a broken cluster must never silently pass this check.aicr validate --phase deploymentlocates the DRA kubelet-plugin DaemonSet by the upstream chart's role-suffix convention (-kubelet-plugin), not by a hard-coded object name or a release-scoped label. CustomfullnameOverridevalues and non-Helm deployers continue to be validated correctly; ambiguity (>1 match) and absence (0 matches) surface as explicit failures that include the DaemonSet names involved.