enhance(validator): add targeted post-deployment GPU readiness checks by yuanchen8911 · Pull Request #611 · NVIDIA/aicr

yuanchen8911 · 2026-04-18T00:07:54Z

Summary

Strengthen aicr validate --phase deployment by adding the targeted GPU-workload readiness signals the current validator misses — Skyhook customizations, NVIDIA DRA driver, and GPU Operator (ClusterPolicy) — on top of the existing namespace and expectedResources checks.

Concretely, this PR:

keeps using the existing deployment phase
preserves the existing baseline checks:
- enabled namespaces are Active
- declared expectedResources are healthy
adds the targeted GPU-specific readiness checks the phase was missing:
- skyhook-customizations: Skyhook CR completion
- nvidia-dra-driver-gpu: kubelet-plugin readiness
- gpu-operator: ClusterPolicy.status.state == ready

So the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.

Follow-up note: the broader registry-driven work to make deployment-phase validation symmetric and deep for all components is now tracked in #622. This PR remains intentionally scoped to the targeted GPU readiness gap narrowed in #607.

The check stays strictly deployer-neutral: no Helm API calls, no dependence on Argo CD or any other deployer, no reads of release metadata, no release-identity assumptions. Every lookup is a plain Kubernetes API call against live-cluster object shape, so the command can be used as a standalone tool to verify any Kubernetes cluster's runtime install regardless of how the bundle was deployed (deploy.sh, Argo CD, kapp, helmfile, plain kubectl apply). Validated end-to-end on a real EKS H100 cluster.

Motivation / Context

#607 narrowed this follow-up to reusing and enhancing the existing deployment validator path instead of adding a new phase, flag, or Chainsaw runtime dependency. The previous deployment phase verified install/resource existence, but it did not cover the late-converging readiness signals that explain the post-install gap on fresh GPU clusters.

Fixes: #607
Related:

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (validators/deployment)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/)
Other: E2E test harness (tests/e2e)

Implementation Notes

Deployer-neutrality stance

The deployment-phase check stays strictly deployer-agnostic:

No Helm API calls.
No reads of Helm release metadata (secrets, ConfigMaps, or helm list).
No dependence on release-scoped labels such as app.kubernetes.io/instance.

Every lookup is a plain Kubernetes API call against live-cluster object shape — labels, discovery, name conventions written into the chart templates themselves. The check is valid regardless of whether the bundle was deployed via deploy.sh (Helm), Argo CD (Helm or Kustomize), kapp, helmfile, or plain kubectl apply.

Checks added or extended

Enabled component namespaces must be Active.
Declared expectedResources (Deployments, DaemonSets, etc.) must exist and be healthy.
Skyhook readiness (scope-limited to the recipe's own declarations): the validator parses each ComponentRef.ManifestFiles of the skyhook-customizations component through the recipe data provider, extracts the metadata.name of every Skyhook resource declared, and verifies each one exists on the cluster with status.status == "complete". A narrow Helm-template-tolerant regex extractor is used because manifests contain Helm {{ ... }} directives that break a full YAML parse. Skyhook CRs present on the cluster that are not declared by this recipe (stale from prior deploys, other tenants) are ignored. Fails closed when an enabled skyhook-customizations ref declares no extractable Skyhook names (recipe misconfiguration).
ClusterPolicy readiness: Discovery on nvidia.com/v1 is used up-front to disambiguate three cases the dynamic-client Get conflates:
- CRD not registered (IsNotFound from discovery) → skip per #607 acceptance.
- CRD registered but cluster-policy CR absent → fail closed (gpu-operator installed but not reconciled).
- Any other discovery error (RBAC 403, API 5xx, network timeout) → fail closed with a diagnostic pointing at RBAC / API reachability, so a transient discovery failure cannot silently mask readiness.
DRA kubelet-plugin readiness (role-suffix discovery): the validator lists DaemonSets in the component namespace (timeout-bounded) and selects the one whose name ends in the upstream chart's hard-coded role suffix -kubelet-plugin. 0 matches → fail with the DaemonSets seen in the namespace. >1 matches → fail with the matched names. This is a chart-shape assumption (the upstream nvidia-dra-driver-gpu chart always names that DaemonSet <fullname>-kubelet-plugin), not a release-identity assumption — it survives fullnameOverride and non-Helm deployers. Readiness gate: both DesiredNumberScheduled > 0 and NumberReady == DesiredNumberScheduled.

Skyhook is a cluster-scoped CR; the name-based Get path is cluster-scoped by construction, avoiding the namespaced-List-against-cluster-scoped-kind 404 that an earlier approach would have hit.

Supporting changes

tests/e2e/run.sh seeds a fake ClusterPolicy CRD (no status subresource — deliberate, and commented) plus a ready cluster-policy CR so deployment-phase E2E exercises the strengthened validator behavior.
docs/user/cli-reference.md documents the expanded deployment phase: the full list of checks, the graceful-skip path when a CRD is absent, the fail-closed path for a missing CR with the CRD present, and the fail-closed path for non-NotFound discovery errors (with a concrete RBAC 403 example).

Testing

# Local package run + coverage
GOFLAGS="-mod=vendor" go test -race -coverprofile=cover.out ./validators/deployment/...
go tool cover -func=cover.out | tail -1

# Repo coverage gate + full local gate
make test-coverage
unset GITLAB_TOKEN && make qualify

go test -race ./validators/deployment/...: passed.
validators/deployment coverage: substantially improved vs origin/main (new tests span the full behavioral surface below).
make test-coverage: passed (above the 70% project-wide threshold).
unset GITLAB_TOKEN && make qualify: passed.

Added / updated regression coverage:

DRA kubelet-plugin discovery

Happy path with the default chart name.
Custom fullname-override name (e.g., my-custom-gpu-kubelet-plugin) still passes — proves the role-suffix approach is not coupled to the default name.
Two DaemonSets with matching suffix fails with an ambiguity diagnostic listing both names.
Unrelated DaemonSet in the same namespace (e.g., node-exporter) is ignored.
Missing DaemonSet fails with a diagnostic that lists the DaemonSets actually seen in the namespace (for debugging).
0/0 and under-scheduled DaemonSet fail the readiness gate.

Skyhook readiness

Cluster-scoped Get contract (regression against the earlier namespaced-List 404).
CRD not registered → skip.
Unrelated stale Skyhook CR on the cluster (even with the AICR label) does not affect the result when the expected CR is complete.
Multiple expected Skyhooks with mixed failures → all failures surface in one error (not first-return).
Enabled skyhook-customizations ref with no extractable Skyhook names → fail closed.
Direct unit tests for extractSkyhookNamesFromManifest, including the tuning-gke.yaml case (metadata.name is "tuning", not "tuning-gke"), Helm-template preamble tolerance, and templated-name skip.

ClusterPolicy readiness

CRD-missing skip (via discovery).
CRD present but CR missing → fail closed with "gpu-operator installed but CR missing?" diagnostic.
Discovery returns non-NotFound (RBAC 403) → fail closed.
Missing status.state → fail.

Other

Namespace-not-Active → fail.
TestMain forces the global recipe data provider to initialize before any t.Parallel() test calls recipe.GetManifestContent, working around a pre-existing race on lazy init in pkg/recipe/provider.go.

Live-cluster validation

Ran aicr validate --recipe recipe.yaml --phase deployment against a real EKS H100 cluster (aws-us-east-1-aicr-cuj2) with the dynamo-platform inference recipe:

phase=deployment status=passed validators=4 passed=4 failed=0 duration=26.824s

Validator	Status
`operator-health`	passed
`expected-resources`	passed — exercised namespace `Active` for 10 namespaces, Skyhook `tuning` `status.status==complete` (name-based from the recipe's manifestFiles), `ClusterPolicy.status.state==ready` (discovery-gated), and the DRA kubelet-plugin DaemonSet via `-kubelet-plugin` suffix match
`gpu-operator-version`	passed
`check-nvidia-smi`	passed — 2/2 GPU nodes usable

Exit code was 0; CTRF report archived locally.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: No migration or flag changes. This strengthens deployment-phase validation only.

Skyhook and ClusterPolicy checks skip only when the corresponding CRD is not registered.
Once the CRD exists, missing readiness objects are surfaced as real failures; non-NotFound discovery errors are never treated as skips.
DRA kubelet-plugin discovery works for default-named and fullname-override'd deployments alike, and is deployer-neutral (no Helm or release-identity dependency).
Skyhook readiness is scoped to the Skyhook CRs the recipe itself declares via ManifestFiles; unrelated Skyhooks on the cluster are ignored.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

coderabbitai · 2026-04-18T00:15:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

checkExpectedResources now errors when ctx.Clientset is nil and restricts work to enabled ComponentRefs. It adds readiness verifiers that run after resource existence checks: namespaces reach Active; Skyhook CRs (names extracted from enabled manifest files) exist and have status.status == "complete (CRD-gated); GPU Operator ClusterPolicy has status.state == "ready" (CRD-gated); and the DRA kubelet-plugin DaemonSet (discovered by -kubelet-plugin suffix) reports ready pods. A dynamic client is created via getDynamicClient for CR checks. New unit tests, e2e fixture setup/cleanup helpers, and CLI docs were added/updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'enhance(validator): add targeted post-deployment GPU readiness checks' directly and accurately reflects the main change: adding GPU readiness signal checks to the deployment validator.
Description check	✅ Passed	The description is comprehensive and clearly related to the changeset, detailing the new GPU readiness checks (Skyhook, DRA, ClusterPolicy), implementation approach, and testing results.
Linked Issues check	✅ Passed	All coding objectives from `#607` are met: deployment phase reuse, baseline check preservation, Skyhook/DRA/ClusterPolicy readiness additions, deployer-neutrality, and narrow scope implementation in validators/deployment.
Out of Scope Changes check	✅ Passed	All changes remain narrowly scoped to the stated objectives: deployment validator enhancements, E2E test fixture setup, and documentation updates; no extraneous modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@validators/deployment/expected_resources_test.go`:
- Around line 37-231: Add a negative test that ensures verifyNamespacesActive()
(exercised via checkExpectedResources) fails when a required namespace is
missing or not Active: create a new test (e.g.,
TestCheckExpectedResources_FailsWhenNamespaceNotActive) using
newDeploymentTestContext with a component referencing a namespace but include a
corresponding non-active namespace object (use an existing helper like
inactiveNamespace or omit activeNamespace) in the runtime.Objects, then call
checkExpectedResources(ctx) and assert it returns an error and that the error
message contains the namespace readiness context (e.g., "namespace <name> not
Active" or similar) following the pattern used in other tests (t.Fatal/t.Fatalf
when err == nil and strings.Contains checks).

In `@validators/deployment/expected_resources.go`:
- Around line 267-276: The DaemonSet health check in verifyDRAKubeletPluginReady
currently allows a 0/0 DaemonSet to pass; after the existing
helper.VerifyResource call, explicitly fetch the DaemonSet object (use
ctx.Clientset.AppsV1().DaemonSets(namespace).Get with ctx.Ctx and
draKubeletPluginName), then inspect ds.Status.DesiredNumberScheduled and
ds.Status.NumberReady; if DesiredNumberScheduled == 0 or NumberReady == 0 return
a non-nil error (include context mentioning draKubeletPluginName and namespace)
so the DRA readiness gate fails when no pods are scheduled or ready.
- Around line 297-309: The buildResourceFetcher() function currently passes
ctx.DynamicClient directly into chainsaw.NewClusterFetcher(), which can be nil
for test or manually-constructed validators.Context instances; change
buildResourceFetcher() to call getDynamicClient(ctx) and use its returned
dynamic.Interface (handling the error) when constructing the chainsaw.Fetcher so
it lazily initializes from ctx.RESTConfig like verifySkyhookReady() and
verifyClusterPolicyReady() do; reference the validators.Context,
getDynamicClient(), buildResourceFetcher(), ctx.DynamicClient, and
chainsaw.NewClusterFetcher() in your fix and propagate any returned error
instead of risking a nil client.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 3ee239b2-2554-4580-991d-6ec0d91176e2

📥 Commits

Reviewing files that changed from the base of the PR and between c12c783 and f5e3ee9.

📒 Files selected for processing (2)

validators/deployment/expected_resources.go
validators/deployment/expected_resources_test.go

yuanchen8911 · 2026-04-20T17:16:49Z

Related Slack discussion: https://nvidia.slack.com/archives/C0A457AAWUC/p1776444268331979?thread_ts=1776444020.536339&cid=C0A457AAWUC

yuanchen8911 · 2026-04-20T17:18:26Z

Note that the scope of Issue 607 and PR 611 I opened is intentionally narrow and not meant to address all concerns. The goal is to improve the current AICR validator after deployment.

@lockwobr will investigate more advanced ideas, such as the bridge-job (issue #516), along with @pwittrock's ADR, in parallel as a longer-term solution.

A few notes on the proposal in issue #607 and PR #611 implementation:

The validator is independent of the deployer
Each component provides its own readiness status
The validator performs a complete readiness check across all components, including Skyhook and GPU Operator

…istered Issue NVIDIA#607's acceptance criteria explicitly calls for the Skyhook check to "not fail on clusters where the Skyhook CRD is not registered ... the Skyhook check is skipped (or reported as N/A) in those cases." Before this commit, PR NVIDIA#611 still propagated the dynamic-client error (meta.NoKindMatchError or apierrors.NotFound) as an ErrCodeInternal failure, so a recipe that declared skyhook-customizations against a cluster where the operator's CRD had not yet been installed tripped the deployment phase. Apply the same relaxation to verifyClusterPolicyReady by symmetry: a recipe can list gpu-operator before the operator's ClusterPolicy CRD exists on the cluster, and that case should also be a skip, not a hard failure. Behavior summary: - If the CRD is not registered (meta.IsNoMatchError), print a short "CRD not registered, skipping" line and return nil. - For ClusterPolicy, also treat apierrors.IsNotFound as skip. The instance not existing means the operator has not reconciled yet or is not installed, which NVIDIA#607 groups with the CRD-missing path. - Any other error (transient API, auth, timeout) still fails closed with the original wrapping, so we do not mask real problems. Tests: - Extend the homegrown fake dynamic client with an "unregistered" GVR set: Resource() returns a sentinel client whose Get/List return meta.NoKindMatchError. This lets unit tests express the CRD-not-installed state without pulling in a scheme-backed fake. - Add TestCheckExpectedResources_SkipsSkyhookWhenCRDNotRegistered and TestCheckExpectedResources_SkipsClusterPolicyWhenCRDNotRegistered pinning the NVIDIA#607 acceptance behavior.

github-actions · 2026-04-20T17:57:47Z

🌿 Preview your docs: https://nvidia-preview-fix-deployment-phase-readiness.docs.buildwithfern.com/aicr

…g CR Reviewer observation on PR NVIDIA#611: verifyClusterPolicyReady previously treated two distinct situations identically, because both surfaced as IsNotFound from the dynamic Get: 1. CRD not registered — gpu-operator chart not installed. Legitimate skip per NVIDIA#607 (recipe declares the component, but the component isn't on the cluster yet). 2. CRD registered but cluster-policy CR absent — gpu-operator is installed, but its singleton CR was never created or was manually deleted. The operator cannot reconcile the GPU stack in that state, so this is a real misconfiguration. A bare Get cannot tell these apart (same 404). Use Discovery to disambiguate: ServerResourcesForGroupVersion("nvidia.com/v1") is the authoritative check for whether the CRD is installed. When discovery reports the group as present, any subsequent Get error — including IsNotFound on the CR itself — is a real failure and is surfaced with a "gpu-operator installed but CR missing?" diagnostic. Also: - verifySkyhookReady is intentionally NOT symmetric: for a List call, IsNotFound can only mean "kind not served", so it is safe to skip on there, and the "CRD present, zero CRs" case is handled distinctly by the explicit len(Items) == 0 failure path. Added a comment documenting this so a future refactor does not unify the two paths and silently mask the missing-Skyhook case. Tests: - Extend the fake clientset with a configured Discovery service: GroupVersions are auto-registered from dynamicObjects (minus any in `unregistered`), and a second helper lets a test declare extra registered GroupVersions explicitly (needed when we want discovery to succeed without seeding a CR). - Existing tests unchanged in intent; they now flow through the new auto-register path transparently. - New test TestCheckExpectedResources_FailsWhenClusterPolicyCRMissing pins the disambiguation behavior: CRD registered, CR absent → fail with the expected diagnostic. Docs: - docs/user/cli-reference.md clarifies that once a CRD is registered, a missing CR fails loudly (example: gpu-operator ClusterPolicy).

…d path Reviewer feedback on PR NVIDIA#611: the "Other errors (transient API, auth) still fail closed" line is accurate but too terse to help diagnosis. Call out a concrete case — an RBAC 403 on skyhooks.skyhook.nvidia.com is not a NoMatch and therefore surfaces as a failure rather than a silent skip — so a reader hitting it can recognize what happened.

Strengthen `aicr validate --phase deployment` so it verifies deployment completeness plus the targeted GPU readiness signals called out in issue NVIDIA#607, keeping the implementation in the existing pure-Go validators/deployment path rather than adding a new phase, flag, or Chainsaw runtime dependency. Checks added or extended: - Enabled component namespaces must be `Active`. - Declared `expectedResources` (Deployments, DaemonSets, etc.) must exist and be healthy. - When `skyhook-customizations` is enabled: every Skyhook CR the recipe declares (extracted from ComponentRef.ManifestFiles, parsed with a narrow Helm-template-tolerant regex extractor) must exist on the cluster and report `status.status == "complete"`. Unrelated Skyhook CRs on the cluster (stale from prior deploys, or from other tenants) are ignored. All non-complete expected Skyhooks surface in a single error rather than first-failure-only. - When `gpu-operator` is enabled: `ClusterPolicy.status.state == ready`. Discovery on `nvidia.com/v1` is used up-front to disambiguate three cases the dynamic-client Get conflates: * CRD not registered (`IsNotFound` from discovery): skip per NVIDIA#607. * CRD registered but `cluster-policy` CR absent: fail closed as a real gpu-operator misconfiguration. * Any other discovery error (RBAC 403, API 5xx, timeout): fail closed with a diagnostic pointing at RBAC / API reachability, so a transient discovery failure cannot silently pass the check. - When `nvidia-dra-driver-gpu` is enabled: exactly one DaemonSet whose name ends in the upstream chart's role suffix "-kubelet-plugin" must exist in the component namespace and be ready (both DesiredNumberScheduled > 0 and NumberReady == DesiredNumberScheduled). Discovery is by object-name shape, not by release-scoped labels — the check makes no Helm API calls and no release-identity assumptions, so it survives `fullnameOverride` and non-Helm deployers alike. Ambiguous (>1) and missing (0) results report the matched and seen DaemonSet names for debugging. Skyhook skip behavior matches NVIDIA#607's acceptance: if the Skyhook CRD is not registered, the check is a no-op rather than a failure. The implementation uses Discovery for the CRD-gate and then Get-by-name for each expected Skyhook, so "CRD absent" and "CRD present but expected CR absent" are two genuinely different code paths — the first skips, the second fails loudly. Cluster-scoped Skyhook: previous (dynamic-client) code issued a namespaced List on a cluster-scoped kind, which returned 404 even when the CR was present on a real cluster. The name-based Get path is cluster-scoped by construction and does not repeat that mistake. Tests: - Homegrown fake dynamic client gained resource-scope awareness (calls to `.Namespace(x)` on a cluster-scoped GVR return a 404-like error) and an explicit "unregistered GVR" set so List/Get can return a real `meta.NoKindMatchError`, mirroring a missing CRD. - Fake clientset Discovery service is auto-configured from the dynamicObjects seeded by each test, with an escape hatch (`newDeploymentTestContextWithDiscovery`) for tests that need a GroupVersion registered without any CR instance. The reactor used to inject non-NotFound discovery errors binds to the literal resource string "resource" — see the inline comment citing vendor/k8s.io/ client-go/discovery/fake/discovery.go for why "apiresources" would silently fail to match. - Direct unit tests for the Skyhook manifest name extractor, including the tuning-gke.yaml edge case (metadata.name is "tuning", not "tuning-gke") and Helm template preamble tolerance. - New DRA regression tests: custom fullname-override name, multiple matches fails with ambiguity, unrelated DaemonSet in the namespace is ignored, missing DaemonSet fails with a diagnostic that includes the DaemonSets actually seen in the namespace. - New Skyhook regression tests: stale labeled Skyhook unrelated to the recipe's declared names is ignored, multiple expected Skyhooks in non-complete states all surface in a single error, enabled skyhook-customizations ref with no extractable Skyhook names fails closed as a recipe misconfiguration, CRD-missing skip, discovery- error fail-closed, cluster-scoped List contract. - `TestMain` forces the global recipe data provider to initialize before any `t.Parallel()` test calls `recipe.GetManifestContent`, working around a pre-existing race on the package-level lazy initialization in `pkg/recipe/provider.go`. E2E: - `tests/e2e/run.sh` seeds a fake `ClusterPolicy` CRD (no status subresource, deliberately — see the inline comment above the CRD) plus a ready `cluster-policy` CR so deployment-phase E2E exercises the strengthened validator behavior. Docs: - `docs/user/cli-reference.md` documents the expanded deployment phase: the full list of checks, the graceful-skip path when a CRD is absent, and the fail-closed path for a missing CR when the CRD is present and for non-NotFound discovery errors (with a concrete RBAC 403 example). Fixes: NVIDIA#607

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Around line 596-606: Update the "Deployment phase checks" text to document the
additional fail-closed case for skyhook customizations: when a component with
skyhook-customizations enabled produces no extractable Skyhook resource names
from its ComponentRef.ManifestFiles, the validator does not skip but fails
deployment validation immediately with a clear "no extractable Skyhook names"
diagnostic (treat this as a recipe/configuration error, distinct from a
CRD-missing skip or a missing CR failure). Reference the skyhook-customizations
flag and ComponentRef.ManifestFiles so readers can locate the source of the
extraction and understand this specific failure path.

In `@tests/e2e/run.sh`:
- Around line 716-720: cleanup_fake_gpu_operator_fixture currently
unconditionally deletes cluster-scoped objects (cluster-policy and
clusterpolicies.nvidia.com); change the flow so the test only removes resources
it created: when you create the CRD or ClusterPolicy in the setup (the code that
creates clusterpolicies.nvidia.com and cluster-policy), set flags (e.g.,
CREATED_GPU_OPERATOR_CRD and CREATED_CLUSTER_POLICY) or record their creation
state, and then modify cleanup_fake_gpu_operator_fixture to delete those
resources only if the corresponding flag is true; alternatively, before
creating, detect existing resources and skip the test or mark them as
preexisting so cleanup does not remove them (refer to function
cleanup_fake_gpu_operator_fixture, the resource names cluster-policy and
clusterpolicies.nvidia.com).
- Around line 713-714: The kubectl wait command for deployment/gpu-operator
currently swallows failures via "|| true", so change the helper to fail when the
deployment never becomes Available: remove the "|| true" from the kubectl wait
invocation (or capture its exit code and explicitly return/exit non-zero), so
the function running the wait returns a non-zero status on timeout and callers
can skip or abort cleanly; update the kubectl wait invocation (the line invoking
"kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator
--timeout=60s") to propagate failures instead of hiding them.

In `@validators/deployment/expected_resources.go`:
- Around line 174-175: The Kubernetes API calls (e.g.,
ctx.Clientset.CoreV1().Namespaces().Get, discovery and dynamic-client calls)
must use a bounded context instead of ctx.Ctx to avoid stalling the whole
validator; create a short-lived context using the same pattern as
verifyDRAKubeletPluginReady (e.g., call
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a timed context, use
that context for the Get/Discovery/Dynamic calls, and ensure you Cancel/cleanup
the timed context), and replace direct uses of ctx.Ctx in those calls (locations
around the Namespace Get, the discovery calls, and the dynamic client usage)
with the new timed context.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 643460d4-e2a9-4d4d-97c0-94c575219b90

📥 Commits

Reviewing files that changed from the base of the PR and between f5e3ee9 and 0548c07.

📒 Files selected for processing (4)

docs/user/cli-reference.md
tests/e2e/run.sh
validators/deployment/expected_resources.go
validators/deployment/expected_resources_test.go

coderabbitai

♻️ Duplicate comments (5)

validators/deployment/expected_resources.go (3)

164-188: ⚠️ Potential issue | 🟠 Major

Wrap namespace lookup in a bounded context.

verifyNamespacesActive calls ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout. If the API server stalls, this probe can consume the entire deployment-validation budget and hang the validator job.

In contrast, verifyDRAKubeletPluginReady correctly uses ctx.Timeout(defaults.ResourceVerificationTimeout). For consistency and resilience, apply the same pattern here.

🔧 Suggested fix

 func verifyNamespacesActive(ctx *validators.Context, refs []recipe.ComponentRef) []string {
 	var failures []string
 	seen := make(map[string]bool, len(refs))

+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	for _, ref := range refs {
 		if ref.Namespace == "" || seen[ref.Namespace] {
 			continue
 		}
 		seen[ref.Namespace] = true

-		ns, err := ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ref.Namespace, metav1.GetOptions{})
+		ns, err := ctx.Clientset.CoreV1().Namespaces().Get(verifyCtx, ref.Namespace, metav1.GetOptions{})
 		if err != nil {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 164 - 188,
verifyNamespacesActive currently calls
ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap
the lookup in a bounded context like the other validator functions by calling
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and
cancel function, defer cancel(), and pass that bounded context into
Namespaces().Get; update verifyNamespacesActive to use the new ctx from
ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to
prevent hanging.

391-409: ⚠️ Potential issue | 🟠 Major

Add bounded context to ClusterPolicy readiness check.

Same issue as verifySkyhookReady: the dynamic Get at line 409 uses ctx.Ctx directly. Use a bounded context for the dynamic client call.

🔧 Suggested fix

 func verifyClusterPolicyReady(ctx *validators.Context) error {
+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	gv := clusterPolicyGVR.GroupVersion().String()
 	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)
 	// ... discovery switch ...

 	dynClient, err := getDynamicClient(ctx)
 	if err != nil {
 		return err
 	}

-	clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(ctx.Ctx, clusterPolicyName, metav1.GetOptions{})
+	clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(verifyCtx, clusterPolicyName, metav1.GetOptions{})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 391 - 409, The
dynamic Get call using ctx.Ctx for clusterPolicy retrieval is unbounded; create
a bounded context (e.g., boundedCtx, cancel := context.WithTimeout(ctx.Ctx,
defaultTimeout) / defer cancel()) and use boundedCtx in
dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx, ensuring you
import context and define/choose an appropriate timeout constant (or reuse an
existing one) and call cancel() to avoid leaking goroutines; adjust the code
paths around getDynamicClient, dynClient, clusterPolicyGVR and clusterPolicyName
accordingly.

249-273: ⚠️ Potential issue | 🟠 Major

Add bounded contexts to Skyhook readiness checks.

The discovery call (line 254) and dynamic Get calls (line 273) use ctx.Ctx directly without a timeout. Apply the same ctx.Timeout(defaults.ResourceVerificationTimeout) pattern used in verifyDRAKubeletPluginReady for consistency.

🔧 Suggested fix

 func verifySkyhookReady(ctx *validators.Context, ref recipe.ComponentRef) error {
 	expectedNames, err := expectedSkyhookNames(ref)
 	if err != nil {
 		return err
 	}
 	if len(expectedNames) == 0 {
 		return errors.New(errors.ErrCodeNotFound,
 			fmt.Sprintf("no Skyhook CR names could be extracted from component %s manifestFiles=%v",
 				ref.Name, ref.ManifestFiles))
 	}

+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	gv := skyhookGVR.GroupVersion().String()
-	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)
+	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)  // Note: Discovery doesn't take context
 	// ... rest of function should use verifyCtx for dynamic client calls

Note: The Discovery client doesn't accept a context parameter, but the dynamic client Get calls should use the bounded context:

-		sk, getErr := dynClient.Resource(skyhookGVR).Get(ctx.Ctx, name, metav1.GetOptions{})
+		sk, getErr := dynClient.Resource(skyhookGVR).Get(verifyCtx, name, metav1.GetOptions{})

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 249 - 273, The
discovery call ServerResourcesForGroupVersion cannot take a context, but the
dynamic client Get calls do need a bounded context like
verifyDRAKubeletPluginReady uses: create a bounded context via
ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx :=
ctx.Timeout(defaults.ResourceVerificationTimeout) and defer
boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx
(or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead
of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern
around the loop that calls getDynamicClient and dynClient.Resource(...).Get to
ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.

tests/e2e/run.sh (2)

716-720: ⚠️ Potential issue | 🟠 Major

Guard cleanup to avoid deleting pre-existing GPU Operator resources.

cleanup_fake_gpu_operator_fixture() unconditionally deletes the ClusterPolicy and CRD. If this script runs against a cluster with a real GPU Operator installed, teardown would delete production resources.

Consider tracking whether the fixture created these objects and only deleting them if so, or checking for pre-existing resources before setup and skipping the test entirely.

🔧 Suggested approach

+# Track what the fixture created
+CREATED_GPU_OPERATOR_CRD=false
+CREATED_CLUSTER_POLICY=false
+
 setup_fake_gpu_operator_fixture() {
+  # Check if CRD already exists (real gpu-operator installed)
+  if kubectl get crd clusterpolicies.nvidia.com &>/dev/null; then
+    echo "ClusterPolicy CRD already exists (real gpu-operator?), skipping fixture"
+    return 1
+  fi
+
   kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f - 2>&1 || true
   # ... CRD apply ...
+  CREATED_GPU_OPERATOR_CRD=true
   # ... ClusterPolicy apply ...
+  CREATED_CLUSTER_POLICY=true
   # ...
 }

 cleanup_fake_gpu_operator_fixture() {
   kubectl delete deployment gpu-operator -n gpu-operator --ignore-not-found 2>&1 || true
-  kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true
-  kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true
+  if [ "$CREATED_CLUSTER_POLICY" = true ]; then
+    kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true
+  fi
+  if [ "$CREATED_GPU_OPERATOR_CRD" = true ]; then
+    kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true
+  fi
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/run.sh` around lines 716 - 720, The cleanup function
cleanup_fake_gpu_operator_fixture unconditionally deletes ClusterPolicy/CRD
which can remove a real GPU Operator; modify setup and cleanup so the script
first checks for pre-existing resources (e.g., run kubectl get for deployment
gpu-operator, clusterpolicy cluster-policy, and crd clusterpolicies.nvidia.com)
and record whether the fixture created them (store a boolean flag or temporary
marker). In setup, if resources already exist, set the flag to skip teardown or
abort the test; in cleanup, only run the kubectl delete lines when the recorded
flag indicates the fixture created those specific objects. Ensure the
flag/marker is uniquely tied to this test run so production resources are never
deleted by mistake.

713-714: ⚠️ Potential issue | 🟠 Major

Propagate deployment availability wait failure.

The kubectl wait at line 713 uses || true, which swallows the failure. If the fake gpu-operator deployment never becomes Available, subsequent tests will run against a broken fixture and fail for unrelated reasons.

🔧 Suggested fix

-  kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1 || true
+  if ! kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1; then
+    return 1
+  fi
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/run.sh` around lines 713 - 714, The kubectl wait invocation in
run.sh currently appends "|| true" which swallows failures; remove that fallback
so the non‑zero exit from kubectl wait propagates (or explicitly check its exit
code and exit with a helpful error) to halt the test run when the gpu-operator
deployment (deployment/gpu-operator -n gpu-operator) never becomes Available;
update the block containing the kubectl wait call to either drop "|| true" or
add an explicit error check/exit so downstream tests don't run against a broken
fixture.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/e2e/run.sh`:
- Around line 716-720: The cleanup function cleanup_fake_gpu_operator_fixture
unconditionally deletes ClusterPolicy/CRD which can remove a real GPU Operator;
modify setup and cleanup so the script first checks for pre-existing resources
(e.g., run kubectl get for deployment gpu-operator, clusterpolicy
cluster-policy, and crd clusterpolicies.nvidia.com) and record whether the
fixture created them (store a boolean flag or temporary marker). In setup, if
resources already exist, set the flag to skip teardown or abort the test; in
cleanup, only run the kubectl delete lines when the recorded flag indicates the
fixture created those specific objects. Ensure the flag/marker is uniquely tied
to this test run so production resources are never deleted by mistake.
- Around line 713-714: The kubectl wait invocation in run.sh currently appends
"|| true" which swallows failures; remove that fallback so the non‑zero exit
from kubectl wait propagates (or explicitly check its exit code and exit with a
helpful error) to halt the test run when the gpu-operator deployment
(deployment/gpu-operator -n gpu-operator) never becomes Available; update the
block containing the kubectl wait call to either drop "|| true" or add an
explicit error check/exit so downstream tests don't run against a broken
fixture.

In `@validators/deployment/expected_resources.go`:
- Around line 164-188: verifyNamespacesActive currently calls
ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap
the lookup in a bounded context like the other validator functions by calling
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and
cancel function, defer cancel(), and pass that bounded context into
Namespaces().Get; update verifyNamespacesActive to use the new ctx from
ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to
prevent hanging.
- Around line 391-409: The dynamic Get call using ctx.Ctx for clusterPolicy
retrieval is unbounded; create a bounded context (e.g., boundedCtx, cancel :=
context.WithTimeout(ctx.Ctx, defaultTimeout) / defer cancel()) and use
boundedCtx in dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx,
ensuring you import context and define/choose an appropriate timeout constant
(or reuse an existing one) and call cancel() to avoid leaking goroutines; adjust
the code paths around getDynamicClient, dynClient, clusterPolicyGVR and
clusterPolicyName accordingly.
- Around line 249-273: The discovery call ServerResourcesForGroupVersion cannot
take a context, but the dynamic client Get calls do need a bounded context like
verifyDRAKubeletPluginReady uses: create a bounded context via
ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx :=
ctx.Timeout(defaults.ResourceVerificationTimeout) and defer
boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx
(or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead
of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern
around the loop that calls getDynamicClient and dynClient.Resource(...).Get to
ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1a00896b-3122-4a5c-a869-cddebd8ee65f

📥 Commits

Reviewing files that changed from the base of the PR and between 0548c07 and 5130bfe.

📒 Files selected for processing (4)

docs/user/cli-reference.md
tests/e2e/run.sh
validators/deployment/expected_resources.go
validators/deployment/expected_resources_test.go

mchmarny

Solid PR — the deployer-neutral design, discovery-gated CRD checks with fail-closed semantics, and scoped Skyhook readiness (by recipe manifest names, not cluster-wide label match) are well thought through. CI is fully green. Test coverage is thorough across the behavioral surface.

One medium finding: the test fake's Get method returns a bare errors.New("object not found") which doesn't satisfy apierrors.IsNotFound(), leaving the specific "recipe declared it but cluster has no such CR" branch in verifySkyhookReady untested. The rest are nits (test-only constants in production code, minor getDynamicClient caching opportunity).

Nothing blocks merge — the medium item is a test fidelity gap, not a production correctness issue.

yuanchen8911 added enhancement area/validator labels Apr 18, 2026

yuanchen8911 requested a review from a team as a code owner April 18, 2026 00:07

yuanchen8911 added the area/validator label Apr 18, 2026

github-actions Bot added size/XL and removed area/validator labels Apr 18, 2026

yuanchen8911 marked this pull request as draft April 18, 2026 00:11

yuanchen8911 changed the title ~~enhance(validator): strengthen deployment readiness checks~~ (WIP) enhance(validator): strengthen deployment readiness checks Apr 18, 2026

yuanchen8911 mentioned this pull request Apr 18, 2026

Enhance deployment-phase validation to reflect post-install workload readiness #607

Closed

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

Comment thread validators/deployment/expected_resources_test.go

Comment thread validators/deployment/expected_resources.go Outdated

Comment thread validators/deployment/expected_resources.go

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from f5e3ee9 to fbca793 Compare April 18, 2026 01:52

github-actions Bot added the area/tests label Apr 18, 2026

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from fbca793 to e51b9be Compare April 18, 2026 02:02

github-actions Bot added the area/bundler label Apr 18, 2026

yuanchen8911 changed the title ~~(WIP) enhance(validator): strengthen deployment readiness checks~~ (WIP) enhance(validator): enhance deployment readiness checks Apr 18, 2026

yuanchen8911 mentioned this pull request Apr 18, 2026

chore(infra): disable CodeRabbit docstring check #612

Merged

25 tasks

yuanchen8911 changed the title ~~(WIP) enhance(validator): enhance deployment readiness checks~~ enhance(validator): enhance deployment readiness checks Apr 20, 2026

yuanchen8911 added the area/validator label Apr 20, 2026

github-actions Bot added area/docs and removed area/validator labels Apr 20, 2026

yuanchen8911 added the area/validator label Apr 20, 2026

github-actions Bot removed the area/validator label Apr 20, 2026

yuanchen8911 added the area/validator label Apr 20, 2026

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from b8a5103 to 899df5e Compare April 20, 2026 18:30

github-actions Bot removed the area/validator label Apr 20, 2026

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 899df5e to 2a2baaa Compare April 20, 2026 19:13

yuanchen8911 added the area/validator label Apr 20, 2026

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 2a2baaa to 0548c07 Compare April 20, 2026 19:22

github-actions Bot removed the area/validator label Apr 20, 2026

yuanchen8911 changed the title ~~enhance(validator): enhance deployment readiness checks~~ enhance(validator): enhance deployment readiness checks (WIP) Apr 20, 2026

yuanchen8911 changed the title ~~enhance(validator): enhance deployment readiness checks (WIP)~~ enhance(validator): add targeted GPU readiness checks Apr 20, 2026

yuanchen8911 mentioned this pull request Apr 20, 2026

Add registry-driven deployment readiness for all components #622

Closed

yuanchen8911 marked this pull request as ready for review April 20, 2026 20:21

yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 0548c07 to 5130bfe Compare April 20, 2026 20:28

coderabbitai Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread docs/user/cli-reference.md

Comment thread tests/e2e/run.sh Outdated

Comment thread tests/e2e/run.sh

Comment thread validators/deployment/expected_resources.go Outdated

coderabbitai Bot reviewed Apr 20, 2026

View reviewed changes

yuanchen8911 changed the title ~~enhance(validator): add targeted GPU readiness checks~~ enhance(validator): add targeted post-deployment GPU readiness checks Apr 20, 2026

fix deployment readiness follow-up review comments

c08cbe7

yuanchen8911 requested review from lockwobr and mchmarny April 20, 2026 20:46

yuanchen8911 assigned ayuskauskas Apr 20, 2026

mchmarny reviewed Apr 20, 2026

View reviewed changes

test skyhook not-found readiness branch

066fad5

yuanchen8911 requested review from ayuskauskas and mchmarny April 20, 2026 21:16

polish deployment readiness review nits

102b11b

mchmarny approved these changes Apr 20, 2026

View reviewed changes

mchmarny merged commit 17d42f9 into NVIDIA:main Apr 20, 2026
33 checks passed

yuanchen8911 mentioned this pull request Apr 23, 2026

[EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660

Closed

5 tasks

This was referenced Jun 8, 2026

Ship chainsaw binary + wire deployment-phase runner (#660 PR 2) #1220

Closed

feat(validator): ship chainsaw binary; activate deployment-phase runner #1235

Merged

Uh oh!

Conversation

yuanchen8911 commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Deployer-neutrality stance

Checks added or extended

Supporting changes

Testing

Live-cluster validation

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yuanchen8911 commented Apr 20, 2026

Uh oh!

yuanchen8911 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 20, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yuanchen8911 commented Apr 18, 2026 •

edited

Loading

coderabbitai Bot commented Apr 18, 2026 •

edited

Loading

yuanchen8911 commented Apr 20, 2026 •

edited

Loading