Skip to content

enhance(validator): add targeted post-deployment GPU readiness checks#611

Merged
mchmarny merged 4 commits into
NVIDIA:mainfrom
yuanchen8911:fix/deployment-phase-readiness
Apr 20, 2026
Merged

enhance(validator): add targeted post-deployment GPU readiness checks#611
mchmarny merged 4 commits into
NVIDIA:mainfrom
yuanchen8911:fix/deployment-phase-readiness

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Strengthen aicr validate --phase deployment by adding the targeted GPU-workload readiness signals the current validator misses — Skyhook customizations, NVIDIA DRA driver, and GPU Operator (ClusterPolicy) — on top of the existing namespace and expectedResources checks.

Concretely, this PR:

  • keeps using the existing deployment phase
  • preserves the existing baseline checks:
    • enabled namespaces are Active
    • declared expectedResources are healthy
  • adds the targeted GPU-specific readiness checks the phase was missing:
    • skyhook-customizations: Skyhook CR completion
    • nvidia-dra-driver-gpu: kubelet-plugin readiness
    • gpu-operator: ClusterPolicy.status.state == ready

So the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.

Follow-up note: the broader registry-driven work to make deployment-phase validation symmetric and deep for all components is now tracked in #622. This PR remains intentionally scoped to the targeted GPU readiness gap narrowed in #607.

The check stays strictly deployer-neutral: no Helm API calls, no dependence on Argo CD or any other deployer, no reads of release metadata, no release-identity assumptions. Every lookup is a plain Kubernetes API call against live-cluster object shape, so the command can be used as a standalone tool to verify any Kubernetes cluster's runtime install regardless of how the bundle was deployed (deploy.sh, Argo CD, kapp, helmfile, plain kubectl apply). Validated end-to-end on a real EKS H100 cluster.

Motivation / Context

#607 narrowed this follow-up to reusing and enhancing the existing deployment validator path instead of adding a new phase, flag, or Chainsaw runtime dependency. The previous deployment phase verified install/resource existence, but it did not cover the late-converging readiness signals that explain the post-install gap on fresh GPU clusters.

Fixes: #607
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (validators/deployment)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/)
  • Other: E2E test harness (tests/e2e)

Implementation Notes

Deployer-neutrality stance

The deployment-phase check stays strictly deployer-agnostic:

  • No Helm API calls.
  • No reads of Helm release metadata (secrets, ConfigMaps, or helm list).
  • No dependence on release-scoped labels such as app.kubernetes.io/instance.

Every lookup is a plain Kubernetes API call against live-cluster object shape — labels, discovery, name conventions written into the chart templates themselves. The check is valid regardless of whether the bundle was deployed via deploy.sh (Helm), Argo CD (Helm or Kustomize), kapp, helmfile, or plain kubectl apply.

Checks added or extended

  • Enabled component namespaces must be Active.
  • Declared expectedResources (Deployments, DaemonSets, etc.) must exist and be healthy.
  • Skyhook readiness (scope-limited to the recipe's own declarations): the validator parses each ComponentRef.ManifestFiles of the skyhook-customizations component through the recipe data provider, extracts the metadata.name of every Skyhook resource declared, and verifies each one exists on the cluster with status.status == "complete". A narrow Helm-template-tolerant regex extractor is used because manifests contain Helm {{ ... }} directives that break a full YAML parse. Skyhook CRs present on the cluster that are not declared by this recipe (stale from prior deploys, other tenants) are ignored. Fails closed when an enabled skyhook-customizations ref declares no extractable Skyhook names (recipe misconfiguration).
  • ClusterPolicy readiness: Discovery on nvidia.com/v1 is used up-front to disambiguate three cases the dynamic-client Get conflates:
    • CRD not registered (IsNotFound from discovery) → skip per #607 acceptance.
    • CRD registered but cluster-policy CR absent → fail closed (gpu-operator installed but not reconciled).
    • Any other discovery error (RBAC 403, API 5xx, network timeout) → fail closed with a diagnostic pointing at RBAC / API reachability, so a transient discovery failure cannot silently mask readiness.
  • DRA kubelet-plugin readiness (role-suffix discovery): the validator lists DaemonSets in the component namespace (timeout-bounded) and selects the one whose name ends in the upstream chart's hard-coded role suffix -kubelet-plugin. 0 matches → fail with the DaemonSets seen in the namespace. >1 matches → fail with the matched names. This is a chart-shape assumption (the upstream nvidia-dra-driver-gpu chart always names that DaemonSet <fullname>-kubelet-plugin), not a release-identity assumption — it survives fullnameOverride and non-Helm deployers. Readiness gate: both DesiredNumberScheduled > 0 and NumberReady == DesiredNumberScheduled.

Skyhook is a cluster-scoped CR; the name-based Get path is cluster-scoped by construction, avoiding the namespaced-List-against-cluster-scoped-kind 404 that an earlier approach would have hit.

Supporting changes

  • tests/e2e/run.sh seeds a fake ClusterPolicy CRD (no status subresource — deliberate, and commented) plus a ready cluster-policy CR so deployment-phase E2E exercises the strengthened validator behavior.
  • docs/user/cli-reference.md documents the expanded deployment phase: the full list of checks, the graceful-skip path when a CRD is absent, the fail-closed path for a missing CR with the CRD present, and the fail-closed path for non-NotFound discovery errors (with a concrete RBAC 403 example).

Testing

# Local package run + coverage
GOFLAGS="-mod=vendor" go test -race -coverprofile=cover.out ./validators/deployment/...
go tool cover -func=cover.out | tail -1

# Repo coverage gate + full local gate
make test-coverage
unset GITLAB_TOKEN && make qualify
  • go test -race ./validators/deployment/...: passed.
  • validators/deployment coverage: substantially improved vs origin/main (new tests span the full behavioral surface below).
  • make test-coverage: passed (above the 70% project-wide threshold).
  • unset GITLAB_TOKEN && make qualify: passed.

Added / updated regression coverage:

DRA kubelet-plugin discovery

  • Happy path with the default chart name.
  • Custom fullname-override name (e.g., my-custom-gpu-kubelet-plugin) still passes — proves the role-suffix approach is not coupled to the default name.
  • Two DaemonSets with matching suffix fails with an ambiguity diagnostic listing both names.
  • Unrelated DaemonSet in the same namespace (e.g., node-exporter) is ignored.
  • Missing DaemonSet fails with a diagnostic that lists the DaemonSets actually seen in the namespace (for debugging).
  • 0/0 and under-scheduled DaemonSet fail the readiness gate.

Skyhook readiness

  • Cluster-scoped Get contract (regression against the earlier namespaced-List 404).
  • CRD not registered → skip.
  • Unrelated stale Skyhook CR on the cluster (even with the AICR label) does not affect the result when the expected CR is complete.
  • Multiple expected Skyhooks with mixed failures → all failures surface in one error (not first-return).
  • Enabled skyhook-customizations ref with no extractable Skyhook names → fail closed.
  • Direct unit tests for extractSkyhookNamesFromManifest, including the tuning-gke.yaml case (metadata.name is "tuning", not "tuning-gke"), Helm-template preamble tolerance, and templated-name skip.

ClusterPolicy readiness

  • CRD-missing skip (via discovery).
  • CRD present but CR missing → fail closed with "gpu-operator installed but CR missing?" diagnostic.
  • Discovery returns non-NotFound (RBAC 403) → fail closed.
  • Missing status.state → fail.

Other

  • Namespace-not-Active → fail.
  • TestMain forces the global recipe data provider to initialize before any t.Parallel() test calls recipe.GetManifestContent, working around a pre-existing race on lazy init in pkg/recipe/provider.go.

Live-cluster validation

Ran aicr validate --recipe recipe.yaml --phase deployment against a real EKS H100 cluster (aws-us-east-1-aicr-cuj2) with the dynamo-platform inference recipe:

phase=deployment status=passed validators=4 passed=4 failed=0 duration=26.824s
Validator Status
operator-health passed
expected-resources passed — exercised namespace Active for 10 namespaces, Skyhook tuning status.status==complete (name-based from the recipe's manifestFiles), ClusterPolicy.status.state==ready (discovery-gated), and the DRA kubelet-plugin DaemonSet via -kubelet-plugin suffix match
gpu-operator-version passed
check-nvidia-smi passed — 2/2 GPU nodes usable

Exit code was 0; CTRF report archived locally.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: No migration or flag changes. This strengthens deployment-phase validation only.

  • Skyhook and ClusterPolicy checks skip only when the corresponding CRD is not registered.
  • Once the CRD exists, missing readiness objects are surfaced as real failures; non-NotFound discovery errors are never treated as skips.
  • DRA kubelet-plugin discovery works for default-named and fullname-override'd deployments alike, and is deployer-neutral (no Helm or release-identity dependency).
  • Skyhook readiness is scoped to the Skyhook CRs the recipe itself declares via ManifestFiles; unrelated Skyhooks on the cluster are ignored.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 requested a review from a team as a code owner April 18, 2026 00:07
@yuanchen8911 yuanchen8911 marked this pull request as draft April 18, 2026 00:11
@yuanchen8911 yuanchen8911 changed the title enhance(validator): strengthen deployment readiness checks (WIP) enhance(validator): strengthen deployment readiness checks Apr 18, 2026
@coderabbitai

coderabbitai Bot commented Apr 18, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

checkExpectedResources now errors when ctx.Clientset is nil and restricts work to enabled ComponentRefs. It adds readiness verifiers that run after resource existence checks: namespaces reach Active; Skyhook CRs (names extracted from enabled manifest files) exist and have status.status == "complete (CRD-gated); GPU Operator ClusterPolicy has status.state == "ready" (CRD-gated); and the DRA kubelet-plugin DaemonSet (discovered by -kubelet-plugin suffix) reports ready pods. A dynamic client is created via getDynamicClient for CR checks. New unit tests, e2e fixture setup/cleanup helpers, and CLI docs were added/updated.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'enhance(validator): add targeted post-deployment GPU readiness checks' directly and accurately reflects the main change: adding GPU readiness signal checks to the deployment validator.
Description check ✅ Passed The description is comprehensive and clearly related to the changeset, detailing the new GPU readiness checks (Skyhook, DRA, ClusterPolicy), implementation approach, and testing results.
Linked Issues check ✅ Passed All coding objectives from #607 are met: deployment phase reuse, baseline check preservation, Skyhook/DRA/ClusterPolicy readiness additions, deployer-neutrality, and narrow scope implementation in validators/deployment.
Out of Scope Changes check ✅ Passed All changes remain narrowly scoped to the stated objectives: deployment validator enhancements, E2E test fixture setup, and documentation updates; no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@validators/deployment/expected_resources_test.go`:
- Around line 37-231: Add a negative test that ensures verifyNamespacesActive()
(exercised via checkExpectedResources) fails when a required namespace is
missing or not Active: create a new test (e.g.,
TestCheckExpectedResources_FailsWhenNamespaceNotActive) using
newDeploymentTestContext with a component referencing a namespace but include a
corresponding non-active namespace object (use an existing helper like
inactiveNamespace or omit activeNamespace) in the runtime.Objects, then call
checkExpectedResources(ctx) and assert it returns an error and that the error
message contains the namespace readiness context (e.g., "namespace <name> not
Active" or similar) following the pattern used in other tests (t.Fatal/t.Fatalf
when err == nil and strings.Contains checks).

In `@validators/deployment/expected_resources.go`:
- Around line 267-276: The DaemonSet health check in verifyDRAKubeletPluginReady
currently allows a 0/0 DaemonSet to pass; after the existing
helper.VerifyResource call, explicitly fetch the DaemonSet object (use
ctx.Clientset.AppsV1().DaemonSets(namespace).Get with ctx.Ctx and
draKubeletPluginName), then inspect ds.Status.DesiredNumberScheduled and
ds.Status.NumberReady; if DesiredNumberScheduled == 0 or NumberReady == 0 return
a non-nil error (include context mentioning draKubeletPluginName and namespace)
so the DRA readiness gate fails when no pods are scheduled or ready.
- Around line 297-309: The buildResourceFetcher() function currently passes
ctx.DynamicClient directly into chainsaw.NewClusterFetcher(), which can be nil
for test or manually-constructed validators.Context instances; change
buildResourceFetcher() to call getDynamicClient(ctx) and use its returned
dynamic.Interface (handling the error) when constructing the chainsaw.Fetcher so
it lazily initializes from ctx.RESTConfig like verifySkyhookReady() and
verifyClusterPolicyReady() do; reference the validators.Context,
getDynamicClient(), buildResourceFetcher(), ctx.DynamicClient, and
chainsaw.NewClusterFetcher() in your fix and propagate any returned error
instead of risking a nil client.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 3ee239b2-2554-4580-991d-6ec0d91176e2

📥 Commits

Reviewing files that changed from the base of the PR and between c12c783 and f5e3ee9.

📒 Files selected for processing (2)
  • validators/deployment/expected_resources.go
  • validators/deployment/expected_resources_test.go

Comment thread validators/deployment/expected_resources_test.go
Comment thread validators/deployment/expected_resources.go Outdated
Comment thread validators/deployment/expected_resources.go
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from f5e3ee9 to fbca793 Compare April 18, 2026 01:52
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from fbca793 to e51b9be Compare April 18, 2026 02:02
@yuanchen8911 yuanchen8911 changed the title (WIP) enhance(validator): strengthen deployment readiness checks (WIP) enhance(validator): enhance deployment readiness checks Apr 18, 2026
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

@yuanchen8911

yuanchen8911 commented Apr 20, 2026

Copy link
Copy Markdown
Contributor Author

Note that the scope of Issue 607 and PR 611 I opened is intentionally narrow and not meant to address all concerns. The goal is to improve the current AICR validator after deployment.

@lockwobr will investigate more advanced ideas, such as the bridge-job (issue #516), along with @pwittrock's ADR, in parallel as a longer-term solution.

A few notes on the proposal in issue #607 and PR #611 implementation:

  • The validator is independent of the deployer
  • Each component provides its own readiness status
  • The validator performs a complete readiness check across all components, including Skyhook and GPU Operator

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 20, 2026
…istered

Issue NVIDIA#607's acceptance criteria explicitly calls for the Skyhook check
to "not fail on clusters where the Skyhook CRD is not registered ...
the Skyhook check is skipped (or reported as N/A) in those cases."
Before this commit, PR NVIDIA#611 still propagated the dynamic-client error
(meta.NoKindMatchError or apierrors.NotFound) as an ErrCodeInternal
failure, so a recipe that declared skyhook-customizations against a
cluster where the operator's CRD had not yet been installed tripped
the deployment phase.

Apply the same relaxation to verifyClusterPolicyReady by symmetry: a
recipe can list gpu-operator before the operator's ClusterPolicy CRD
exists on the cluster, and that case should also be a skip, not a
hard failure.

Behavior summary:
- If the CRD is not registered (meta.IsNoMatchError), print a short
  "CRD not registered, skipping" line and return nil.
- For ClusterPolicy, also treat apierrors.IsNotFound as skip. The
  instance not existing means the operator has not reconciled yet
  or is not installed, which NVIDIA#607 groups with the CRD-missing path.
- Any other error (transient API, auth, timeout) still fails closed
  with the original wrapping, so we do not mask real problems.

Tests:
- Extend the homegrown fake dynamic client with an "unregistered"
  GVR set: Resource() returns a sentinel client whose Get/List
  return meta.NoKindMatchError. This lets unit tests express the
  CRD-not-installed state without pulling in a scheme-backed fake.
- Add TestCheckExpectedResources_SkipsSkyhookWhenCRDNotRegistered
  and TestCheckExpectedResources_SkipsClusterPolicyWhenCRDNotRegistered
  pinning the NVIDIA#607 acceptance behavior.
@yuanchen8911 yuanchen8911 changed the title (WIP) enhance(validator): enhance deployment readiness checks enhance(validator): enhance deployment readiness checks Apr 20, 2026
@github-actions

Copy link
Copy Markdown
Contributor

yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 20, 2026
…g CR

Reviewer observation on PR NVIDIA#611: verifyClusterPolicyReady previously
treated two distinct situations identically, because both surfaced as
IsNotFound from the dynamic Get:
  1. CRD not registered — gpu-operator chart not installed. Legitimate
     skip per NVIDIA#607 (recipe declares the component, but the component
     isn't on the cluster yet).
  2. CRD registered but cluster-policy CR absent — gpu-operator is
     installed, but its singleton CR was never created or was manually
     deleted. The operator cannot reconcile the GPU stack in that
     state, so this is a real misconfiguration.

A bare Get cannot tell these apart (same 404). Use Discovery to
disambiguate: ServerResourcesForGroupVersion("nvidia.com/v1") is the
authoritative check for whether the CRD is installed. When discovery
reports the group as present, any subsequent Get error — including
IsNotFound on the CR itself — is a real failure and is surfaced with a
"gpu-operator installed but CR missing?" diagnostic.

Also:
- verifySkyhookReady is intentionally NOT symmetric: for a List call,
  IsNotFound can only mean "kind not served", so it is safe to skip on
  there, and the "CRD present, zero CRs" case is handled distinctly by
  the explicit len(Items) == 0 failure path. Added a comment documenting
  this so a future refactor does not unify the two paths and silently
  mask the missing-Skyhook case.

Tests:
- Extend the fake clientset with a configured Discovery service:
  GroupVersions are auto-registered from dynamicObjects (minus any in
  `unregistered`), and a second helper lets a test declare extra
  registered GroupVersions explicitly (needed when we want discovery
  to succeed without seeding a CR).
- Existing tests unchanged in intent; they now flow through the new
  auto-register path transparently.
- New test TestCheckExpectedResources_FailsWhenClusterPolicyCRMissing
  pins the disambiguation behavior: CRD registered, CR absent → fail
  with the expected diagnostic.

Docs:
- docs/user/cli-reference.md clarifies that once a CRD is registered,
  a missing CR fails loudly (example: gpu-operator ClusterPolicy).
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 20, 2026
…d path

Reviewer feedback on PR NVIDIA#611: the "Other errors (transient API, auth)
still fail closed" line is accurate but too terse to help diagnosis.
Call out a concrete case — an RBAC 403 on skyhooks.skyhook.nvidia.com
is not a NoMatch and therefore surfaces as a failure rather than a
silent skip — so a reader hitting it can recognize what happened.
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from b8a5103 to 899df5e Compare April 20, 2026 18:30
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 899df5e to 2a2baaa Compare April 20, 2026 19:13
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 2a2baaa to 0548c07 Compare April 20, 2026 19:22
@yuanchen8911 yuanchen8911 changed the title enhance(validator): enhance deployment readiness checks enhance(validator): enhance deployment readiness checks (WIP) Apr 20, 2026
@yuanchen8911 yuanchen8911 changed the title enhance(validator): enhance deployment readiness checks (WIP) enhance(validator): add targeted GPU readiness checks Apr 20, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review April 20, 2026 20:21
Strengthen `aicr validate --phase deployment` so it verifies deployment
completeness plus the targeted GPU readiness signals called out in
issue NVIDIA#607, keeping the implementation in the existing pure-Go
validators/deployment path rather than adding a new phase, flag, or
Chainsaw runtime dependency.

Checks added or extended:

- Enabled component namespaces must be `Active`.
- Declared `expectedResources` (Deployments, DaemonSets, etc.) must
  exist and be healthy.
- When `skyhook-customizations` is enabled: every Skyhook CR the recipe
  declares (extracted from ComponentRef.ManifestFiles, parsed with a
  narrow Helm-template-tolerant regex extractor) must exist on the
  cluster and report `status.status == "complete"`. Unrelated Skyhook
  CRs on the cluster (stale from prior deploys, or from other tenants)
  are ignored. All non-complete expected Skyhooks surface in a single
  error rather than first-failure-only.
- When `gpu-operator` is enabled: `ClusterPolicy.status.state == ready`.
  Discovery on `nvidia.com/v1` is used up-front to disambiguate three
  cases the dynamic-client Get conflates:
    * CRD not registered (`IsNotFound` from discovery): skip per NVIDIA#607.
    * CRD registered but `cluster-policy` CR absent: fail closed as a
      real gpu-operator misconfiguration.
    * Any other discovery error (RBAC 403, API 5xx, timeout): fail
      closed with a diagnostic pointing at RBAC / API reachability,
      so a transient discovery failure cannot silently pass the check.
- When `nvidia-dra-driver-gpu` is enabled: exactly one DaemonSet whose
  name ends in the upstream chart's role suffix "-kubelet-plugin" must
  exist in the component namespace and be ready (both
  DesiredNumberScheduled > 0 and NumberReady == DesiredNumberScheduled).
  Discovery is by object-name shape, not by release-scoped labels —
  the check makes no Helm API calls and no release-identity
  assumptions, so it survives `fullnameOverride` and non-Helm deployers
  alike. Ambiguous (>1) and missing (0) results report the matched and
  seen DaemonSet names for debugging.

Skyhook skip behavior matches NVIDIA#607's acceptance: if the Skyhook CRD is
not registered, the check is a no-op rather than a failure. The
implementation uses Discovery for the CRD-gate and then Get-by-name for
each expected Skyhook, so "CRD absent" and "CRD present but expected CR
absent" are two genuinely different code paths — the first skips, the
second fails loudly.

Cluster-scoped Skyhook: previous (dynamic-client) code issued a
namespaced List on a cluster-scoped kind, which returned 404 even when
the CR was present on a real cluster. The name-based Get path is
cluster-scoped by construction and does not repeat that mistake.

Tests:
- Homegrown fake dynamic client gained resource-scope awareness (calls
  to `.Namespace(x)` on a cluster-scoped GVR return a 404-like error)
  and an explicit "unregistered GVR" set so List/Get can return a real
  `meta.NoKindMatchError`, mirroring a missing CRD.
- Fake clientset Discovery service is auto-configured from the
  dynamicObjects seeded by each test, with an escape hatch
  (`newDeploymentTestContextWithDiscovery`) for tests that need a
  GroupVersion registered without any CR instance. The reactor used to
  inject non-NotFound discovery errors binds to the literal resource
  string "resource" — see the inline comment citing vendor/k8s.io/
  client-go/discovery/fake/discovery.go for why "apiresources" would
  silently fail to match.
- Direct unit tests for the Skyhook manifest name extractor,
  including the tuning-gke.yaml edge case (metadata.name is "tuning",
  not "tuning-gke") and Helm template preamble tolerance.
- New DRA regression tests: custom fullname-override name, multiple
  matches fails with ambiguity, unrelated DaemonSet in the namespace
  is ignored, missing DaemonSet fails with a diagnostic that includes
  the DaemonSets actually seen in the namespace.
- New Skyhook regression tests: stale labeled Skyhook unrelated to
  the recipe's declared names is ignored, multiple expected Skyhooks
  in non-complete states all surface in a single error, enabled
  skyhook-customizations ref with no extractable Skyhook names fails
  closed as a recipe misconfiguration, CRD-missing skip, discovery-
  error fail-closed, cluster-scoped List contract.
- `TestMain` forces the global recipe data provider to initialize
  before any `t.Parallel()` test calls `recipe.GetManifestContent`,
  working around a pre-existing race on the package-level lazy
  initialization in `pkg/recipe/provider.go`.

E2E:
- `tests/e2e/run.sh` seeds a fake `ClusterPolicy` CRD (no status
  subresource, deliberately — see the inline comment above the CRD)
  plus a ready `cluster-policy` CR so deployment-phase E2E exercises
  the strengthened validator behavior.

Docs:
- `docs/user/cli-reference.md` documents the expanded deployment phase:
  the full list of checks, the graceful-skip path when a CRD is
  absent, and the fail-closed path for a missing CR when the CRD is
  present and for non-NotFound discovery errors (with a concrete RBAC
  403 example).

Fixes: NVIDIA#607
@yuanchen8911 yuanchen8911 force-pushed the fix/deployment-phase-readiness branch from 0548c07 to 5130bfe Compare April 20, 2026 20:28

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/cli-reference.md`:
- Around line 596-606: Update the "Deployment phase checks" text to document the
additional fail-closed case for skyhook customizations: when a component with
skyhook-customizations enabled produces no extractable Skyhook resource names
from its ComponentRef.ManifestFiles, the validator does not skip but fails
deployment validation immediately with a clear "no extractable Skyhook names"
diagnostic (treat this as a recipe/configuration error, distinct from a
CRD-missing skip or a missing CR failure). Reference the skyhook-customizations
flag and ComponentRef.ManifestFiles so readers can locate the source of the
extraction and understand this specific failure path.

In `@tests/e2e/run.sh`:
- Around line 716-720: cleanup_fake_gpu_operator_fixture currently
unconditionally deletes cluster-scoped objects (cluster-policy and
clusterpolicies.nvidia.com); change the flow so the test only removes resources
it created: when you create the CRD or ClusterPolicy in the setup (the code that
creates clusterpolicies.nvidia.com and cluster-policy), set flags (e.g.,
CREATED_GPU_OPERATOR_CRD and CREATED_CLUSTER_POLICY) or record their creation
state, and then modify cleanup_fake_gpu_operator_fixture to delete those
resources only if the corresponding flag is true; alternatively, before
creating, detect existing resources and skip the test or mark them as
preexisting so cleanup does not remove them (refer to function
cleanup_fake_gpu_operator_fixture, the resource names cluster-policy and
clusterpolicies.nvidia.com).
- Around line 713-714: The kubectl wait command for deployment/gpu-operator
currently swallows failures via "|| true", so change the helper to fail when the
deployment never becomes Available: remove the "|| true" from the kubectl wait
invocation (or capture its exit code and explicitly return/exit non-zero), so
the function running the wait returns a non-zero status on timeout and callers
can skip or abort cleanly; update the kubectl wait invocation (the line invoking
"kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator
--timeout=60s") to propagate failures instead of hiding them.

In `@validators/deployment/expected_resources.go`:
- Around line 174-175: The Kubernetes API calls (e.g.,
ctx.Clientset.CoreV1().Namespaces().Get, discovery and dynamic-client calls)
must use a bounded context instead of ctx.Ctx to avoid stalling the whole
validator; create a short-lived context using the same pattern as
verifyDRAKubeletPluginReady (e.g., call
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a timed context, use
that context for the Get/Discovery/Dynamic calls, and ensure you Cancel/cleanup
the timed context), and replace direct uses of ctx.Ctx in those calls (locations
around the Namespace Get, the discovery calls, and the dynamic client usage)
with the new timed context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 643460d4-e2a9-4d4d-97c0-94c575219b90

📥 Commits

Reviewing files that changed from the base of the PR and between f5e3ee9 and 0548c07.

📒 Files selected for processing (4)
  • docs/user/cli-reference.md
  • tests/e2e/run.sh
  • validators/deployment/expected_resources.go
  • validators/deployment/expected_resources_test.go

Comment thread docs/user/cli-reference.md
Comment thread tests/e2e/run.sh Outdated
Comment thread tests/e2e/run.sh
Comment thread validators/deployment/expected_resources.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (5)
validators/deployment/expected_resources.go (3)

164-188: ⚠️ Potential issue | 🟠 Major

Wrap namespace lookup in a bounded context.

verifyNamespacesActive calls ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout. If the API server stalls, this probe can consume the entire deployment-validation budget and hang the validator job.

In contrast, verifyDRAKubeletPluginReady correctly uses ctx.Timeout(defaults.ResourceVerificationTimeout). For consistency and resilience, apply the same pattern here.

🔧 Suggested fix
 func verifyNamespacesActive(ctx *validators.Context, refs []recipe.ComponentRef) []string {
 	var failures []string
 	seen := make(map[string]bool, len(refs))

+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	for _, ref := range refs {
 		if ref.Namespace == "" || seen[ref.Namespace] {
 			continue
 		}
 		seen[ref.Namespace] = true

-		ns, err := ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ref.Namespace, metav1.GetOptions{})
+		ns, err := ctx.Clientset.CoreV1().Namespaces().Get(verifyCtx, ref.Namespace, metav1.GetOptions{})
 		if err != nil {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 164 - 188,
verifyNamespacesActive currently calls
ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap
the lookup in a bounded context like the other validator functions by calling
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and
cancel function, defer cancel(), and pass that bounded context into
Namespaces().Get; update verifyNamespacesActive to use the new ctx from
ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to
prevent hanging.

391-409: ⚠️ Potential issue | 🟠 Major

Add bounded context to ClusterPolicy readiness check.

Same issue as verifySkyhookReady: the dynamic Get at line 409 uses ctx.Ctx directly. Use a bounded context for the dynamic client call.

🔧 Suggested fix
 func verifyClusterPolicyReady(ctx *validators.Context) error {
+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	gv := clusterPolicyGVR.GroupVersion().String()
 	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)
 	// ... discovery switch ...

 	dynClient, err := getDynamicClient(ctx)
 	if err != nil {
 		return err
 	}

-	clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(ctx.Ctx, clusterPolicyName, metav1.GetOptions{})
+	clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(verifyCtx, clusterPolicyName, metav1.GetOptions{})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 391 - 409, The
dynamic Get call using ctx.Ctx for clusterPolicy retrieval is unbounded; create
a bounded context (e.g., boundedCtx, cancel := context.WithTimeout(ctx.Ctx,
defaultTimeout) / defer cancel()) and use boundedCtx in
dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx, ensuring you
import context and define/choose an appropriate timeout constant (or reuse an
existing one) and call cancel() to avoid leaking goroutines; adjust the code
paths around getDynamicClient, dynClient, clusterPolicyGVR and clusterPolicyName
accordingly.

249-273: ⚠️ Potential issue | 🟠 Major

Add bounded contexts to Skyhook readiness checks.

The discovery call (line 254) and dynamic Get calls (line 273) use ctx.Ctx directly without a timeout. Apply the same ctx.Timeout(defaults.ResourceVerificationTimeout) pattern used in verifyDRAKubeletPluginReady for consistency.

🔧 Suggested fix
 func verifySkyhookReady(ctx *validators.Context, ref recipe.ComponentRef) error {
 	expectedNames, err := expectedSkyhookNames(ref)
 	if err != nil {
 		return err
 	}
 	if len(expectedNames) == 0 {
 		return errors.New(errors.ErrCodeNotFound,
 			fmt.Sprintf("no Skyhook CR names could be extracted from component %s manifestFiles=%v",
 				ref.Name, ref.ManifestFiles))
 	}

+	verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout)
+	defer cancel()
+
 	gv := skyhookGVR.GroupVersion().String()
-	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)
+	_, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv)  // Note: Discovery doesn't take context
 	// ... rest of function should use verifyCtx for dynamic client calls

Note: The Discovery client doesn't accept a context parameter, but the dynamic client Get calls should use the bounded context:

-		sk, getErr := dynClient.Resource(skyhookGVR).Get(ctx.Ctx, name, metav1.GetOptions{})
+		sk, getErr := dynClient.Resource(skyhookGVR).Get(verifyCtx, name, metav1.GetOptions{})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@validators/deployment/expected_resources.go` around lines 249 - 273, The
discovery call ServerResourcesForGroupVersion cannot take a context, but the
dynamic client Get calls do need a bounded context like
verifyDRAKubeletPluginReady uses: create a bounded context via
ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx :=
ctx.Timeout(defaults.ResourceVerificationTimeout) and defer
boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx
(or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead
of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern
around the loop that calls getDynamicClient and dynClient.Resource(...).Get to
ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.
tests/e2e/run.sh (2)

716-720: ⚠️ Potential issue | 🟠 Major

Guard cleanup to avoid deleting pre-existing GPU Operator resources.

cleanup_fake_gpu_operator_fixture() unconditionally deletes the ClusterPolicy and CRD. If this script runs against a cluster with a real GPU Operator installed, teardown would delete production resources.

Consider tracking whether the fixture created these objects and only deleting them if so, or checking for pre-existing resources before setup and skipping the test entirely.

🔧 Suggested approach
+# Track what the fixture created
+CREATED_GPU_OPERATOR_CRD=false
+CREATED_CLUSTER_POLICY=false
+
 setup_fake_gpu_operator_fixture() {
+  # Check if CRD already exists (real gpu-operator installed)
+  if kubectl get crd clusterpolicies.nvidia.com &>/dev/null; then
+    echo "ClusterPolicy CRD already exists (real gpu-operator?), skipping fixture"
+    return 1
+  fi
+
   kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f - 2>&1 || true
   # ... CRD apply ...
+  CREATED_GPU_OPERATOR_CRD=true
   # ... ClusterPolicy apply ...
+  CREATED_CLUSTER_POLICY=true
   # ...
 }

 cleanup_fake_gpu_operator_fixture() {
   kubectl delete deployment gpu-operator -n gpu-operator --ignore-not-found 2>&1 || true
-  kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true
-  kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true
+  if [ "$CREATED_CLUSTER_POLICY" = true ]; then
+    kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true
+  fi
+  if [ "$CREATED_GPU_OPERATOR_CRD" = true ]; then
+    kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true
+  fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/run.sh` around lines 716 - 720, The cleanup function
cleanup_fake_gpu_operator_fixture unconditionally deletes ClusterPolicy/CRD
which can remove a real GPU Operator; modify setup and cleanup so the script
first checks for pre-existing resources (e.g., run kubectl get for deployment
gpu-operator, clusterpolicy cluster-policy, and crd clusterpolicies.nvidia.com)
and record whether the fixture created them (store a boolean flag or temporary
marker). In setup, if resources already exist, set the flag to skip teardown or
abort the test; in cleanup, only run the kubectl delete lines when the recorded
flag indicates the fixture created those specific objects. Ensure the
flag/marker is uniquely tied to this test run so production resources are never
deleted by mistake.

713-714: ⚠️ Potential issue | 🟠 Major

Propagate deployment availability wait failure.

The kubectl wait at line 713 uses || true, which swallows the failure. If the fake gpu-operator deployment never becomes Available, subsequent tests will run against a broken fixture and fail for unrelated reasons.

🔧 Suggested fix
-  kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1 || true
+  if ! kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1; then
+    return 1
+  fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/run.sh` around lines 713 - 714, The kubectl wait invocation in
run.sh currently appends "|| true" which swallows failures; remove that fallback
so the non‑zero exit from kubectl wait propagates (or explicitly check its exit
code and exit with a helpful error) to halt the test run when the gpu-operator
deployment (deployment/gpu-operator -n gpu-operator) never becomes Available;
update the block containing the kubectl wait call to either drop "|| true" or
add an explicit error check/exit so downstream tests don't run against a broken
fixture.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/e2e/run.sh`:
- Around line 716-720: The cleanup function cleanup_fake_gpu_operator_fixture
unconditionally deletes ClusterPolicy/CRD which can remove a real GPU Operator;
modify setup and cleanup so the script first checks for pre-existing resources
(e.g., run kubectl get for deployment gpu-operator, clusterpolicy
cluster-policy, and crd clusterpolicies.nvidia.com) and record whether the
fixture created them (store a boolean flag or temporary marker). In setup, if
resources already exist, set the flag to skip teardown or abort the test; in
cleanup, only run the kubectl delete lines when the recorded flag indicates the
fixture created those specific objects. Ensure the flag/marker is uniquely tied
to this test run so production resources are never deleted by mistake.
- Around line 713-714: The kubectl wait invocation in run.sh currently appends
"|| true" which swallows failures; remove that fallback so the non‑zero exit
from kubectl wait propagates (or explicitly check its exit code and exit with a
helpful error) to halt the test run when the gpu-operator deployment
(deployment/gpu-operator -n gpu-operator) never becomes Available; update the
block containing the kubectl wait call to either drop "|| true" or add an
explicit error check/exit so downstream tests don't run against a broken
fixture.

In `@validators/deployment/expected_resources.go`:
- Around line 164-188: verifyNamespacesActive currently calls
ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap
the lookup in a bounded context like the other validator functions by calling
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and
cancel function, defer cancel(), and pass that bounded context into
Namespaces().Get; update verifyNamespacesActive to use the new ctx from
ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to
prevent hanging.
- Around line 391-409: The dynamic Get call using ctx.Ctx for clusterPolicy
retrieval is unbounded; create a bounded context (e.g., boundedCtx, cancel :=
context.WithTimeout(ctx.Ctx, defaultTimeout) / defer cancel()) and use
boundedCtx in dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx,
ensuring you import context and define/choose an appropriate timeout constant
(or reuse an existing one) and call cancel() to avoid leaking goroutines; adjust
the code paths around getDynamicClient, dynClient, clusterPolicyGVR and
clusterPolicyName accordingly.
- Around line 249-273: The discovery call ServerResourcesForGroupVersion cannot
take a context, but the dynamic client Get calls do need a bounded context like
verifyDRAKubeletPluginReady uses: create a bounded context via
ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx :=
ctx.Timeout(defaults.ResourceVerificationTimeout) and defer
boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx
(or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead
of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern
around the loop that calls getDynamicClient and dynClient.Resource(...).Get to
ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 1a00896b-3122-4a5c-a869-cddebd8ee65f

📥 Commits

Reviewing files that changed from the base of the PR and between 0548c07 and 5130bfe.

📒 Files selected for processing (4)
  • docs/user/cli-reference.md
  • tests/e2e/run.sh
  • validators/deployment/expected_resources.go
  • validators/deployment/expected_resources_test.go

@yuanchen8911 yuanchen8911 changed the title enhance(validator): add targeted GPU readiness checks enhance(validator): add targeted post-deployment GPU readiness checks Apr 20, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid PR — the deployer-neutral design, discovery-gated CRD checks with fail-closed semantics, and scoped Skyhook readiness (by recipe manifest names, not cluster-wide label match) are well thought through. CI is fully green. Test coverage is thorough across the behavioral surface.

One medium finding: the test fake's Get method returns a bare errors.New("object not found") which doesn't satisfy apierrors.IsNotFound(), leaving the specific "recipe declared it but cluster has no such CR" branch in verifySkyhookReady untested. The rest are nits (test-only constants in production code, minor getDynamicClient caching opportunity).

Nothing blocks merge — the medium item is a test fidelity gap, not a production correctness issue.

Comment thread validators/deployment/expected_resources.go
Comment thread validators/deployment/expected_resources.go Outdated
Comment thread validators/deployment/expected_resources.go
Comment thread validators/deployment/expected_resources.go
Comment thread validators/deployment/expected_resources_test.go
Comment thread validators/deployment/expected_resources_test.go Outdated
@mchmarny mchmarny merged commit 17d42f9 into NVIDIA:main Apr 20, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance deployment-phase validation to reflect post-install workload readiness

3 participants