enhance(validator): add targeted post-deployment GPU readiness checks#611
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthrough
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@validators/deployment/expected_resources_test.go`:
- Around line 37-231: Add a negative test that ensures verifyNamespacesActive()
(exercised via checkExpectedResources) fails when a required namespace is
missing or not Active: create a new test (e.g.,
TestCheckExpectedResources_FailsWhenNamespaceNotActive) using
newDeploymentTestContext with a component referencing a namespace but include a
corresponding non-active namespace object (use an existing helper like
inactiveNamespace or omit activeNamespace) in the runtime.Objects, then call
checkExpectedResources(ctx) and assert it returns an error and that the error
message contains the namespace readiness context (e.g., "namespace <name> not
Active" or similar) following the pattern used in other tests (t.Fatal/t.Fatalf
when err == nil and strings.Contains checks).
In `@validators/deployment/expected_resources.go`:
- Around line 267-276: The DaemonSet health check in verifyDRAKubeletPluginReady
currently allows a 0/0 DaemonSet to pass; after the existing
helper.VerifyResource call, explicitly fetch the DaemonSet object (use
ctx.Clientset.AppsV1().DaemonSets(namespace).Get with ctx.Ctx and
draKubeletPluginName), then inspect ds.Status.DesiredNumberScheduled and
ds.Status.NumberReady; if DesiredNumberScheduled == 0 or NumberReady == 0 return
a non-nil error (include context mentioning draKubeletPluginName and namespace)
so the DRA readiness gate fails when no pods are scheduled or ready.
- Around line 297-309: The buildResourceFetcher() function currently passes
ctx.DynamicClient directly into chainsaw.NewClusterFetcher(), which can be nil
for test or manually-constructed validators.Context instances; change
buildResourceFetcher() to call getDynamicClient(ctx) and use its returned
dynamic.Interface (handling the error) when constructing the chainsaw.Fetcher so
it lazily initializes from ctx.RESTConfig like verifySkyhookReady() and
verifyClusterPolicyReady() do; reference the validators.Context,
getDynamicClient(), buildResourceFetcher(), ctx.DynamicClient, and
chainsaw.NewClusterFetcher() in your fix and propagate any returned error
instead of risking a nil client.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 3ee239b2-2554-4580-991d-6ec0d91176e2
📒 Files selected for processing (2)
validators/deployment/expected_resources.govalidators/deployment/expected_resources_test.go
f5e3ee9 to
fbca793
Compare
fbca793 to
e51b9be
Compare
|
Note that the scope of Issue 607 and PR 611 I opened is intentionally narrow and not meant to address all concerns. The goal is to improve the current AICR validator after deployment. @lockwobr will investigate more advanced ideas, such as the bridge-job (issue #516), along with @pwittrock's ADR, in parallel as a longer-term solution. A few notes on the proposal in issue #607 and PR #611 implementation:
|
…istered Issue NVIDIA#607's acceptance criteria explicitly calls for the Skyhook check to "not fail on clusters where the Skyhook CRD is not registered ... the Skyhook check is skipped (or reported as N/A) in those cases." Before this commit, PR NVIDIA#611 still propagated the dynamic-client error (meta.NoKindMatchError or apierrors.NotFound) as an ErrCodeInternal failure, so a recipe that declared skyhook-customizations against a cluster where the operator's CRD had not yet been installed tripped the deployment phase. Apply the same relaxation to verifyClusterPolicyReady by symmetry: a recipe can list gpu-operator before the operator's ClusterPolicy CRD exists on the cluster, and that case should also be a skip, not a hard failure. Behavior summary: - If the CRD is not registered (meta.IsNoMatchError), print a short "CRD not registered, skipping" line and return nil. - For ClusterPolicy, also treat apierrors.IsNotFound as skip. The instance not existing means the operator has not reconciled yet or is not installed, which NVIDIA#607 groups with the CRD-missing path. - Any other error (transient API, auth, timeout) still fails closed with the original wrapping, so we do not mask real problems. Tests: - Extend the homegrown fake dynamic client with an "unregistered" GVR set: Resource() returns a sentinel client whose Get/List return meta.NoKindMatchError. This lets unit tests express the CRD-not-installed state without pulling in a scheme-backed fake. - Add TestCheckExpectedResources_SkipsSkyhookWhenCRDNotRegistered and TestCheckExpectedResources_SkipsClusterPolicyWhenCRDNotRegistered pinning the NVIDIA#607 acceptance behavior.
…g CR Reviewer observation on PR NVIDIA#611: verifyClusterPolicyReady previously treated two distinct situations identically, because both surfaced as IsNotFound from the dynamic Get: 1. CRD not registered — gpu-operator chart not installed. Legitimate skip per NVIDIA#607 (recipe declares the component, but the component isn't on the cluster yet). 2. CRD registered but cluster-policy CR absent — gpu-operator is installed, but its singleton CR was never created or was manually deleted. The operator cannot reconcile the GPU stack in that state, so this is a real misconfiguration. A bare Get cannot tell these apart (same 404). Use Discovery to disambiguate: ServerResourcesForGroupVersion("nvidia.com/v1") is the authoritative check for whether the CRD is installed. When discovery reports the group as present, any subsequent Get error — including IsNotFound on the CR itself — is a real failure and is surfaced with a "gpu-operator installed but CR missing?" diagnostic. Also: - verifySkyhookReady is intentionally NOT symmetric: for a List call, IsNotFound can only mean "kind not served", so it is safe to skip on there, and the "CRD present, zero CRs" case is handled distinctly by the explicit len(Items) == 0 failure path. Added a comment documenting this so a future refactor does not unify the two paths and silently mask the missing-Skyhook case. Tests: - Extend the fake clientset with a configured Discovery service: GroupVersions are auto-registered from dynamicObjects (minus any in `unregistered`), and a second helper lets a test declare extra registered GroupVersions explicitly (needed when we want discovery to succeed without seeding a CR). - Existing tests unchanged in intent; they now flow through the new auto-register path transparently. - New test TestCheckExpectedResources_FailsWhenClusterPolicyCRMissing pins the disambiguation behavior: CRD registered, CR absent → fail with the expected diagnostic. Docs: - docs/user/cli-reference.md clarifies that once a CRD is registered, a missing CR fails loudly (example: gpu-operator ClusterPolicy).
…d path Reviewer feedback on PR NVIDIA#611: the "Other errors (transient API, auth) still fail closed" line is accurate but too terse to help diagnosis. Call out a concrete case — an RBAC 403 on skyhooks.skyhook.nvidia.com is not a NoMatch and therefore surfaces as a failure rather than a silent skip — so a reader hitting it can recognize what happened.
b8a5103 to
899df5e
Compare
899df5e to
2a2baaa
Compare
2a2baaa to
0548c07
Compare
Strengthen `aicr validate --phase deployment` so it verifies deployment completeness plus the targeted GPU readiness signals called out in issue NVIDIA#607, keeping the implementation in the existing pure-Go validators/deployment path rather than adding a new phase, flag, or Chainsaw runtime dependency. Checks added or extended: - Enabled component namespaces must be `Active`. - Declared `expectedResources` (Deployments, DaemonSets, etc.) must exist and be healthy. - When `skyhook-customizations` is enabled: every Skyhook CR the recipe declares (extracted from ComponentRef.ManifestFiles, parsed with a narrow Helm-template-tolerant regex extractor) must exist on the cluster and report `status.status == "complete"`. Unrelated Skyhook CRs on the cluster (stale from prior deploys, or from other tenants) are ignored. All non-complete expected Skyhooks surface in a single error rather than first-failure-only. - When `gpu-operator` is enabled: `ClusterPolicy.status.state == ready`. Discovery on `nvidia.com/v1` is used up-front to disambiguate three cases the dynamic-client Get conflates: * CRD not registered (`IsNotFound` from discovery): skip per NVIDIA#607. * CRD registered but `cluster-policy` CR absent: fail closed as a real gpu-operator misconfiguration. * Any other discovery error (RBAC 403, API 5xx, timeout): fail closed with a diagnostic pointing at RBAC / API reachability, so a transient discovery failure cannot silently pass the check. - When `nvidia-dra-driver-gpu` is enabled: exactly one DaemonSet whose name ends in the upstream chart's role suffix "-kubelet-plugin" must exist in the component namespace and be ready (both DesiredNumberScheduled > 0 and NumberReady == DesiredNumberScheduled). Discovery is by object-name shape, not by release-scoped labels — the check makes no Helm API calls and no release-identity assumptions, so it survives `fullnameOverride` and non-Helm deployers alike. Ambiguous (>1) and missing (0) results report the matched and seen DaemonSet names for debugging. Skyhook skip behavior matches NVIDIA#607's acceptance: if the Skyhook CRD is not registered, the check is a no-op rather than a failure. The implementation uses Discovery for the CRD-gate and then Get-by-name for each expected Skyhook, so "CRD absent" and "CRD present but expected CR absent" are two genuinely different code paths — the first skips, the second fails loudly. Cluster-scoped Skyhook: previous (dynamic-client) code issued a namespaced List on a cluster-scoped kind, which returned 404 even when the CR was present on a real cluster. The name-based Get path is cluster-scoped by construction and does not repeat that mistake. Tests: - Homegrown fake dynamic client gained resource-scope awareness (calls to `.Namespace(x)` on a cluster-scoped GVR return a 404-like error) and an explicit "unregistered GVR" set so List/Get can return a real `meta.NoKindMatchError`, mirroring a missing CRD. - Fake clientset Discovery service is auto-configured from the dynamicObjects seeded by each test, with an escape hatch (`newDeploymentTestContextWithDiscovery`) for tests that need a GroupVersion registered without any CR instance. The reactor used to inject non-NotFound discovery errors binds to the literal resource string "resource" — see the inline comment citing vendor/k8s.io/ client-go/discovery/fake/discovery.go for why "apiresources" would silently fail to match. - Direct unit tests for the Skyhook manifest name extractor, including the tuning-gke.yaml edge case (metadata.name is "tuning", not "tuning-gke") and Helm template preamble tolerance. - New DRA regression tests: custom fullname-override name, multiple matches fails with ambiguity, unrelated DaemonSet in the namespace is ignored, missing DaemonSet fails with a diagnostic that includes the DaemonSets actually seen in the namespace. - New Skyhook regression tests: stale labeled Skyhook unrelated to the recipe's declared names is ignored, multiple expected Skyhooks in non-complete states all surface in a single error, enabled skyhook-customizations ref with no extractable Skyhook names fails closed as a recipe misconfiguration, CRD-missing skip, discovery- error fail-closed, cluster-scoped List contract. - `TestMain` forces the global recipe data provider to initialize before any `t.Parallel()` test calls `recipe.GetManifestContent`, working around a pre-existing race on the package-level lazy initialization in `pkg/recipe/provider.go`. E2E: - `tests/e2e/run.sh` seeds a fake `ClusterPolicy` CRD (no status subresource, deliberately — see the inline comment above the CRD) plus a ready `cluster-policy` CR so deployment-phase E2E exercises the strengthened validator behavior. Docs: - `docs/user/cli-reference.md` documents the expanded deployment phase: the full list of checks, the graceful-skip path when a CRD is absent, and the fail-closed path for a missing CR when the CRD is present and for non-NotFound discovery errors (with a concrete RBAC 403 example). Fixes: NVIDIA#607
0548c07 to
5130bfe
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/cli-reference.md`:
- Around line 596-606: Update the "Deployment phase checks" text to document the
additional fail-closed case for skyhook customizations: when a component with
skyhook-customizations enabled produces no extractable Skyhook resource names
from its ComponentRef.ManifestFiles, the validator does not skip but fails
deployment validation immediately with a clear "no extractable Skyhook names"
diagnostic (treat this as a recipe/configuration error, distinct from a
CRD-missing skip or a missing CR failure). Reference the skyhook-customizations
flag and ComponentRef.ManifestFiles so readers can locate the source of the
extraction and understand this specific failure path.
In `@tests/e2e/run.sh`:
- Around line 716-720: cleanup_fake_gpu_operator_fixture currently
unconditionally deletes cluster-scoped objects (cluster-policy and
clusterpolicies.nvidia.com); change the flow so the test only removes resources
it created: when you create the CRD or ClusterPolicy in the setup (the code that
creates clusterpolicies.nvidia.com and cluster-policy), set flags (e.g.,
CREATED_GPU_OPERATOR_CRD and CREATED_CLUSTER_POLICY) or record their creation
state, and then modify cleanup_fake_gpu_operator_fixture to delete those
resources only if the corresponding flag is true; alternatively, before
creating, detect existing resources and skip the test or mark them as
preexisting so cleanup does not remove them (refer to function
cleanup_fake_gpu_operator_fixture, the resource names cluster-policy and
clusterpolicies.nvidia.com).
- Around line 713-714: The kubectl wait command for deployment/gpu-operator
currently swallows failures via "|| true", so change the helper to fail when the
deployment never becomes Available: remove the "|| true" from the kubectl wait
invocation (or capture its exit code and explicitly return/exit non-zero), so
the function running the wait returns a non-zero status on timeout and callers
can skip or abort cleanly; update the kubectl wait invocation (the line invoking
"kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator
--timeout=60s") to propagate failures instead of hiding them.
In `@validators/deployment/expected_resources.go`:
- Around line 174-175: The Kubernetes API calls (e.g.,
ctx.Clientset.CoreV1().Namespaces().Get, discovery and dynamic-client calls)
must use a bounded context instead of ctx.Ctx to avoid stalling the whole
validator; create a short-lived context using the same pattern as
verifyDRAKubeletPluginReady (e.g., call
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a timed context, use
that context for the Get/Discovery/Dynamic calls, and ensure you Cancel/cleanup
the timed context), and replace direct uses of ctx.Ctx in those calls (locations
around the Namespace Get, the discovery calls, and the dynamic client usage)
with the new timed context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 643460d4-e2a9-4d4d-97c0-94c575219b90
📒 Files selected for processing (4)
docs/user/cli-reference.mdtests/e2e/run.shvalidators/deployment/expected_resources.govalidators/deployment/expected_resources_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (5)
validators/deployment/expected_resources.go (3)
164-188:⚠️ Potential issue | 🟠 MajorWrap namespace lookup in a bounded context.
verifyNamespacesActivecallsctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...)without a timeout. If the API server stalls, this probe can consume the entire deployment-validation budget and hang the validator job.In contrast,
verifyDRAKubeletPluginReadycorrectly usesctx.Timeout(defaults.ResourceVerificationTimeout). For consistency and resilience, apply the same pattern here.🔧 Suggested fix
func verifyNamespacesActive(ctx *validators.Context, refs []recipe.ComponentRef) []string { var failures []string seen := make(map[string]bool, len(refs)) + verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout) + defer cancel() + for _, ref := range refs { if ref.Namespace == "" || seen[ref.Namespace] { continue } seen[ref.Namespace] = true - ns, err := ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ref.Namespace, metav1.GetOptions{}) + ns, err := ctx.Clientset.CoreV1().Namespaces().Get(verifyCtx, ref.Namespace, metav1.GetOptions{}) if err != nil {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@validators/deployment/expected_resources.go` around lines 164 - 188, verifyNamespacesActive currently calls ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap the lookup in a bounded context like the other validator functions by calling ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and cancel function, defer cancel(), and pass that bounded context into Namespaces().Get; update verifyNamespacesActive to use the new ctx from ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to prevent hanging.
391-409:⚠️ Potential issue | 🟠 MajorAdd bounded context to ClusterPolicy readiness check.
Same issue as
verifySkyhookReady: the dynamic Get at line 409 usesctx.Ctxdirectly. Use a bounded context for the dynamic client call.🔧 Suggested fix
func verifyClusterPolicyReady(ctx *validators.Context) error { + verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout) + defer cancel() + gv := clusterPolicyGVR.GroupVersion().String() _, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv) // ... discovery switch ... dynClient, err := getDynamicClient(ctx) if err != nil { return err } - clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(ctx.Ctx, clusterPolicyName, metav1.GetOptions{}) + clusterPolicy, err := dynClient.Resource(clusterPolicyGVR).Get(verifyCtx, clusterPolicyName, metav1.GetOptions{})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@validators/deployment/expected_resources.go` around lines 391 - 409, The dynamic Get call using ctx.Ctx for clusterPolicy retrieval is unbounded; create a bounded context (e.g., boundedCtx, cancel := context.WithTimeout(ctx.Ctx, defaultTimeout) / defer cancel()) and use boundedCtx in dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx, ensuring you import context and define/choose an appropriate timeout constant (or reuse an existing one) and call cancel() to avoid leaking goroutines; adjust the code paths around getDynamicClient, dynClient, clusterPolicyGVR and clusterPolicyName accordingly.
249-273:⚠️ Potential issue | 🟠 MajorAdd bounded contexts to Skyhook readiness checks.
The discovery call (line 254) and dynamic Get calls (line 273) use
ctx.Ctxdirectly without a timeout. Apply the samectx.Timeout(defaults.ResourceVerificationTimeout)pattern used inverifyDRAKubeletPluginReadyfor consistency.🔧 Suggested fix
func verifySkyhookReady(ctx *validators.Context, ref recipe.ComponentRef) error { expectedNames, err := expectedSkyhookNames(ref) if err != nil { return err } if len(expectedNames) == 0 { return errors.New(errors.ErrCodeNotFound, fmt.Sprintf("no Skyhook CR names could be extracted from component %s manifestFiles=%v", ref.Name, ref.ManifestFiles)) } + verifyCtx, cancel := ctx.Timeout(defaults.ResourceVerificationTimeout) + defer cancel() + gv := skyhookGVR.GroupVersion().String() - _, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv) + _, discErr := ctx.Clientset.Discovery().ServerResourcesForGroupVersion(gv) // Note: Discovery doesn't take context // ... rest of function should use verifyCtx for dynamic client callsNote: The Discovery client doesn't accept a context parameter, but the dynamic client Get calls should use the bounded context:
- sk, getErr := dynClient.Resource(skyhookGVR).Get(ctx.Ctx, name, metav1.GetOptions{}) + sk, getErr := dynClient.Resource(skyhookGVR).Get(verifyCtx, name, metav1.GetOptions{})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@validators/deployment/expected_resources.go` around lines 249 - 273, The discovery call ServerResourcesForGroupVersion cannot take a context, but the dynamic client Get calls do need a bounded context like verifyDRAKubeletPluginReady uses: create a bounded context via ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx := ctx.Timeout(defaults.ResourceVerificationTimeout) and defer boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx (or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern around the loop that calls getDynamicClient and dynClient.Resource(...).Get to ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.tests/e2e/run.sh (2)
716-720:⚠️ Potential issue | 🟠 MajorGuard cleanup to avoid deleting pre-existing GPU Operator resources.
cleanup_fake_gpu_operator_fixture()unconditionally deletes the ClusterPolicy and CRD. If this script runs against a cluster with a real GPU Operator installed, teardown would delete production resources.Consider tracking whether the fixture created these objects and only deleting them if so, or checking for pre-existing resources before setup and skipping the test entirely.
🔧 Suggested approach
+# Track what the fixture created +CREATED_GPU_OPERATOR_CRD=false +CREATED_CLUSTER_POLICY=false + setup_fake_gpu_operator_fixture() { + # Check if CRD already exists (real gpu-operator installed) + if kubectl get crd clusterpolicies.nvidia.com &>/dev/null; then + echo "ClusterPolicy CRD already exists (real gpu-operator?), skipping fixture" + return 1 + fi + kubectl create namespace gpu-operator --dry-run=client -o yaml | kubectl apply -f - 2>&1 || true # ... CRD apply ... + CREATED_GPU_OPERATOR_CRD=true # ... ClusterPolicy apply ... + CREATED_CLUSTER_POLICY=true # ... } cleanup_fake_gpu_operator_fixture() { kubectl delete deployment gpu-operator -n gpu-operator --ignore-not-found 2>&1 || true - kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true - kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true + if [ "$CREATED_CLUSTER_POLICY" = true ]; then + kubectl delete clusterpolicy cluster-policy --ignore-not-found 2>&1 || true + fi + if [ "$CREATED_GPU_OPERATOR_CRD" = true ]; then + kubectl delete crd clusterpolicies.nvidia.com --ignore-not-found 2>&1 || true + fi }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/run.sh` around lines 716 - 720, The cleanup function cleanup_fake_gpu_operator_fixture unconditionally deletes ClusterPolicy/CRD which can remove a real GPU Operator; modify setup and cleanup so the script first checks for pre-existing resources (e.g., run kubectl get for deployment gpu-operator, clusterpolicy cluster-policy, and crd clusterpolicies.nvidia.com) and record whether the fixture created them (store a boolean flag or temporary marker). In setup, if resources already exist, set the flag to skip teardown or abort the test; in cleanup, only run the kubectl delete lines when the recorded flag indicates the fixture created those specific objects. Ensure the flag/marker is uniquely tied to this test run so production resources are never deleted by mistake.
713-714:⚠️ Potential issue | 🟠 MajorPropagate deployment availability wait failure.
The
kubectl waitat line 713 uses|| true, which swallows the failure. If the fake gpu-operator deployment never becomes Available, subsequent tests will run against a broken fixture and fail for unrelated reasons.🔧 Suggested fix
- kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1 || true + if ! kubectl wait --for=condition=available deployment/gpu-operator -n gpu-operator --timeout=60s 2>&1; then + return 1 + fi }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/e2e/run.sh` around lines 713 - 714, The kubectl wait invocation in run.sh currently appends "|| true" which swallows failures; remove that fallback so the non‑zero exit from kubectl wait propagates (or explicitly check its exit code and exit with a helpful error) to halt the test run when the gpu-operator deployment (deployment/gpu-operator -n gpu-operator) never becomes Available; update the block containing the kubectl wait call to either drop "|| true" or add an explicit error check/exit so downstream tests don't run against a broken fixture.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/e2e/run.sh`:
- Around line 716-720: The cleanup function cleanup_fake_gpu_operator_fixture
unconditionally deletes ClusterPolicy/CRD which can remove a real GPU Operator;
modify setup and cleanup so the script first checks for pre-existing resources
(e.g., run kubectl get for deployment gpu-operator, clusterpolicy
cluster-policy, and crd clusterpolicies.nvidia.com) and record whether the
fixture created them (store a boolean flag or temporary marker). In setup, if
resources already exist, set the flag to skip teardown or abort the test; in
cleanup, only run the kubectl delete lines when the recorded flag indicates the
fixture created those specific objects. Ensure the flag/marker is uniquely tied
to this test run so production resources are never deleted by mistake.
- Around line 713-714: The kubectl wait invocation in run.sh currently appends
"|| true" which swallows failures; remove that fallback so the non‑zero exit
from kubectl wait propagates (or explicitly check its exit code and exit with a
helpful error) to halt the test run when the gpu-operator deployment
(deployment/gpu-operator -n gpu-operator) never becomes Available; update the
block containing the kubectl wait call to either drop "|| true" or add an
explicit error check/exit so downstream tests don't run against a broken
fixture.
In `@validators/deployment/expected_resources.go`:
- Around line 164-188: verifyNamespacesActive currently calls
ctx.Clientset.CoreV1().Namespaces().Get(ctx.Ctx, ...) without a timeout; wrap
the lookup in a bounded context like the other validator functions by calling
ctx.Timeout(defaults.ResourceVerificationTimeout) to obtain a new context and
cancel function, defer cancel(), and pass that bounded context into
Namespaces().Get; update verifyNamespacesActive to use the new ctx from
ctx.Timeout(defaults.ResourceVerificationTimeout) for each namespace lookup to
prevent hanging.
- Around line 391-409: The dynamic Get call using ctx.Ctx for clusterPolicy
retrieval is unbounded; create a bounded context (e.g., boundedCtx, cancel :=
context.WithTimeout(ctx.Ctx, defaultTimeout) / defer cancel()) and use
boundedCtx in dynClient.Resource(clusterPolicyGVR).Get(...) instead of ctx.Ctx,
ensuring you import context and define/choose an appropriate timeout constant
(or reuse an existing one) and call cancel() to avoid leaking goroutines; adjust
the code paths around getDynamicClient, dynClient, clusterPolicyGVR and
clusterPolicyName accordingly.
- Around line 249-273: The discovery call ServerResourcesForGroupVersion cannot
take a context, but the dynamic client Get calls do need a bounded context like
verifyDRAKubeletPluginReady uses: create a bounded context via
ctx.Timeout(defaults.ResourceVerificationTimeout) (e.g. boundedCtx :=
ctx.Timeout(defaults.ResourceVerificationTimeout) and defer
boundedCtx.Close()/cancel if required by your ctx type) and pass boundedCtx.Ctx
(or the appropriate field) into dynClient.Resource(skyhookGVR).Get(...) instead
of ctx.Ctx; keep the discovery call as-is but use the same timeout pattern
around the loop that calls getDynamicClient and dynClient.Resource(...).Get to
ensure each Get is timeboxed consistent with verifyDRAKubeletPluginReady.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 1a00896b-3122-4a5c-a869-cddebd8ee65f
📒 Files selected for processing (4)
docs/user/cli-reference.mdtests/e2e/run.shvalidators/deployment/expected_resources.govalidators/deployment/expected_resources_test.go
mchmarny
left a comment
There was a problem hiding this comment.
Solid PR — the deployer-neutral design, discovery-gated CRD checks with fail-closed semantics, and scoped Skyhook readiness (by recipe manifest names, not cluster-wide label match) are well thought through. CI is fully green. Test coverage is thorough across the behavioral surface.
One medium finding: the test fake's Get method returns a bare errors.New("object not found") which doesn't satisfy apierrors.IsNotFound(), leaving the specific "recipe declared it but cluster has no such CR" branch in verifySkyhookReady untested. The rest are nits (test-only constants in production code, minor getDynamicClient caching opportunity).
Nothing blocks merge — the medium item is a test fidelity gap, not a production correctness issue.
Summary
Strengthen
aicr validate --phase deploymentby adding the targeted GPU-workload readiness signals the current validator misses — Skyhook customizations, NVIDIA DRA driver, and GPU Operator (ClusterPolicy) — on top of the existing namespace andexpectedResourceschecks.Concretely, this PR:
ActiveexpectedResourcesare healthyskyhook-customizations: Skyhook CR completionnvidia-dra-driver-gpu: kubelet-plugin readinessgpu-operator:ClusterPolicy.status.state == readySo the main new value is the addition of the targeted Skyhook, DRA, and GPU Operator deep readiness checks on top of the preserved baseline deployment checks.
The check stays strictly deployer-neutral: no Helm API calls, no dependence on Argo CD or any other deployer, no reads of release metadata, no release-identity assumptions. Every lookup is a plain Kubernetes API call against live-cluster object shape, so the command can be used as a standalone tool to verify any Kubernetes cluster's runtime install regardless of how the bundle was deployed (
deploy.sh, Argo CD, kapp, helmfile, plainkubectl apply). Validated end-to-end on a real EKS H100 cluster.Motivation / Context
#607narrowed this follow-up to reusing and enhancing the existing deployment validator path instead of adding a new phase, flag, or Chainsaw runtime dependency. The previous deployment phase verified install/resource existence, but it did not cover the late-converging readiness signals that explain the post-install gap on fresh GPU clusters.Fixes: #607
Related:
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)validators/deployment)pkg/errors,pkg/k8s)docs/)tests/e2e)Implementation Notes
Deployer-neutrality stance
The deployment-phase check stays strictly deployer-agnostic:
helm list).app.kubernetes.io/instance.Every lookup is a plain Kubernetes API call against live-cluster object shape — labels, discovery, name conventions written into the chart templates themselves. The check is valid regardless of whether the bundle was deployed via
deploy.sh(Helm), Argo CD (Helm or Kustomize), kapp, helmfile, or plainkubectl apply.Checks added or extended
Active.expectedResources(Deployments, DaemonSets, etc.) must exist and be healthy.ComponentRef.ManifestFilesof theskyhook-customizationscomponent through the recipe data provider, extracts themetadata.nameof every Skyhook resource declared, and verifies each one exists on the cluster withstatus.status == "complete". A narrow Helm-template-tolerant regex extractor is used because manifests contain Helm{{ ... }}directives that break a full YAML parse. Skyhook CRs present on the cluster that are not declared by this recipe (stale from prior deploys, other tenants) are ignored. Fails closed when an enabledskyhook-customizationsref declares no extractable Skyhook names (recipe misconfiguration).nvidia.com/v1is used up-front to disambiguate three cases the dynamic-clientGetconflates:IsNotFoundfrom discovery) → skip per#607acceptance.cluster-policyCR absent → fail closed (gpu-operator installed but not reconciled).-kubelet-plugin. 0 matches → fail with the DaemonSets seen in the namespace. >1 matches → fail with the matched names. This is a chart-shape assumption (the upstreamnvidia-dra-driver-gpuchart always names that DaemonSet<fullname>-kubelet-plugin), not a release-identity assumption — it survivesfullnameOverrideand non-Helm deployers. Readiness gate: bothDesiredNumberScheduled > 0andNumberReady == DesiredNumberScheduled.Skyhook is a cluster-scoped CR; the name-based
Getpath is cluster-scoped by construction, avoiding the namespaced-List-against-cluster-scoped-kind 404 that an earlier approach would have hit.Supporting changes
tests/e2e/run.shseeds a fakeClusterPolicyCRD (no status subresource — deliberate, and commented) plus a readycluster-policyCR so deployment-phase E2E exercises the strengthened validator behavior.docs/user/cli-reference.mddocuments the expanded deployment phase: the full list of checks, the graceful-skip path when a CRD is absent, the fail-closed path for a missing CR with the CRD present, and the fail-closed path for non-NotFound discovery errors (with a concrete RBAC 403 example).Testing
go test -race ./validators/deployment/...: passed.validators/deploymentcoverage: substantially improved vsorigin/main(new tests span the full behavioral surface below).make test-coverage: passed (above the70%project-wide threshold).unset GITLAB_TOKEN && make qualify: passed.Added / updated regression coverage:
DRA kubelet-plugin discovery
my-custom-gpu-kubelet-plugin) still passes — proves the role-suffix approach is not coupled to the default name.node-exporter) is ignored.0/0and under-scheduled DaemonSet fail the readiness gate.Skyhook readiness
skyhook-customizationsref with no extractable Skyhook names → fail closed.extractSkyhookNamesFromManifest, including thetuning-gke.yamlcase (metadata.name is"tuning", not"tuning-gke"), Helm-template preamble tolerance, and templated-name skip.ClusterPolicy readiness
status.state→ fail.Other
Active→ fail.TestMainforces the global recipe data provider to initialize before anyt.Parallel()test callsrecipe.GetManifestContent, working around a pre-existing race on lazy init inpkg/recipe/provider.go.Live-cluster validation
Ran
aicr validate --recipe recipe.yaml --phase deploymentagainst a real EKS H100 cluster (aws-us-east-1-aicr-cuj2) with the dynamo-platform inference recipe:operator-healthexpected-resourcesActivefor 10 namespaces, Skyhooktuningstatus.status==complete(name-based from the recipe's manifestFiles),ClusterPolicy.status.state==ready(discovery-gated), and the DRA kubelet-plugin DaemonSet via-kubelet-pluginsuffix matchgpu-operator-versioncheck-nvidia-smiExit code was
0; CTRF report archived locally.Risk Assessment
Rollout notes: No migration or flag changes. This strengthens deployment-phase validation only.
ManifestFiles; unrelated Skyhooks on the cluster are ignored.Checklist
make testwith-race)make lint)git commit -S) — GPG signing info