fix(validator): drop single-shot ClusterPolicy check from expected-resources by njhensley · Pull Request #1495 · NVIDIA/aicr

njhensley · 2026-06-26T16:24:03Z

Summary

Removes the redundant single-shot Go ClusterPolicy readiness check from the deployment-phase expected-resources validator. The registry-declared chainsaw gpu-operator health check already asserts the same ClusterPolicy state: ready with a polling timeout, so the Go check added nothing but flake.

Motivation / Context

The UAT - AWS run failed the expected-resources gate it gates on, with a single failure: [INTERNAL] ClusterPolicy state=notReady (want ready). Investigation showed the GPU Operator readiness signal was being checked two ways inside the same validator:

verifyClusterPolicyReady (Go) read ClusterPolicy.status.state once, with no retry — it sampled ~13s into the run.
The chainsaw gpu-operator health check (recipes/checks/gpu-operator/health-check.yaml, step validate-cluster-policy-ready) asserts the same state: ready but polls up to 5m.

On a fresh node the nvidia-operator-validator runs four sequential init containers (driver → toolkit → cuda → plugin) before ClusterPolicy reconciles to ready — ~2.5m in the failing run. The Go check fired far too early and flaked the whole deployment gate, while the chainsaw check waited and passed. (Nodewright tuning was already complete and the DRA kubelet-plugin healthy, so neither was the cause.)

This completes the clusterPolicyReady migration that was deferred in #1220 / #1235 (the chainsaw replacement landed but the Go original was never removed).

Fixes: N/A
Related: #1220, #1235

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Validator (pkg/validator)

Implementation Notes

Removed verifyClusterPolicyReady, its call site in verifyGPUReadinessSignals, and the now-unused gpuOperatorComponent / clusterPolicyName / clusterPolicyReadyState consts and clusterPolicyGVR var.
The chainsaw assert is co-extensive with the removed Go check (both fire only when gpu-operator is enabled) and strictly superior (polls instead of single-shot), so there is no coverage loss.
verifyNodewrightReady and verifyDRAKubeletPluginReady deliberately stay in Go — their resource names are dynamically/release-derived and cannot be expressed in static chainsaw YAML. Neither was implicated in this flake.
Fail-closed-on-discovery-error coverage is preserved by re-targeting TestCheckExpectedResources_FailsWhenDiscoveryReturnsNonNotFoundError at the Nodewright discovery gate, which shares the same pattern.

Testing

go build ./validators/...
go test ./validators/...                                  # all packages pass
golangci-lint run -c .golangci.yaml ./validators/deployment/...   # 0 issues (pinned v2.12.2)

go test ./validators/... — all packages pass (deployment, chainsaw, conformance, helper, performance).
Lint clean with the repo-pinned golangci-lint v2.12.2.
Note: -race could not be exercised in my environment (no C compiler); CI's make qualify will run the race detector.

Risk Assessment

Low — Isolated change, well-tested, easy to revert

Rollout notes: No behavior change for healthy clusters — the deeper ClusterPolicy state: ready signal is still asserted via the chainsaw gpu-operator health check, now with correct polling semantics. The only residual consideration: the chainsaw assert's 5m poll must cover ClusterPolicy reconcile on a fresh node (observed ~2.5m); a genuinely slower reconcile would now fail legitimately rather than flakily.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

…sources The deployment-phase expected-resources gate checked GPU Operator ClusterPolicy readiness two ways: the Go-resident verifyClusterPolicyReady read status.state once with no retry, while the registry-declared chainsaw gpu-operator health check asserts the same state=ready with a 5m poll. On a fresh node the nvidia-operator-validator runs four sequential init containers (driver, toolkit, cuda, plugin) before ClusterPolicy reconciles to ready — ~2.5m in the failing UAT run. The Go check sampled at ~13s, saw notReady, and flaked the whole gate; the chainsaw check waited and passed. Remove verifyClusterPolicyReady and its constants/fixtures, completing the clusterPolicyReady migration deferred in NVIDIA#1220/NVIDIA#1235. The chainsaw assert is co-extensive (both fire only when gpu-operator is enabled) and strictly superior, so there is no coverage loss. verifyNodewrightReady and verifyDRAKubeletPluginReady stay in Go — their resource names are dynamically/release-derived and cannot be expressed in static chainsaw YAML. Fail-closed discovery coverage is preserved by re-targeting the discovery-error test at the Nodewright gate.

coderabbitai · 2026-06-26T16:28:13Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 906647d5-6cc3-4419-856c-735ddadcf172

📥 Commits

Reviewing files that changed from the base of the PR and between eac5eee and 47f2de9.

📒 Files selected for processing (2)

validators/deployment/expected_resources.go
validators/deployment/expected_resources_test.go

📝 Walkthrough

Walkthrough

This PR removes the Go-based ClusterPolicy readiness check from deployment expected-resources validation. GPU readiness now checks Nodewright customizations and the DRA kubelet-plugin, while the ClusterPolicy ready signal is documented as handled in Chainsaw YAML. The tests and fake discovery helpers were updated to focus on Nodewright readiness, discovery failures, and removal of ClusterPolicy-specific fixtures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/aicr#1426: Adds a GPU-operator ClusterPolicy .status.state=="ready" polling gate, which matches the readiness signal removed from Go validation here.
NVIDIA/aicr#1429: Introduces Nodewright/Skyhook readiness gating before ClusterPolicy checks, overlapping with the Nodewright-focused validation path kept in this PR.
NVIDIA/aicr#1439: Replaces polling the ClusterPolicy proxy state==ready in UAT with aicr validate --phase deployment, aligning with the ClusterPolicy readiness migration in this PR.

Suggested labels

area/tests

Suggested reviewers

mchmarny

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: removing the single-shot ClusterPolicy check from expected-resources.
Description check	✅ Passed	The description directly matches the changeset and explains the motivation, implementation, and testing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

…lake

mchmarny

Clean removal of the redundant single-shot Go ClusterPolicy readiness check. Verified the chainsaw gpu-operator health check (validate-cluster-policy-ready) asserts the same state: ready with a 5m poll, registered via assertFile so it fires exactly when gpu-operator is enabled — co-extensive and strictly superior to the deleted Go check. Fail-closed and CR-missing coverage are preserved via the Nodewright gate. CI fully green.

Minor non-blocking note: the old Go check skipped (per #607) when the ClusterPolicy CRD wasn't registered; the chainsaw step instead asserts the CRD is Established and fails after 5m. That's correct for the deployment phase (operator already installed) and arguably stricter-better — not a regression.

LGTM.

- Restore .github/workflows/rekor-monitor.yaml lost during squash (was added in #1480, absent from stale branch base) - Remove verifyClusterPolicyReady call and function reverted by squash (#1495 had removed it; stale branch base brought it back) - Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499 domain migration) - Add `ol` to OS enum description in pkg/client/v1/types.go and pkg/fingerprint/doc.go Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>

njhensley requested a review from a team as a code owner June 26, 2026 16:24

njhensley added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 26, 2026

github-actions Bot added the size/L label Jun 26, 2026

Merge branch 'main' into validator/expected-resources-clusterpolicy-f…

fd66f25

…lake

mchmarny approved these changes Jun 26, 2026

View reviewed changes

njhensley merged commit 6ac2cfc into NVIDIA:main Jun 26, 2026
30 checks passed

mchmarny mentioned this pull request Jun 29, 2026

feat(recipe): add L40S accelerator support #1510

Merged

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(validator): drop single-shot ClusterPolicy check from expected-resources#1495

fix(validator): drop single-shot ClusterPolicy check from expected-resources#1495
njhensley merged 2 commits into
NVIDIA:mainfrom
njhensley:validator/expected-resources-clusterpolicy-flake

njhensley commented Jun 26, 2026

Uh oh!

coderabbitai Bot commented Jun 26, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

mchmarny left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

njhensley commented Jun 26, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 26, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants