Skip to content

fix(validator): drop single-shot ClusterPolicy check from expected-resources#1495

Merged
njhensley merged 2 commits into
NVIDIA:mainfrom
njhensley:validator/expected-resources-clusterpolicy-flake
Jun 26, 2026
Merged

fix(validator): drop single-shot ClusterPolicy check from expected-resources#1495
njhensley merged 2 commits into
NVIDIA:mainfrom
njhensley:validator/expected-resources-clusterpolicy-flake

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Removes the redundant single-shot Go ClusterPolicy readiness check from the deployment-phase expected-resources validator. The registry-declared chainsaw gpu-operator health check already asserts the same ClusterPolicy state: ready with a polling timeout, so the Go check added nothing but flake.

Motivation / Context

The UAT - AWS run failed the expected-resources gate it gates on, with a single failure: [INTERNAL] ClusterPolicy state=notReady (want ready). Investigation showed the GPU Operator readiness signal was being checked two ways inside the same validator:

  • verifyClusterPolicyReady (Go) read ClusterPolicy.status.state once, with no retry — it sampled ~13s into the run.
  • The chainsaw gpu-operator health check (recipes/checks/gpu-operator/health-check.yaml, step validate-cluster-policy-ready) asserts the same state: ready but polls up to 5m.

On a fresh node the nvidia-operator-validator runs four sequential init containers (driver → toolkit → cuda → plugin) before ClusterPolicy reconciles to ready — ~2.5m in the failing run. The Go check fired far too early and flaked the whole deployment gate, while the chainsaw check waited and passed. (Nodewright tuning was already complete and the DRA kubelet-plugin healthy, so neither was the cause.)

This completes the clusterPolicyReady migration that was deferred in #1220 / #1235 (the chainsaw replacement landed but the Go original was never removed).

Fixes: N/A
Related: #1220, #1235

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Validator (pkg/validator)

Implementation Notes

  • Removed verifyClusterPolicyReady, its call site in verifyGPUReadinessSignals, and the now-unused gpuOperatorComponent / clusterPolicyName / clusterPolicyReadyState consts and clusterPolicyGVR var.
  • The chainsaw assert is co-extensive with the removed Go check (both fire only when gpu-operator is enabled) and strictly superior (polls instead of single-shot), so there is no coverage loss.
  • verifyNodewrightReady and verifyDRAKubeletPluginReady deliberately stay in Go — their resource names are dynamically/release-derived and cannot be expressed in static chainsaw YAML. Neither was implicated in this flake.
  • Fail-closed-on-discovery-error coverage is preserved by re-targeting TestCheckExpectedResources_FailsWhenDiscoveryReturnsNonNotFoundError at the Nodewright discovery gate, which shares the same pattern.

Testing

go build ./validators/...
go test ./validators/...                                  # all packages pass
golangci-lint run -c .golangci.yaml ./validators/deployment/...   # 0 issues (pinned v2.12.2)
  • go test ./validators/... — all packages pass (deployment, chainsaw, conformance, helper, performance).
  • Lint clean with the repo-pinned golangci-lint v2.12.2.
  • Note: -race could not be exercised in my environment (no C compiler); CI's make qualify will run the race detector.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: No behavior change for healthy clusters — the deeper ClusterPolicy state: ready signal is still asserted via the chainsaw gpu-operator health check, now with correct polling semantics. The only residual consideration: the chainsaw assert's 5m poll must cover ClusterPolicy reconcile on a fresh node (observed ~2.5m); a genuinely slower reconcile would now fail legitimately rather than flakily.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…sources

The deployment-phase expected-resources gate checked GPU Operator
ClusterPolicy readiness two ways: the Go-resident
verifyClusterPolicyReady
read status.state once with no retry, while the registry-declared
chainsaw
gpu-operator health check asserts the same state=ready with a 5m poll.

On a fresh node the nvidia-operator-validator runs four sequential init
containers (driver, toolkit, cuda, plugin) before ClusterPolicy
reconciles
to ready — ~2.5m in the failing UAT run. The Go check sampled at ~13s,
saw
notReady, and flaked the whole gate; the chainsaw check waited and
passed.

Remove verifyClusterPolicyReady and its constants/fixtures, completing
the
clusterPolicyReady migration deferred in NVIDIA#1220/NVIDIA#1235. The chainsaw
assert is
co-extensive (both fire only when gpu-operator is enabled) and strictly
superior, so there is no coverage loss. verifyNodewrightReady and
verifyDRAKubeletPluginReady stay in Go — their resource names are
dynamically/release-derived and cannot be expressed in static chainsaw
YAML.

Fail-closed discovery coverage is preserved by re-targeting the
discovery-error test at the Nodewright gate.
@njhensley njhensley requested a review from a team as a code owner June 26, 2026 16:24
@njhensley njhensley added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 26, 2026
@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 906647d5-6cc3-4419-856c-735ddadcf172

📥 Commits

Reviewing files that changed from the base of the PR and between eac5eee and 47f2de9.

📒 Files selected for processing (2)
  • validators/deployment/expected_resources.go
  • validators/deployment/expected_resources_test.go

📝 Walkthrough

Walkthrough

This PR removes the Go-based ClusterPolicy readiness check from deployment expected-resources validation. GPU readiness now checks Nodewright customizations and the DRA kubelet-plugin, while the ClusterPolicy ready signal is documented as handled in Chainsaw YAML. The tests and fake discovery helpers were updated to focus on Nodewright readiness, discovery failures, and removal of ClusterPolicy-specific fixtures.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/aicr#1426: Adds a GPU-operator ClusterPolicy .status.state=="ready" polling gate, which matches the readiness signal removed from Go validation here.
  • NVIDIA/aicr#1429: Introduces Nodewright/Skyhook readiness gating before ClusterPolicy checks, overlapping with the Nodewright-focused validation path kept in this PR.
  • NVIDIA/aicr#1439: Replaces polling the ClusterPolicy proxy state==ready in UAT with aicr validate --phase deployment, aligning with the ClusterPolicy readiness migration in this PR.

Suggested labels

area/tests

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: removing the single-shot ClusterPolicy check from expected-resources.
Description check ✅ Passed The description directly matches the changeset and explains the motivation, implementation, and testing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean removal of the redundant single-shot Go ClusterPolicy readiness check. Verified the chainsaw gpu-operator health check (validate-cluster-policy-ready) asserts the same state: ready with a 5m poll, registered via assertFile so it fires exactly when gpu-operator is enabled — co-extensive and strictly superior to the deleted Go check. Fail-closed and CR-missing coverage are preserved via the Nodewright gate. CI fully green.

Minor non-blocking note: the old Go check skipped (per #607) when the ClusterPolicy CRD wasn't registered; the chainsaw step instead asserts the CRD is Established and fails after 5m. That's correct for the deployment phase (operator already installed) and arguably stricter-better — not a regression.

LGTM.

@njhensley njhensley merged commit 6ac2cfc into NVIDIA:main Jun 26, 2026
30 checks passed
atif1996 added a commit that referenced this pull request Jun 29, 2026
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was
  added in #1480, absent from stale branch base)
- Remove verifyClusterPolicyReady call and function reverted by squash
  (#1495 had removed it; stale branch base brought it back)
- Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in
  l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499
  domain migration)
- Add `ol` to OS enum description in pkg/client/v1/types.go and
  pkg/fingerprint/doc.go

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
atif1996 added a commit that referenced this pull request Jun 29, 2026
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was
  added in #1480, absent from stale branch base)
- Remove verifyClusterPolicyReady call and function reverted by squash
  (#1495 had removed it; stale branch base brought it back)
- Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in
  l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499
  domain migration)
- Add `ol` to OS enum description in pkg/client/v1/types.go and
  pkg/fingerprint/doc.go

Signed-off-by: Atif Mahmood <[email protected]>
Signed-off-by: Atif Mahmood <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L theme/validation Constraint evaluation, health checks, and conformance evidence

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants