fix(validator): drop single-shot ClusterPolicy check from expected-resources#1495
Conversation
…sources The deployment-phase expected-resources gate checked GPU Operator ClusterPolicy readiness two ways: the Go-resident verifyClusterPolicyReady read status.state once with no retry, while the registry-declared chainsaw gpu-operator health check asserts the same state=ready with a 5m poll. On a fresh node the nvidia-operator-validator runs four sequential init containers (driver, toolkit, cuda, plugin) before ClusterPolicy reconciles to ready — ~2.5m in the failing UAT run. The Go check sampled at ~13s, saw notReady, and flaked the whole gate; the chainsaw check waited and passed. Remove verifyClusterPolicyReady and its constants/fixtures, completing the clusterPolicyReady migration deferred in NVIDIA#1220/NVIDIA#1235. The chainsaw assert is co-extensive (both fire only when gpu-operator is enabled) and strictly superior, so there is no coverage loss. verifyNodewrightReady and verifyDRAKubeletPluginReady stay in Go — their resource names are dynamically/release-derived and cannot be expressed in static chainsaw YAML. Fail-closed discovery coverage is preserved by re-targeting the discovery-error test at the Nodewright gate.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis PR removes the Go-based ClusterPolicy readiness check from deployment expected-resources validation. GPU readiness now checks Nodewright customizations and the DRA kubelet-plugin, while the ClusterPolicy ready signal is documented as handled in Chainsaw YAML. The tests and fake discovery helpers were updated to focus on Nodewright readiness, discovery failures, and removal of ClusterPolicy-specific fixtures. Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
mchmarny
left a comment
There was a problem hiding this comment.
Clean removal of the redundant single-shot Go ClusterPolicy readiness check. Verified the chainsaw gpu-operator health check (validate-cluster-policy-ready) asserts the same state: ready with a 5m poll, registered via assertFile so it fires exactly when gpu-operator is enabled — co-extensive and strictly superior to the deleted Go check. Fail-closed and CR-missing coverage are preserved via the Nodewright gate. CI fully green.
Minor non-blocking note: the old Go check skipped (per #607) when the ClusterPolicy CRD wasn't registered; the chainsaw step instead asserts the CRD is Established and fails after 5m. That's correct for the deployment phase (operator already installed) and arguably stricter-better — not a regression.
LGTM.
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was added in #1480, absent from stale branch base) - Remove verifyClusterPolicyReady call and function reverted by squash (#1495 had removed it; stale branch base brought it back) - Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499 domain migration) - Add `ol` to OS enum description in pkg/client/v1/types.go and pkg/fingerprint/doc.go Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
- Restore .github/workflows/rekor-monitor.yaml lost during squash (was added in #1480, absent from stale branch base) - Remove verifyClusterPolicyReady call and function reverted by squash (#1495 had removed it; stale branch base brought it back) - Update apiVersion aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2 in l40s-any, l40s-oke-training, l40s-oke-inference overlays (post-#1499 domain migration) - Add `ol` to OS enum description in pkg/client/v1/types.go and pkg/fingerprint/doc.go Signed-off-by: Atif Mahmood <[email protected]> Signed-off-by: Atif Mahmood <[email protected]>
Summary
Removes the redundant single-shot Go ClusterPolicy readiness check from the deployment-phase
expected-resourcesvalidator. The registry-declared chainsawgpu-operatorhealth check already asserts the sameClusterPolicy state: readywith a polling timeout, so the Go check added nothing but flake.Motivation / Context
The UAT - AWS run failed the
expected-resourcesgate it gates on, with a single failure:[INTERNAL] ClusterPolicy state=notReady (want ready). Investigation showed the GPU Operator readiness signal was being checked two ways inside the same validator:verifyClusterPolicyReady(Go) readClusterPolicy.status.stateonce, with no retry — it sampled ~13s into the run.gpu-operatorhealth check (recipes/checks/gpu-operator/health-check.yaml, stepvalidate-cluster-policy-ready) asserts the samestate: readybut polls up to 5m.On a fresh node the
nvidia-operator-validatorruns four sequential init containers (driver → toolkit → cuda → plugin) before ClusterPolicy reconciles toready— ~2.5m in the failing run. The Go check fired far too early and flaked the whole deployment gate, while the chainsaw check waited and passed. (Nodewright tuning was alreadycompleteand the DRA kubelet-plugin healthy, so neither was the cause.)This completes the
clusterPolicyReadymigration that was deferred in #1220 / #1235 (the chainsaw replacement landed but the Go original was never removed).Fixes: N/A
Related: #1220, #1235
Type of Change
Component(s) Affected
pkg/validator)Implementation Notes
verifyClusterPolicyReady, its call site inverifyGPUReadinessSignals, and the now-unusedgpuOperatorComponent/clusterPolicyName/clusterPolicyReadyStateconsts andclusterPolicyGVRvar.gpu-operatoris enabled) and strictly superior (polls instead of single-shot), so there is no coverage loss.verifyNodewrightReadyandverifyDRAKubeletPluginReadydeliberately stay in Go — their resource names are dynamically/release-derived and cannot be expressed in static chainsaw YAML. Neither was implicated in this flake.TestCheckExpectedResources_FailsWhenDiscoveryReturnsNonNotFoundErrorat the Nodewright discovery gate, which shares the same pattern.Testing
go test ./validators/...— all packages pass (deployment, chainsaw, conformance, helper, performance).-racecould not be exercised in my environment (no C compiler); CI'smake qualifywill run the race detector.Risk Assessment
Rollout notes: No behavior change for healthy clusters — the deeper
ClusterPolicy state: readysignal is still asserted via the chainsawgpu-operatorhealth check, now with correct polling semantics. The only residual consideration: the chainsaw assert's 5m poll must cover ClusterPolicy reconcile on a fresh node (observed ~2.5m); a genuinely slower reconcile would now fail legitimately rather than flakily.Checklist
make testwith-race)make lint)git commit -S)