Skip to content

fix: harden timeout/false-PASS handling, evidence integrity, CI gaps#1511

Merged
mchmarny merged 13 commits into
mainfrom
fix/tech-debt-sweep
Jun 28, 2026
Merged

fix: harden timeout/false-PASS handling, evidence integrity, CI gaps#1511
mchmarny merged 13 commits into
mainfrom
fix/tech-debt-sweep

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Tech-debt hardening sweep: closes "false-PASS" / silent-failure paths across collectors, validators, and the evidence verifier; tightens transient-vs-deterministic error classification; fixes CI path-filter gaps; and bumps tooling + Go dependencies.

Motivation / Context

A review pass (AICR Go review lens) surfaced a cluster of fail-open and silent-truncation defects: gates/collectors that report success without verifying anything, timeouts misclassified as deterministic failures, watch-channel closures failing healthy runs, unbounded/silently-truncated reads, and e2e lanes whose paths: filters don't fire when the relevant Go code changes. None are behavior changes on the happy path; each makes a failure mode fail loud/closed instead of silent.

Fixes: N/A
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Other: pkg/evidence, pkg/serializer, validators/*, CI workflows

Implementation Notes

Organized as logically-scoped commits:

  • pkg/errors — new IsTransient helper (context cancel/deadline or ErrCodeTimeout); dedupes the transient check repeated across collectors/snapshotter.
  • Collectors / snapshotter — sysctl skips an unreadable entry instead of aborting the whole walk; OS collector and k8s collectSafe surface ctx timeout instead of returning an empty-but-"successful" measurement; snapshotter classifies wait-deadline as ErrCodeTimeout and checks the Stdout.Write error.
  • Validator / bundlerprepareCluster uses PropagateOrWrap so coded inner errors (e.g. bad kubeconfig) stay deterministic; checksum ctx-cancel is ErrCodeTimeout; gateImage falls back to :dev for snapshot (-next) builds to avoid ImagePullBackOff.
  • Evidence — new CheckPhaseDigests verifies each predicate CTRFDigest against the on-disk ctrf/<phase>.json (fail-closed), wired into the inventory step; adds CTRFRelPath.
  • Serializer / conformance — enforce read size limits (reject oversize instead of silent truncation/accept-with-trailing-excess) and honor ctx on local reads; conformance httpGet rejects oversized scrapes.
  • Watch closures — re-Get current state when a watch channel closes without ctx cancellation (apiserver restart / LB drop), across the shared pod/job waiters (with ResourceVersion parity) and the five performance-validator watch loops.
  • CI — add missing Go consumer paths to sigstore/kms/kwok/recipe-evidence/gpu paths: filters; build + scan + clean up the aicr-gate image (previously only built on release tags, never vuln-scanned); pin fern-api from a trusted base checkout; bound kubectl/curl calls.
  • Deps/tooling — bump ko v0.18.1→v0.19.0, syft v1.45.1→v1.46.0; upgrade Go dependencies (make upgrade).

Deferred (intentionally out of scope): two pkg/cli/validate.go "logic above the facade" items (CTRF merge / recipe-evidence emission) that would change the pkg/client/v1.ValidateState public API — better isolated to their own PR.

Testing

make qualify

make qualify passes end-to-end with network access (test-coverage, lint, e2e, grype: no vulnerabilities, license headers). Re-run after the Go dependency upgrade with no regressions. New unit tests added for IsTransient, CheckPhaseDigests, CTRFRelPath, serializer oversize rejection, and newestRunnablePod.

Risk Assessment

  • Medium — Touches multiple components, but each change is fail-loud/fail-closed with no happy-path behavior change; well covered by make qualify.

Rollout notes: No migration or flags. Backwards compatible. The gateImage -next fallback and CI path-filter additions only affect build/CI selection, not produced artifacts.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

mchmarny added 9 commits June 28, 2026 07:04
… partial success

- sysctl: skip an unreadable entry instead of aborting the whole walk
- os collector: surface ctx timeout rather than a partial-but-nil-error measurement
- k8s collectSafe: propagate timeout/cancellation, swallow only deterministic failures
- snapshotter: classify wait deadline as ErrCodeTimeout; check Stdout.Write error

Signed-off-by: Mark Chmarny <[email protected]>
…age tag

- validator prepareCluster: PropagateOrWrap preserves inner codes (kubeconfig errors stay deterministic)
- checksum: ctx cancellation is ErrCodeTimeout, matching every other ctx.Err site
- readiness gateImage: fall back to :dev for snapshot (-next) builds to avoid ImagePullBackOff

Signed-off-by: Mark Chmarny <[email protected]>
The predicate's per-phase CTRFDigest was never cross-checked, so a predicate
could claim a digest disagreeing with the committed report and still pass.
Add CheckPhaseDigests (fail-closed) wired into the inventory step, plus the
CTRFRelPath helper and tests.

Signed-off-by: Mark Chmarny <[email protected]>
…nt truncation

- serializer NewFileReader: read bounded content fully so the size cap is
  actually enforced (reject oversize) and honor ctx on the local read path
- conformance httpGet: reject oversized responses rather than silently
  truncating Prometheus/DCGM scrapes into partial/missing-metric data

Signed-off-by: Mark Chmarny <[email protected]>
…ailures

A watch channel closing without ctx cancellation (apiserver restart, LB drop)
no longer fails a healthy run. Re-Get current state before declaring failure
across the shared pod/job waiters (with ResourceVersion parity) and the five
performance-validator watch loops; pick the youngest runnable pod on re-List.

Signed-off-by: Mark Chmarny <[email protected]>
- path filters: add Go consumer paths (pkg/trust, pkg/validator, pkg/recipe,
  pkg/cli, cmd/aicr, go.mod/sum) to sigstore/kms/kwok/recipe-evidence/gpu lanes
  so regressions actually trigger the relevant e2e
- vuln-scan: build, manifest, scan, and clean up the aicr-gate image (was only
  built on release tags, never scanned)
- fern preview-comment: pin fern-api version from a trusted base checkout
- bound kubectl/curl calls with request-timeout/max-time

Signed-off-by: Mark Chmarny <[email protected]>
@mchmarny mchmarny requested review from a team as code owners June 28, 2026 14:23
@mchmarny mchmarny added dependencies Pull requests that update a dependency file theme/validation Constraint evaluation, health checks, and conformance evidence labels Jun 28, 2026
@mchmarny mchmarny self-assigned this Jun 28, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.9%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.9%25-green)

Merging this branch changes the coverage (6 decrease, 2 increase)

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/bundler 67.76% (-0.04%) 👎
github.com/NVIDIA/aicr/pkg/bundler/checksum 83.87% (-1.52%) 👎
github.com/NVIDIA/aicr/pkg/collector/k8s 70.59% (-0.41%) 👎
github.com/NVIDIA/aicr/pkg/collector/os 87.13% (-5.58%) 👎
github.com/NVIDIA/aicr/pkg/errors 100.00% (ø)
github.com/NVIDIA/aicr/pkg/evidence/attestation 70.13% (+0.04%) 👍
github.com/NVIDIA/aicr/pkg/evidence/verifier 69.20% (+0.57%) 👍
github.com/NVIDIA/aicr/pkg/k8s/pod 77.95% (ø)
github.com/NVIDIA/aicr/pkg/serializer 72.10% (-0.29%) 👎
github.com/NVIDIA/aicr/pkg/snapshotter 47.01% (-0.57%) 👎
github.com/NVIDIA/aicr/pkg/validator 35.52% (ø)
github.com/NVIDIA/aicr/validators/conformance 0.00% (ø)
github.com/NVIDIA/aicr/validators/performance 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/bundler/checksum/checksum.go 83.87% (-1.52%) 93 (+4) 78 (+2) 15 (+2) 👎
github.com/NVIDIA/aicr/pkg/bundler/readiness.go 91.18% (-0.25%) 34 (-1) 31 (-1) 3 👎
github.com/NVIDIA/aicr/pkg/collector/k8s/k8s.go 81.25% (-2.96%) 64 (+7) 52 (+4) 12 (+3) 👎
github.com/NVIDIA/aicr/pkg/collector/os/os.go 84.21% (-15.79%) 19 (+4) 16 (+1) 3 (+3) 💀
github.com/NVIDIA/aicr/pkg/collector/os/sysctl.go 88.64% (-2.06%) 44 (+1) 39 5 (+1) 👎
github.com/NVIDIA/aicr/pkg/errors/errors.go 100.00% (ø) 23 (+3) 23 (+3) 0
github.com/NVIDIA/aicr/pkg/evidence/attestation/types.go 100.00% (+100.00%) 1 (+1) 1 (+1) 0 🌟
github.com/NVIDIA/aicr/pkg/evidence/verifier/inventory.go 82.86% (+3.11%) 105 (+26) 87 (+24) 18 (+2) 👍
github.com/NVIDIA/aicr/pkg/evidence/verifier/verify.go 59.41% (-1.42%) 101 (+4) 60 (+1) 41 (+3) 👎
github.com/NVIDIA/aicr/pkg/k8s/pod/job.go 65.85% (ø) 82 54 28
github.com/NVIDIA/aicr/pkg/k8s/pod/wait.go 79.82% (ø) 114 91 23
github.com/NVIDIA/aicr/pkg/serializer/reader.go 59.33% (+0.55%) 150 (+19) 89 (+12) 61 (+7) 👍
github.com/NVIDIA/aicr/pkg/snapshotter/agent.go 30.00% (-0.61%) 200 (+4) 60 140 (+4) 👎
github.com/NVIDIA/aicr/pkg/validator/validator.go 26.67% (ø) 195 52 143
github.com/NVIDIA/aicr/validators/conformance/accelerator_metrics_check.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/conformance/ai_service_metrics_check.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/conformance/helpers.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/performance/inference_perf_constraint.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/performance/nccl_all_reduce_bw_constraint.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/performance/nccl_preflight_nvreg.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/performance/trainer_lifecycle.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 55800c99-ef47-44f2-875c-4d713bf96cd6

📥 Commits

Reviewing files that changed from the base of the PR and between 9c5e920 and a51688d.

📒 Files selected for processing (5)
  • pkg/bundler/checksum/checksum.go
  • validators/performance/inference_perf_constraint.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/trainer_lifecycle.go

📝 Walkthrough

Walkthrough

This PR updates transient error classification and propagation across collectors, snapshotter, bundler, serializer, and validator code. It adds CTRF path helpers and per-phase CTRF digest verification to evidence verification and wires that check into Verify. Several Kubernetes watch-based wait paths now re-check resource state on channel closure and use timeout or unavailable error codes. CI workflow path filters are expanded, a gate image is added to vuln-scan handling, and several tool versions and command timeouts are updated.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs

  • NVIDIA/aicr#873: Adds CTRF bundle files and digests that this PR verifies per phase.
  • NVIDIA/aicr#879: Establishes the evidence verifier flow that this PR extends with phase digest checks.
  • NVIDIA/aicr#1445: Modifies the same pkg/evidence/verifier/verify.go flow that this PR updates with additional digest verification.

Suggested labels

area/recipes, area/docs, area/cli, area/tests, theme/supply-chain

Suggested reviewers

  • njhensley
  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and clearly summarizes the main hardening, evidence integrity, and CI-related changes.
Description check ✅ Passed The description is detailed and directly matches the changeset across error handling, evidence verification, CI, and dependency updates.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/tech-debt-sweep

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 75-82: The GPU H100 inference workflow trigger list is missing
vendor changes, so updates to vendored Go dependencies can bypass the
conformance run. Add vendor/** to the path filters alongside go.mod/go.sum in
the workflow condition for the aicr-related sources so any dependency vendoring
that can affect the binary still triggers this gate.

In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 75-82: The GPU training workflow path filter is missing vendored
dependency changes, so a vendor-only update can bypass the conformance run.
Update the path list in the workflow to also include vendor/** alongside go.mod
and go.sum, keeping the change aligned with the aicr validation and recipe
bundling triggers.

In @.github/workflows/gpu-smoke-test.yaml:
- Around line 64-68: The GPU smoke workflow path filter still misses
dependency-only changes, so update the trigger logic in the workflow to also
match go.mod, go.sum, and vendor/** alongside the existing pkg/validator,
pkg/cli, and cmd/aicr paths. This ensures dependency updates that change the
built aicr binary still run the smoke job; adjust the same paths block in the
workflow YAML so these dependency files are included.

In @.github/workflows/kwok-recipes.yaml:
- Around line 25-32: Update the workflow trigger filters in the kwok-recipes
workflow so both the push and pull_request path lists include vendor/**
alongside go.mod and go.sum. The issue is that vendor-only dependency changes
still bypass this workflow, so mirror the vendored-dependency trigger in both
matching blocks that cover pkg/recipe, pkg/bundler, pkg/cli, cmd/aicr, go.mod,
and go.sum.

In @.github/workflows/recipe-evidence.yaml:
- Around line 46-54: The workflow path filter for the evidence gate is missing
vendor changes, so updates to dependencies can bypass this check. In the
.github/workflows/recipe-evidence.yaml trigger list, add the vendor directory
alongside go.mod and go.sum so changes affecting bundled dependencies rerun the
gate. Use the existing aicr evidence verify gate entries as the anchor when
updating the paths.

In @.github/workflows/sigstore-scaffolding-e2e.yaml:
- Around line 44-48: The Sigstore e2e path filter is missing vendored dependency
changes, so updates under vendor/** can still skip coverage. Update the
workflow’s changed-path list in the Sigstore scaffolding e2e trigger to include
vendor/** alongside go.mod and go.sum, using the existing path patterns in the
workflow file so dependency-only and vendored-tree changes both trigger the job.

In @.github/workflows/vuln-scan-images.yaml:
- Around line 181-183: The workflow’s permissions block grants packages: write
without explanation, which triggers the undocumented-permissions warning. Add a
short inline rationale comment near the permissions block in the
vuln-scan-images workflow, and keep the existing contents: read / packages:
write settings unchanged. Reference the workflow permissions section so it’s
easy to locate even if the YAML shifts.

In `@Makefile`:
- Around line 713-714: The KWOK install steps in the Makefile only time out the
kubectl API call, not the remote YAML fetch. Update the two kubectl apply
invocations in the KWOK setup target to fetch each manifest with an explicit
client-side timeout first, then pipe the downloaded content into kubectl apply
so the network download itself is bounded. Use the existing KWOK_VERSION-based
URLs and keep the change localized to the KWOK manifest download commands.

In `@pkg/bundler/checksum/checksum.go`:
- Around line 46-48: GenerateChecksums only checks ctx.Err() before starting, so
cancellation during the file-hashing loop is ignored. Update GenerateChecksums
to re-check the context inside the checksum-processing loop (and before writing
checksums.txt), using ctx.Err() or ctx.Done() so the function exits promptly on
timeout. Keep the fix localized around GenerateChecksums and the per-file
hashing/write path.

In `@pkg/collector/os/sysctl.go`:
- Around line 49-63: Preserve the callback’s timeout classification at the
collectSysctl boundary: the walk callback now returns a structured
ErrCodeTimeout, but collectSysctl still converts any WalkDir error into
ErrCodeInternal and loses that signal. Update the error handling around the
WalkDir call in collectSysctl to use errors.PropagateOrWrap(err, errCode,
message) (or equivalent structured propagation) so a timeout from the callback
remains ErrCodeTimeout while other walk failures are still wrapped
appropriately. Refer to collectSysctl and the WalkDir callback logic when
applying the change.

In `@pkg/errors/errors_test.go`:
- Around line 24-39: The IsTransient test table in TestIsTransient is missing
coverage for a mixed unwrap chain where a non-timeout outer error wraps a
timeout inner error. Add a case using Wrap with ErrCodeInternal on the outside
and New(ErrCodeTimeout, ...) inside, and assert IsTransient returns true so the
test explicitly covers timeout detection anywhere in the unwrap chain.

In `@pkg/evidence/verifier/inventory.go`:
- Around line 141-145: The inventory verification logic in Verify currently only
iterates attestation.AllPhases and skips missing entries, which allows unknown
or misspelled keys in pred.Phases to pass unchecked. Update the phase validation
in pkg/evidence/verifier/inventory.go so Verify explicitly rejects any predicate
phase key not present in attestation.AllPhases, rather than silently continuing.
Use the existing pred.Phases, attestation.AllPhases, and Verify flow to add a
fail-closed check before concluding inventory success.

In `@pkg/serializer/reader_test.go`:
- Around line 493-501: The oversize-file test in NewFileReader only checks the
error text, so it can miss a regression in error classification. Update the
assertion in reader_test.go to verify the returned error is classified as
ErrCodeInvalidRequest for the oversize YAML input, using the NewFileReader
result and the existing error handling path, and keep the size-limit message
check only as a secondary validation if needed.

In `@pkg/serializer/reader.go`:
- Around line 166-168: The local read path in reader.go still blocks inside
io.ReadAll without enforcing ctx cancellation, so the timeout guarantee in the
surrounding comment is overstated. Update the read flow in the local-read branch
of the reader logic to keep checking ctx.Done() during the long-running read
instead of only before it starts, using the existing context handling around the
read loop or helper in reader.go. Make sure the symbols involved in this path,
especially the local-read branch around io.ReadAll, actually stop promptly on
deadline rather than relying on a pre-check alone.
- Around line 231-234: The structured errors returned by
NewFileReaderWithContext are being flattened to INTERNAL in the
FromFileWithKubeconfigContext path. Update the error handling in
FromFileWithKubeconfigContext so it preserves existing structured codes like
INVALID_REQUEST, TIMEOUT, and NOT_FOUND instead of always wrapping with
ErrCodeInternal; use errors.PropagateOrWrap(err, errCode, message) at the call
site that handles reader-construction failures. Keep the fix focused on the
reader construction flow and the related error-wrapping logic in
pkg/serializer/reader.go.

In `@tools/cleanup`:
- Around line 202-209: The CRD cleanup path in the cleanup logic is swallowing
API failures by using `|| true`, so timed-out or unhealthy apiserver calls can
still report success while leaving CRDs behind. Update the `all_crds` lookup and
the `kubectl delete` call in the cleanup loop to surface real failures instead
of masking them, while still keeping the bounded `--request-timeout` behavior
and only ignoring genuinely missing resources with `--ignore-not-found`.

In `@validators/conformance/helpers.go`:
- Around line 78-80: The oversize response classification from helpers.httpGet
is being lost because ai_service_metrics_check’s httpGet caller wraps every
non-timeout error as SERVICE_UNAVAILABLE. Update the caller to preserve
structured errors by using errors.PropagateOrWrap instead of unconditionally
wrapping, so ErrCodeInvalidRequest from httpGet can pass through while still
mapping unstructured failures to the service-unavailable path. Focus on the
httpGet handling logic in the AI service metrics check flow.

In `@validators/performance/inference_perf_constraint.go`:
- Around line 1119-1131: The watch-close handling in
inference_perf_constraint.go is dropping the re-check failure and always falling
back to ErrCodeUnavailable. Update the namespace/deployment watch-close branches
so the follow-up Get/readback error is propagated or wrapped instead of ignored,
and only return ErrCodeUnavailable when the confirmation Get succeeds but the
resource is still present/in the wrong state. Use errors.PropagateOrWrap(err,
errCode, message) in the affected reconciliation paths (including the later
deployment check) so existing structured error codes from clients.Get are
preserved.

In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 898-910: The watch-close fallback paths in the NCCL all-reduce BW
constraint reconciliation logic are masking real apiserver/client failures by
returning ErrCodeUnavailable even when the retry read itself fails. Update the
branches around waitCtx.Err, rctClient.Get, and the corresponding List checks to
only return ErrCodeUnavailable after a successful fallback read that still does
not show the expected object, and otherwise propagate the underlying
getErr/listErr using errors.PropagateOrWrap so the original structured error
code and message are preserved.
- Around line 1282-1297: The newestRunnablePod helper currently treats
PodSucceeded as eligible, which can cause waitForPodByLabelSelector to reuse a
completed pod from an earlier run. Update newestRunnablePod to exclude succeeded
pods alongside terminating and failed pods by extending the status filter in the
pod scan loop, keeping the behavior aligned with its non-terminating/runnable
contract.

In `@validators/performance/nccl_preflight_nvreg.go`:
- Around line 272-285: The fallback re-check in the pod watch handling is
dropping the real `Get` failure and always returning a generic unavailable
error. In the `ncclPreflight` watch-closure path, update the logic around
`podsClient.Get` so that a failed re-read returns the original `getErr` using
`errors.PropagateOrWrap` to preserve structured error codes, and only emit
`ErrCodeUnavailable` when the re-read succeeds but still shows a non-terminal
pod phase.

In `@validators/performance/trainer_lifecycle.go`:
- Around line 335-344: The CRD watch-closure fallback in trainer_lifecycle.go is
masking a failed re-check by returning ErrCodeUnavailable even when the
dynamicClient.Resource(crdGVR).Get call fails. Update the recheck logic so a
failed Get propagates or wraps the underlying error using
errors.PropagateOrWrap, and only return the unavailable result when the re-read
succeeds and isCRDEstablished still reports false; keep the existing ctx.Err()
timeout handling and the CRD-established success path intact.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3bcec1b7-d19d-48d1-b967-a2c393159c20

📥 Commits

Reviewing files that changed from the base of the PR and between e2312f1 and b6bac15.

⛔ Files ignored due to path filters (106)
  • go.sum is excluded by !**/*.sum
  • vendor/cloud.google.com/go/longrunning/CHANGES.md is excluded by !vendor/**
  • vendor/cloud.google.com/go/longrunning/autogen/operations_client.go is excluded by !vendor/**
  • vendor/cloud.google.com/go/longrunning/internal/version.go is excluded by !vendor/**
  • vendor/cloud.google.com/go/longrunning/longrunning.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/CHANGELOG.md is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/TROUBLESHOOTING.md is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/authentication_record.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azidentity.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_cli_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_developer_cli_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_pipelines_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/chained_token_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/ci.yml is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_assertion_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_certificate_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_secret_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/confidential_client.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/default_azure_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/developer_credential_util.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/device_code_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/environment_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/errors.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/interactive_browser_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/internal/customtokenproxy/transport.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/logging.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/managed_identity_client.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/managed_identity_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/on_behalf_of_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/public_client.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/username_password_credential.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/version.go is excluded by !vendor/**
  • vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/workload_identity.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-launch-kit/pkg/presets/data/GB300-NVL-NVIDIA-GB300/topology.yaml is excluded by !vendor/**
  • vendor/github.com/aws/smithy-go/CHANGELOG.md is excluded by !vendor/**
  • vendor/github.com/aws/smithy-go/encoding/json/value.go is excluded by !vendor/**
  • vendor/github.com/aws/smithy-go/endpoints/private/rulesfn/uri.go is excluded by !vendor/**
  • vendor/github.com/aws/smithy-go/go_module_metadata.go is excluded by !vendor/**
  • vendor/github.com/aws/smithy-go/transport/http/host.go is excluded by !vendor/**
  • vendor/github.com/coreos/go-oidc/v3/oidc/jwks.go is excluded by !vendor/**
  • vendor/github.com/coreos/go-oidc/v3/oidc/logout.go is excluded by !vendor/**
  • vendor/github.com/coreos/go-oidc/v3/oidc/verify.go is excluded by !vendor/**
  • vendor/github.com/go-openapi/runtime/.golangci.yml is excluded by !vendor/**
  • vendor/github.com/go-openapi/runtime/CONTRIBUTORS.md is excluded by !vendor/**
  • vendor/github.com/go-openapi/runtime/middleware/request.go is excluded by !vendor/**
  • vendor/github.com/google/go-containerregistry/pkg/name/registry.go is excluded by !vendor/**
  • vendor/github.com/google/go-containerregistry/pkg/name/repository.go is excluded by !vendor/**
  • vendor/github.com/jellydator/ttlcache/v3/README.md is excluded by !vendor/**
  • vendor/github.com/jellydator/ttlcache/v3/cache.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/expfmt/expfmt.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/expfmt/text_create.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/expfmt/text_parse.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/labels.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/labelset.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/metric.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/time.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/value.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/value_float.go is excluded by !vendor/**
  • vendor/github.com/prometheus/common/model/value_histogram.go is excluded by !vendor/**
  • vendor/github.com/prometheus/procfs/Makefile.common is excluded by !vendor/**
  • vendor/github.com/prometheus/procfs/README.md is excluded by !vendor/**
  • vendor/github.com/prometheus/procfs/SECURITY.md is excluded by !vendor/**
  • vendor/github.com/prometheus/procfs/crypto.go is excluded by !vendor/**
  • vendor/github.com/prometheus/procfs/mountinfo.go is excluded by !vendor/**
  • vendor/github.com/urfave/cli/v3/command_parse.go is excluded by !vendor/**
  • vendor/github.com/urfave/cli/v3/command_setup.go is excluded by !vendor/**
  • vendor/github.com/urfave/cli/v3/completion.go is excluded by !vendor/**
  • vendor/google.golang.org/api/internal/version.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/alias.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_any.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_default.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_funcs.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_inlined.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_methods.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_time.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/doc.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/errors.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/fields.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/fold.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/intern.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/internal.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonflags/flags.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonopts/options.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/decode.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/encode.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/wire.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/alias.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/decode.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/doc.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/encode.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/errors.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/export.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/options.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/pools.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/quote.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/state.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/token.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/value.go is excluded by !vendor/**
  • vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/options.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (38)
  • .github/workflows/fern-docs-preview-comment.yml
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • .github/workflows/gpu-smoke-test.yaml
  • .github/workflows/kms-ministack-e2e.yaml
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/recipe-evidence.yaml
  • .github/workflows/sigstore-scaffolding-e2e.yaml
  • .github/workflows/vuln-scan-images.yaml
  • .settings.yaml
  • Makefile
  • go.mod
  • pkg/bundler/checksum/checksum.go
  • pkg/bundler/readiness.go
  • pkg/collector/k8s/k8s.go
  • pkg/collector/os/os.go
  • pkg/collector/os/sysctl.go
  • pkg/errors/errors.go
  • pkg/errors/errors_test.go
  • pkg/evidence/attestation/ctrfpath_test.go
  • pkg/evidence/attestation/types.go
  • pkg/evidence/verifier/inventory.go
  • pkg/evidence/verifier/inventory_phase_test.go
  • pkg/evidence/verifier/verify.go
  • pkg/k8s/pod/job.go
  • pkg/k8s/pod/wait.go
  • pkg/serializer/reader.go
  • pkg/serializer/reader_test.go
  • pkg/snapshotter/agent.go
  • pkg/validator/validator.go
  • tools/cleanup
  • tools/install-rc
  • validators/conformance/helpers.go
  • validators/performance/inference_perf_constraint.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/newest_runnable_pod_test.go
  • validators/performance/trainer_lifecycle.go

Comment thread .github/workflows/gpu-h100-inference-test.yaml
Comment thread .github/workflows/gpu-h100-training-test.yaml
Comment thread .github/workflows/gpu-smoke-test.yaml
Comment thread .github/workflows/kwok-recipes.yaml
Comment thread .github/workflows/recipe-evidence.yaml
Comment thread validators/performance/inference_perf_constraint.go
Comment thread validators/performance/nccl_all_reduce_bw_constraint.go Outdated
Comment thread validators/performance/nccl_all_reduce_bw_constraint.go
Comment thread validators/performance/nccl_preflight_nvreg.go Outdated
Comment thread validators/performance/trainer_lifecycle.go Outdated
mchmarny added 3 commits June 28, 2026 07:42
The module path is github.com/nvidia/k8s-launch-kit (lowercase). A stale
GB300-NVL topology.yaml lingered under the uppercase NVIDIA/ path from a
prior casing; macOS's case-insensitive FS hid it from 'go mod vendor', but
CI's case-sensitive linux runner flagged vendor/ as out of sync. The pinned
version no longer ships this preset, so remove the orphan.

Signed-off-by: Mark Chmarny <[email protected]>
Address CodeRabbit review:
- add 'vendor/**' (alongside go.mod/go.sum) to every dependency-sensitive
  path filter — this repo vendors deps, so a vendor-only sync changes the
  binary and must not bypass the gate (the gap this PR itself hit)
- gpu-smoke: add go.mod/go.sum/vendor/** (had none)
- Makefile: bound the remote manifest fetch with a max-time download piped
  into kubectl apply (--request-timeout only bounds the apiserver call)
- tools/cleanup: warn on a timed-out CRD delete instead of swallowing it
- vuln-scan: document why the gate build needs packages: write

Signed-off-by: Mark Chmarny <[email protected]>
Address CodeRabbit review:
- serializer: PropagateOrWrap in FromFile so NotFound/InvalidRequest/Timeout
  survive; read local files under ctx (goroutine) so a hung FS honors the
  deadline, not just a pre-check
- conformance: propagate the oversize InvalidRequest code from httpGet in the
  Prometheus/DCGM callers instead of relabeling it Unavailable
- evidence verifier: reject unknown predicate phase keys (fail closed)
- checksum: re-check ctx.Err() inside the per-file loop
- sysctl: preserve the callback's timeout code at the function boundary
- validators: when a watch-closure re-Get itself fails, surface that error
  (Internal) instead of masking it as 'closed before observed'; exclude
  Succeeded pods from newestRunnablePod
- tests: assert error codes (not messages) and add mixed-code IsTransient cases

Signed-off-by: Mark Chmarny <[email protected]>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/bundler/checksum/checksum.go`:
- Around line 54-58: The checksum generation flow in GenerateChecksums still has
a small false-pass window because cancellation can happen after the loop ends
but before checksums.txt is written. Add one more ctx.Err() check in
GenerateChecksums immediately after the file-hashing loop and before the final
write so a late cancellation returns the same timeout error instead of
succeeding.

In `@validators/performance/inference_perf_constraint.go`:
- Around line 1245-1260: The fallback re-Get path in the DynamoGraphDeployment
watch handling can misclassify a request timeout as an internal error when
waitCtx expires between the context check and the Get call. Update the switch in
the recheck logic to explicitly detect context deadline exceeded or canceled
from getErr before the default branch, and return ErrCodeTimeout instead of
wrapping it as ErrCodeInternal. Use the existing watch/recheck block around
isDynamoDeploymentReady, apierrors.IsNotFound, and errors.Wrap to keep the
behavior consistent with the other DynamoGraphDeployment readiness paths.

In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 905-915: The fallback re-check in the IMEX ResourceClaimTemplate
wait path is misclassifying context/deadline failures as internal errors; update
the switch in the reconciliation logic around the rctClient.Get fallback so
timeout-like errors from that Get are returned as the timeout/unavailable class
instead of ErrCodeInternal. Use the existing IMEX ResourceClaimTemplate
watch-close branch and its aicrErrors.Wrap/aicrErrors.New handling to preserve
the original timeout classification when waitCtx expires before the re-check
completes.

In `@validators/performance/nccl_preflight_nvreg.go`:
- Around line 279-292: The fallback Get in the pod watch completion path is
misclassifying deadline races as internal errors instead of timeouts. In the
branch that handles current, getErr := podsClient.Get(...) inside the wait loop,
re-check waitCtx.Err() before the default error return and map context deadline
or cancellation to aicrErrors.ErrCodeTimeout rather than wrapping it as
ErrCodeInternal. Keep the existing handling in the watch fallback logic around
podsClient.Get, apierrors.IsNotFound, and the current.Status.Phase checks.

In `@validators/performance/trainer_lifecycle.go`:
- Around line 340-351: The new re-check failure codes are being lost because
waitForTrainerCRDsEstablished still wraps every waitForCRDEstablished error as
ErrCodeTimeout, which hides the ErrCodeUnavailable and ErrCodeInternal results
added in the CRD re-check path. Update the error handling in
waitForTrainerCRDsEstablished to use errors.PropagateOrWrap(err, errCode,
message) so existing structured codes from waitForCRDEstablished are preserved,
while only assigning ErrCodeTimeout when the failure is truly a timeout.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4a65adab-7c42-44be-9c3d-425ab8bd9a4f

📥 Commits

Reviewing files that changed from the base of the PR and between b6bac15 and 9c5e920.

📒 Files selected for processing (22)
  • .github/workflows/gpu-h100-inference-test.yaml
  • .github/workflows/gpu-h100-training-test.yaml
  • .github/workflows/gpu-smoke-test.yaml
  • .github/workflows/kms-ministack-e2e.yaml
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/recipe-evidence.yaml
  • .github/workflows/sigstore-scaffolding-e2e.yaml
  • .github/workflows/vuln-scan-images.yaml
  • Makefile
  • pkg/bundler/checksum/checksum.go
  • pkg/collector/os/sysctl.go
  • pkg/errors/errors_test.go
  • pkg/evidence/verifier/inventory.go
  • pkg/serializer/reader.go
  • pkg/serializer/reader_test.go
  • tools/cleanup
  • validators/conformance/accelerator_metrics_check.go
  • validators/conformance/ai_service_metrics_check.go
  • validators/performance/inference_perf_constraint.go
  • validators/performance/nccl_all_reduce_bw_constraint.go
  • validators/performance/nccl_preflight_nvreg.go
  • validators/performance/trainer_lifecycle.go

Comment thread pkg/bundler/checksum/checksum.go
Comment thread validators/performance/inference_perf_constraint.go
Comment thread validators/performance/nccl_all_reduce_bw_constraint.go
Comment thread validators/performance/nccl_preflight_nvreg.go
Comment thread validators/performance/trainer_lifecycle.go
… (review)

Address CodeRabbit follow-ups: when a watch-closure fallback Get itself races
the deadline, getErr is a context error — return ErrCodeTimeout (via
IsTransient) instead of masking it as ErrCodeInternal. Applied to the Dynamo,
IMEX RCT, preflight-pod, and Trainer-CRD re-checks. Also:
- trainer: caller now PropagateOrWraps so the re-check's code isn't collapsed
  back to Timeout
- checksum: re-check ctx.Err() after the hash loop, before writing checksums.txt

Signed-off-by: Mark Chmarny <[email protected]>
@mchmarny mchmarny enabled auto-merge (squash) June 28, 2026 15:30
@mchmarny mchmarny merged commit a74fe77 into main Jun 28, 2026
159 checks passed
@mchmarny mchmarny deleted the fix/tech-debt-sweep branch June 28, 2026 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/bundler area/ci area/collector area/validator dependencies Pull requests that update a dependency file size/XL theme/validation Constraint evaluation, health checks, and conformance evidence

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants