fix: harden timeout/false-PASS handling, evidence integrity, CI gaps#1511
Conversation
…assification Signed-off-by: Mark Chmarny <[email protected]>
… partial success - sysctl: skip an unreadable entry instead of aborting the whole walk - os collector: surface ctx timeout rather than a partial-but-nil-error measurement - k8s collectSafe: propagate timeout/cancellation, swallow only deterministic failures - snapshotter: classify wait deadline as ErrCodeTimeout; check Stdout.Write error Signed-off-by: Mark Chmarny <[email protected]>
…age tag - validator prepareCluster: PropagateOrWrap preserves inner codes (kubeconfig errors stay deterministic) - checksum: ctx cancellation is ErrCodeTimeout, matching every other ctx.Err site - readiness gateImage: fall back to :dev for snapshot (-next) builds to avoid ImagePullBackOff Signed-off-by: Mark Chmarny <[email protected]>
The predicate's per-phase CTRFDigest was never cross-checked, so a predicate could claim a digest disagreeing with the committed report and still pass. Add CheckPhaseDigests (fail-closed) wired into the inventory step, plus the CTRFRelPath helper and tests. Signed-off-by: Mark Chmarny <[email protected]>
…nt truncation - serializer NewFileReader: read bounded content fully so the size cap is actually enforced (reject oversize) and honor ctx on the local read path - conformance httpGet: reject oversized responses rather than silently truncating Prometheus/DCGM scrapes into partial/missing-metric data Signed-off-by: Mark Chmarny <[email protected]>
…ailures A watch channel closing without ctx cancellation (apiserver restart, LB drop) no longer fails a healthy run. Re-Get current state before declaring failure across the shared pod/job waiters (with ResourceVersion parity) and the five performance-validator watch loops; pick the youngest runnable pod on re-List. Signed-off-by: Mark Chmarny <[email protected]>
- path filters: add Go consumer paths (pkg/trust, pkg/validator, pkg/recipe, pkg/cli, cmd/aicr, go.mod/sum) to sigstore/kms/kwok/recipe-evidence/gpu lanes so regressions actually trigger the relevant e2e - vuln-scan: build, manifest, scan, and clean up the aicr-gate image (was only built on release tags, never scanned) - fern preview-comment: pin fern-api version from a trusted base checkout - bound kubectl/curl calls with request-timeout/max-time Signed-off-by: Mark Chmarny <[email protected]>
Signed-off-by: Mark Chmarny <[email protected]>
Recipe evidence checkNo leaf overlays affected by this PR. This gate is warning-only and never blocks merge. |
Coverage Report ✅
Coverage BadgeMerging this branch changes the coverage (6 decrease, 2 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThis PR updates transient error classification and propagation across collectors, snapshotter, bundler, serializer, and validator code. It adds CTRF path helpers and per-phase CTRF digest verification to evidence verification and wires that check into Estimated code review effort🎯 5 (Critical) | ⏱️ ~90 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 22
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/gpu-h100-inference-test.yaml:
- Around line 75-82: The GPU H100 inference workflow trigger list is missing
vendor changes, so updates to vendored Go dependencies can bypass the
conformance run. Add vendor/** to the path filters alongside go.mod/go.sum in
the workflow condition for the aicr-related sources so any dependency vendoring
that can affect the binary still triggers this gate.
In @.github/workflows/gpu-h100-training-test.yaml:
- Around line 75-82: The GPU training workflow path filter is missing vendored
dependency changes, so a vendor-only update can bypass the conformance run.
Update the path list in the workflow to also include vendor/** alongside go.mod
and go.sum, keeping the change aligned with the aicr validation and recipe
bundling triggers.
In @.github/workflows/gpu-smoke-test.yaml:
- Around line 64-68: The GPU smoke workflow path filter still misses
dependency-only changes, so update the trigger logic in the workflow to also
match go.mod, go.sum, and vendor/** alongside the existing pkg/validator,
pkg/cli, and cmd/aicr paths. This ensures dependency updates that change the
built aicr binary still run the smoke job; adjust the same paths block in the
workflow YAML so these dependency files are included.
In @.github/workflows/kwok-recipes.yaml:
- Around line 25-32: Update the workflow trigger filters in the kwok-recipes
workflow so both the push and pull_request path lists include vendor/**
alongside go.mod and go.sum. The issue is that vendor-only dependency changes
still bypass this workflow, so mirror the vendored-dependency trigger in both
matching blocks that cover pkg/recipe, pkg/bundler, pkg/cli, cmd/aicr, go.mod,
and go.sum.
In @.github/workflows/recipe-evidence.yaml:
- Around line 46-54: The workflow path filter for the evidence gate is missing
vendor changes, so updates to dependencies can bypass this check. In the
.github/workflows/recipe-evidence.yaml trigger list, add the vendor directory
alongside go.mod and go.sum so changes affecting bundled dependencies rerun the
gate. Use the existing aicr evidence verify gate entries as the anchor when
updating the paths.
In @.github/workflows/sigstore-scaffolding-e2e.yaml:
- Around line 44-48: The Sigstore e2e path filter is missing vendored dependency
changes, so updates under vendor/** can still skip coverage. Update the
workflow’s changed-path list in the Sigstore scaffolding e2e trigger to include
vendor/** alongside go.mod and go.sum, using the existing path patterns in the
workflow file so dependency-only and vendored-tree changes both trigger the job.
In @.github/workflows/vuln-scan-images.yaml:
- Around line 181-183: The workflow’s permissions block grants packages: write
without explanation, which triggers the undocumented-permissions warning. Add a
short inline rationale comment near the permissions block in the
vuln-scan-images workflow, and keep the existing contents: read / packages:
write settings unchanged. Reference the workflow permissions section so it’s
easy to locate even if the YAML shifts.
In `@Makefile`:
- Around line 713-714: The KWOK install steps in the Makefile only time out the
kubectl API call, not the remote YAML fetch. Update the two kubectl apply
invocations in the KWOK setup target to fetch each manifest with an explicit
client-side timeout first, then pipe the downloaded content into kubectl apply
so the network download itself is bounded. Use the existing KWOK_VERSION-based
URLs and keep the change localized to the KWOK manifest download commands.
In `@pkg/bundler/checksum/checksum.go`:
- Around line 46-48: GenerateChecksums only checks ctx.Err() before starting, so
cancellation during the file-hashing loop is ignored. Update GenerateChecksums
to re-check the context inside the checksum-processing loop (and before writing
checksums.txt), using ctx.Err() or ctx.Done() so the function exits promptly on
timeout. Keep the fix localized around GenerateChecksums and the per-file
hashing/write path.
In `@pkg/collector/os/sysctl.go`:
- Around line 49-63: Preserve the callback’s timeout classification at the
collectSysctl boundary: the walk callback now returns a structured
ErrCodeTimeout, but collectSysctl still converts any WalkDir error into
ErrCodeInternal and loses that signal. Update the error handling around the
WalkDir call in collectSysctl to use errors.PropagateOrWrap(err, errCode,
message) (or equivalent structured propagation) so a timeout from the callback
remains ErrCodeTimeout while other walk failures are still wrapped
appropriately. Refer to collectSysctl and the WalkDir callback logic when
applying the change.
In `@pkg/errors/errors_test.go`:
- Around line 24-39: The IsTransient test table in TestIsTransient is missing
coverage for a mixed unwrap chain where a non-timeout outer error wraps a
timeout inner error. Add a case using Wrap with ErrCodeInternal on the outside
and New(ErrCodeTimeout, ...) inside, and assert IsTransient returns true so the
test explicitly covers timeout detection anywhere in the unwrap chain.
In `@pkg/evidence/verifier/inventory.go`:
- Around line 141-145: The inventory verification logic in Verify currently only
iterates attestation.AllPhases and skips missing entries, which allows unknown
or misspelled keys in pred.Phases to pass unchecked. Update the phase validation
in pkg/evidence/verifier/inventory.go so Verify explicitly rejects any predicate
phase key not present in attestation.AllPhases, rather than silently continuing.
Use the existing pred.Phases, attestation.AllPhases, and Verify flow to add a
fail-closed check before concluding inventory success.
In `@pkg/serializer/reader_test.go`:
- Around line 493-501: The oversize-file test in NewFileReader only checks the
error text, so it can miss a regression in error classification. Update the
assertion in reader_test.go to verify the returned error is classified as
ErrCodeInvalidRequest for the oversize YAML input, using the NewFileReader
result and the existing error handling path, and keep the size-limit message
check only as a secondary validation if needed.
In `@pkg/serializer/reader.go`:
- Around line 166-168: The local read path in reader.go still blocks inside
io.ReadAll without enforcing ctx cancellation, so the timeout guarantee in the
surrounding comment is overstated. Update the read flow in the local-read branch
of the reader logic to keep checking ctx.Done() during the long-running read
instead of only before it starts, using the existing context handling around the
read loop or helper in reader.go. Make sure the symbols involved in this path,
especially the local-read branch around io.ReadAll, actually stop promptly on
deadline rather than relying on a pre-check alone.
- Around line 231-234: The structured errors returned by
NewFileReaderWithContext are being flattened to INTERNAL in the
FromFileWithKubeconfigContext path. Update the error handling in
FromFileWithKubeconfigContext so it preserves existing structured codes like
INVALID_REQUEST, TIMEOUT, and NOT_FOUND instead of always wrapping with
ErrCodeInternal; use errors.PropagateOrWrap(err, errCode, message) at the call
site that handles reader-construction failures. Keep the fix focused on the
reader construction flow and the related error-wrapping logic in
pkg/serializer/reader.go.
In `@tools/cleanup`:
- Around line 202-209: The CRD cleanup path in the cleanup logic is swallowing
API failures by using `|| true`, so timed-out or unhealthy apiserver calls can
still report success while leaving CRDs behind. Update the `all_crds` lookup and
the `kubectl delete` call in the cleanup loop to surface real failures instead
of masking them, while still keeping the bounded `--request-timeout` behavior
and only ignoring genuinely missing resources with `--ignore-not-found`.
In `@validators/conformance/helpers.go`:
- Around line 78-80: The oversize response classification from helpers.httpGet
is being lost because ai_service_metrics_check’s httpGet caller wraps every
non-timeout error as SERVICE_UNAVAILABLE. Update the caller to preserve
structured errors by using errors.PropagateOrWrap instead of unconditionally
wrapping, so ErrCodeInvalidRequest from httpGet can pass through while still
mapping unstructured failures to the service-unavailable path. Focus on the
httpGet handling logic in the AI service metrics check flow.
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1119-1131: The watch-close handling in
inference_perf_constraint.go is dropping the re-check failure and always falling
back to ErrCodeUnavailable. Update the namespace/deployment watch-close branches
so the follow-up Get/readback error is propagated or wrapped instead of ignored,
and only return ErrCodeUnavailable when the confirmation Get succeeds but the
resource is still present/in the wrong state. Use errors.PropagateOrWrap(err,
errCode, message) in the affected reconciliation paths (including the later
deployment check) so existing structured error codes from clients.Get are
preserved.
In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 898-910: The watch-close fallback paths in the NCCL all-reduce BW
constraint reconciliation logic are masking real apiserver/client failures by
returning ErrCodeUnavailable even when the retry read itself fails. Update the
branches around waitCtx.Err, rctClient.Get, and the corresponding List checks to
only return ErrCodeUnavailable after a successful fallback read that still does
not show the expected object, and otherwise propagate the underlying
getErr/listErr using errors.PropagateOrWrap so the original structured error
code and message are preserved.
- Around line 1282-1297: The newestRunnablePod helper currently treats
PodSucceeded as eligible, which can cause waitForPodByLabelSelector to reuse a
completed pod from an earlier run. Update newestRunnablePod to exclude succeeded
pods alongside terminating and failed pods by extending the status filter in the
pod scan loop, keeping the behavior aligned with its non-terminating/runnable
contract.
In `@validators/performance/nccl_preflight_nvreg.go`:
- Around line 272-285: The fallback re-check in the pod watch handling is
dropping the real `Get` failure and always returning a generic unavailable
error. In the `ncclPreflight` watch-closure path, update the logic around
`podsClient.Get` so that a failed re-read returns the original `getErr` using
`errors.PropagateOrWrap` to preserve structured error codes, and only emit
`ErrCodeUnavailable` when the re-read succeeds but still shows a non-terminal
pod phase.
In `@validators/performance/trainer_lifecycle.go`:
- Around line 335-344: The CRD watch-closure fallback in trainer_lifecycle.go is
masking a failed re-check by returning ErrCodeUnavailable even when the
dynamicClient.Resource(crdGVR).Get call fails. Update the recheck logic so a
failed Get propagates or wraps the underlying error using
errors.PropagateOrWrap, and only return the unavailable result when the re-read
succeeds and isCRDEstablished still reports false; keep the existing ctx.Err()
timeout handling and the CRD-established success path intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3bcec1b7-d19d-48d1-b967-a2c393159c20
⛔ Files ignored due to path filters (106)
go.sumis excluded by!**/*.sumvendor/cloud.google.com/go/longrunning/CHANGES.mdis excluded by!vendor/**vendor/cloud.google.com/go/longrunning/autogen/operations_client.gois excluded by!vendor/**vendor/cloud.google.com/go/longrunning/internal/version.gois excluded by!vendor/**vendor/cloud.google.com/go/longrunning/longrunning.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/CHANGELOG.mdis excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/TROUBLESHOOTING.mdis excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/authentication_record.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azidentity.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_cli_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_developer_cli_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/azure_pipelines_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/chained_token_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/ci.ymlis excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_assertion_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_certificate_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/client_secret_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/confidential_client.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/default_azure_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/developer_credential_util.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/device_code_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/environment_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/errors.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/interactive_browser_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/internal/customtokenproxy/transport.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/logging.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/managed_identity_client.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/managed_identity_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/on_behalf_of_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/public_client.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/username_password_credential.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/version.gois excluded by!vendor/**vendor/github.com/Azure/azure-sdk-for-go/sdk/azidentity/workload_identity.gois excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-launch-kit/pkg/presets/data/GB300-NVL-NVIDIA-GB300/topology.yamlis excluded by!vendor/**vendor/github.com/aws/smithy-go/CHANGELOG.mdis excluded by!vendor/**vendor/github.com/aws/smithy-go/encoding/json/value.gois excluded by!vendor/**vendor/github.com/aws/smithy-go/endpoints/private/rulesfn/uri.gois excluded by!vendor/**vendor/github.com/aws/smithy-go/go_module_metadata.gois excluded by!vendor/**vendor/github.com/aws/smithy-go/transport/http/host.gois excluded by!vendor/**vendor/github.com/coreos/go-oidc/v3/oidc/jwks.gois excluded by!vendor/**vendor/github.com/coreos/go-oidc/v3/oidc/logout.gois excluded by!vendor/**vendor/github.com/coreos/go-oidc/v3/oidc/verify.gois excluded by!vendor/**vendor/github.com/go-openapi/runtime/.golangci.ymlis excluded by!vendor/**vendor/github.com/go-openapi/runtime/CONTRIBUTORS.mdis excluded by!vendor/**vendor/github.com/go-openapi/runtime/middleware/request.gois excluded by!vendor/**vendor/github.com/google/go-containerregistry/pkg/name/registry.gois excluded by!vendor/**vendor/github.com/google/go-containerregistry/pkg/name/repository.gois excluded by!vendor/**vendor/github.com/jellydator/ttlcache/v3/README.mdis excluded by!vendor/**vendor/github.com/jellydator/ttlcache/v3/cache.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/prometheus/common/expfmt/expfmt.gois excluded by!vendor/**vendor/github.com/prometheus/common/expfmt/text_create.gois excluded by!vendor/**vendor/github.com/prometheus/common/expfmt/text_parse.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/labels.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/labelset.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/metric.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/time.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/value.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/value_float.gois excluded by!vendor/**vendor/github.com/prometheus/common/model/value_histogram.gois excluded by!vendor/**vendor/github.com/prometheus/procfs/Makefile.commonis excluded by!vendor/**vendor/github.com/prometheus/procfs/README.mdis excluded by!vendor/**vendor/github.com/prometheus/procfs/SECURITY.mdis excluded by!vendor/**vendor/github.com/prometheus/procfs/crypto.gois excluded by!vendor/**vendor/github.com/prometheus/procfs/mountinfo.gois excluded by!vendor/**vendor/github.com/urfave/cli/v3/command_parse.gois excluded by!vendor/**vendor/github.com/urfave/cli/v3/command_setup.gois excluded by!vendor/**vendor/github.com/urfave/cli/v3/completion.gois excluded by!vendor/**vendor/google.golang.org/api/internal/version.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/alias.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_any.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_default.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_funcs.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_inlined.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_methods.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/arshal_time.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/doc.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/errors.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/fields.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/fold.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/intern.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/internal.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonflags/flags.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonopts/options.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/decode.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/encode.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/internal/jsonwire/wire.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/alias.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/decode.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/doc.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/encode.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/errors.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/export.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/options.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/pools.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/quote.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/state.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/token.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/jsontext/value.gois excluded by!vendor/**vendor/k8s.io/kube-openapi/pkg/internal/third_party/go-json-experiment/json/options.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**
📒 Files selected for processing (38)
.github/workflows/fern-docs-preview-comment.yml.github/workflows/gpu-h100-inference-test.yaml.github/workflows/gpu-h100-training-test.yaml.github/workflows/gpu-smoke-test.yaml.github/workflows/kms-ministack-e2e.yaml.github/workflows/kwok-recipes.yaml.github/workflows/recipe-evidence.yaml.github/workflows/sigstore-scaffolding-e2e.yaml.github/workflows/vuln-scan-images.yaml.settings.yamlMakefilego.modpkg/bundler/checksum/checksum.gopkg/bundler/readiness.gopkg/collector/k8s/k8s.gopkg/collector/os/os.gopkg/collector/os/sysctl.gopkg/errors/errors.gopkg/errors/errors_test.gopkg/evidence/attestation/ctrfpath_test.gopkg/evidence/attestation/types.gopkg/evidence/verifier/inventory.gopkg/evidence/verifier/inventory_phase_test.gopkg/evidence/verifier/verify.gopkg/k8s/pod/job.gopkg/k8s/pod/wait.gopkg/serializer/reader.gopkg/serializer/reader_test.gopkg/snapshotter/agent.gopkg/validator/validator.gotools/cleanuptools/install-rcvalidators/conformance/helpers.govalidators/performance/inference_perf_constraint.govalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_preflight_nvreg.govalidators/performance/newest_runnable_pod_test.govalidators/performance/trainer_lifecycle.go
The module path is github.com/nvidia/k8s-launch-kit (lowercase). A stale GB300-NVL topology.yaml lingered under the uppercase NVIDIA/ path from a prior casing; macOS's case-insensitive FS hid it from 'go mod vendor', but CI's case-sensitive linux runner flagged vendor/ as out of sync. The pinned version no longer ships this preset, so remove the orphan. Signed-off-by: Mark Chmarny <[email protected]>
Address CodeRabbit review: - add 'vendor/**' (alongside go.mod/go.sum) to every dependency-sensitive path filter — this repo vendors deps, so a vendor-only sync changes the binary and must not bypass the gate (the gap this PR itself hit) - gpu-smoke: add go.mod/go.sum/vendor/** (had none) - Makefile: bound the remote manifest fetch with a max-time download piped into kubectl apply (--request-timeout only bounds the apiserver call) - tools/cleanup: warn on a timed-out CRD delete instead of swallowing it - vuln-scan: document why the gate build needs packages: write Signed-off-by: Mark Chmarny <[email protected]>
Address CodeRabbit review: - serializer: PropagateOrWrap in FromFile so NotFound/InvalidRequest/Timeout survive; read local files under ctx (goroutine) so a hung FS honors the deadline, not just a pre-check - conformance: propagate the oversize InvalidRequest code from httpGet in the Prometheus/DCGM callers instead of relabeling it Unavailable - evidence verifier: reject unknown predicate phase keys (fail closed) - checksum: re-check ctx.Err() inside the per-file loop - sysctl: preserve the callback's timeout code at the function boundary - validators: when a watch-closure re-Get itself fails, surface that error (Internal) instead of masking it as 'closed before observed'; exclude Succeeded pods from newestRunnablePod - tests: assert error codes (not messages) and add mixed-code IsTransient cases Signed-off-by: Mark Chmarny <[email protected]>
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/bundler/checksum/checksum.go`:
- Around line 54-58: The checksum generation flow in GenerateChecksums still has
a small false-pass window because cancellation can happen after the loop ends
but before checksums.txt is written. Add one more ctx.Err() check in
GenerateChecksums immediately after the file-hashing loop and before the final
write so a late cancellation returns the same timeout error instead of
succeeding.
In `@validators/performance/inference_perf_constraint.go`:
- Around line 1245-1260: The fallback re-Get path in the DynamoGraphDeployment
watch handling can misclassify a request timeout as an internal error when
waitCtx expires between the context check and the Get call. Update the switch in
the recheck logic to explicitly detect context deadline exceeded or canceled
from getErr before the default branch, and return ErrCodeTimeout instead of
wrapping it as ErrCodeInternal. Use the existing watch/recheck block around
isDynamoDeploymentReady, apierrors.IsNotFound, and errors.Wrap to keep the
behavior consistent with the other DynamoGraphDeployment readiness paths.
In `@validators/performance/nccl_all_reduce_bw_constraint.go`:
- Around line 905-915: The fallback re-check in the IMEX ResourceClaimTemplate
wait path is misclassifying context/deadline failures as internal errors; update
the switch in the reconciliation logic around the rctClient.Get fallback so
timeout-like errors from that Get are returned as the timeout/unavailable class
instead of ErrCodeInternal. Use the existing IMEX ResourceClaimTemplate
watch-close branch and its aicrErrors.Wrap/aicrErrors.New handling to preserve
the original timeout classification when waitCtx expires before the re-check
completes.
In `@validators/performance/nccl_preflight_nvreg.go`:
- Around line 279-292: The fallback Get in the pod watch completion path is
misclassifying deadline races as internal errors instead of timeouts. In the
branch that handles current, getErr := podsClient.Get(...) inside the wait loop,
re-check waitCtx.Err() before the default error return and map context deadline
or cancellation to aicrErrors.ErrCodeTimeout rather than wrapping it as
ErrCodeInternal. Keep the existing handling in the watch fallback logic around
podsClient.Get, apierrors.IsNotFound, and the current.Status.Phase checks.
In `@validators/performance/trainer_lifecycle.go`:
- Around line 340-351: The new re-check failure codes are being lost because
waitForTrainerCRDsEstablished still wraps every waitForCRDEstablished error as
ErrCodeTimeout, which hides the ErrCodeUnavailable and ErrCodeInternal results
added in the CRD re-check path. Update the error handling in
waitForTrainerCRDsEstablished to use errors.PropagateOrWrap(err, errCode,
message) so existing structured codes from waitForCRDEstablished are preserved,
while only assigning ErrCodeTimeout when the failure is truly a timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4a65adab-7c42-44be-9c3d-425ab8bd9a4f
📒 Files selected for processing (22)
.github/workflows/gpu-h100-inference-test.yaml.github/workflows/gpu-h100-training-test.yaml.github/workflows/gpu-smoke-test.yaml.github/workflows/kms-ministack-e2e.yaml.github/workflows/kwok-recipes.yaml.github/workflows/recipe-evidence.yaml.github/workflows/sigstore-scaffolding-e2e.yaml.github/workflows/vuln-scan-images.yamlMakefilepkg/bundler/checksum/checksum.gopkg/collector/os/sysctl.gopkg/errors/errors_test.gopkg/evidence/verifier/inventory.gopkg/serializer/reader.gopkg/serializer/reader_test.gotools/cleanupvalidators/conformance/accelerator_metrics_check.govalidators/conformance/ai_service_metrics_check.govalidators/performance/inference_perf_constraint.govalidators/performance/nccl_all_reduce_bw_constraint.govalidators/performance/nccl_preflight_nvreg.govalidators/performance/trainer_lifecycle.go
… (review) Address CodeRabbit follow-ups: when a watch-closure fallback Get itself races the deadline, getErr is a context error — return ErrCodeTimeout (via IsTransient) instead of masking it as ErrCodeInternal. Applied to the Dynamo, IMEX RCT, preflight-pod, and Trainer-CRD re-checks. Also: - trainer: caller now PropagateOrWraps so the re-check's code isn't collapsed back to Timeout - checksum: re-check ctx.Err() after the hash loop, before writing checksums.txt Signed-off-by: Mark Chmarny <[email protected]>
Summary
Tech-debt hardening sweep: closes "false-PASS" / silent-failure paths across collectors, validators, and the evidence verifier; tightens transient-vs-deterministic error classification; fixes CI path-filter gaps; and bumps tooling + Go dependencies.
Motivation / Context
A review pass (AICR Go review lens) surfaced a cluster of fail-open and silent-truncation defects: gates/collectors that report success without verifying anything, timeouts misclassified as deterministic failures, watch-channel closures failing healthy runs, unbounded/silently-truncated reads, and e2e lanes whose
paths:filters don't fire when the relevant Go code changes. None are behavior changes on the happy path; each makes a failure mode fail loud/closed instead of silent.Fixes: N/A
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)pkg/evidence,pkg/serializer,validators/*, CI workflowsImplementation Notes
Organized as logically-scoped commits:
pkg/errors— newIsTransienthelper (context cancel/deadline orErrCodeTimeout); dedupes the transient check repeated across collectors/snapshotter.k8scollectSafesurface ctx timeout instead of returning an empty-but-"successful" measurement; snapshotter classifies wait-deadline asErrCodeTimeoutand checks theStdout.Writeerror.prepareClusterusesPropagateOrWrapso coded inner errors (e.g. bad kubeconfig) stay deterministic; checksum ctx-cancel isErrCodeTimeout;gateImagefalls back to:devfor snapshot (-next) builds to avoidImagePullBackOff.CheckPhaseDigestsverifies each predicateCTRFDigestagainst the on-diskctrf/<phase>.json(fail-closed), wired into the inventory step; addsCTRFRelPath.httpGetrejects oversized scrapes.ResourceVersionparity) and the five performance-validator watch loops.paths:filters; build + scan + clean up theaicr-gateimage (previously only built on release tags, never vuln-scanned); pinfern-apifrom a trusted base checkout; boundkubectl/curlcalls.kov0.18.1→v0.19.0,syftv1.45.1→v1.46.0; upgrade Go dependencies (make upgrade).Deferred (intentionally out of scope): two
pkg/cli/validate.go"logic above the facade" items (CTRF merge / recipe-evidence emission) that would change thepkg/client/v1.ValidateStatepublic API — better isolated to their own PR.Testing
make qualifypasses end-to-end with network access (test-coverage, lint, e2e, grype: no vulnerabilities, license headers). Re-run after the Go dependency upgrade with no regressions. New unit tests added forIsTransient,CheckPhaseDigests,CTRFRelPath, serializer oversize rejection, andnewestRunnablePod.Risk Assessment
make qualify.Rollout notes: No migration or flags. Backwards compatible. The
gateImage-nextfallback and CI path-filter additions only affect build/CI selection, not produced artifacts.Checklist
make testwith-race)make lint)git commit -S)