Skip to content

feat(testgrid): add testgrid-publish CLI tool (TG2)#1447

Merged
mchmarny merged 17 commits into
mainfrom
feat/tg2-testgrid-publish
Jun 26, 2026
Merged

feat(testgrid): add testgrid-publish CLI tool (TG2)#1447
mchmarny merged 17 commits into
mainfrom
feat/tg2-testgrid-publish

Conversation

@srao-nv

@srao-nv srao-nv commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds `tools/testgrid-publish`, a new CLI tool that converts AICR evidence OCI bundles into TestGrid-compatible GCS feed entries. Also wires the binary into `.goreleaser.yaml` and adds `make testgrid-publish` / `make testgrid-publish-dryrun` targets.

Motivation / Context

Implements the TG2 publish path from the AICR TestGrid epic.

Fixes: #1267
Related: #1263, #1409

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling

Component(s) Affected

  • Other: tools/testgrid-publish (new internal ops tool)

Implementation Notes

Pipeline (--bundle <oci-ref> or --bundle-dir <path> → GCS):

bundle dir
  ├── recipe.yaml          → CoordinateFor() → GCS group/dashboard/tab
  ├── statement.intoto.json → AttestedAt, k8s_version, aicr_version
  └── ctrf/*.json           → jUnit XML
                                   ↓
  gs://<bucket>/groups/<service>/<accel-os>/<intent[-platform]>/<build-id>/
    started.json   (written first)
    artifacts/junit.xml
    finished.json  (written last — triggers TestGrid updater)

Key decisions:

  • Upload order is strict (started → junit → finished); TestGrid updater ingests a build only after finished.json lands
  • --bundle-dir flag allows local dry-runs without an OCI registry (--dry-run prints planned GCS paths)
  • Criteria fields normalized (lowercase + trim) so EKS and eks both map to the correct GCS group
  • Zero-test bundles rejected — crashed test runners write empty CTRF and would otherwise publish as SUCCESS
  • CoordinateFor now delegates to pkg/recipe.CoordinateFor (feat(recipe): shared coordinate mapping + taxonomy spec (TG6) #1409 merged); gains /-in-dimension rejection guard
  • Signer extraction (signer_identity, signer_issuer) is a tracked stub — DSSE envelope parsing deferred to TG2 — Publish path (tools/testgrid-publish: OCI bundle → CTRF → jUnit → GCS feed) #1267 follow-up
  • Bounded file reads (readBoundedFile) prevent OOM from hostile/corrupt bundle files

Not included in this PR (follow-ups):

  • TG5: GitHub Actions workflow that calls testgrid-publish after aicr validate
  • Cross-project IAM binding for GCS bucket aicr-testgrid-staging

Testing

# Unit tests — all pass, -race clean
GOFLAGS="-mod=vendor" go test ./tools/testgrid-publish/... -race -count=1
# Coverage: 75.7% (project gate: 75%)

# Local dry-run (no OCI or GCS needed)
make testgrid-publish
./dist/testgrid-publish \
  --bundle-dir /tmp/bundle \
  --bucket aicr-testgrid-staging \
  --source-class uat \
  --dry-run

5 cross-review passes, 35 issues found and fixed before opening.

Risk Assessment

  • Low — New isolated tool with no callers yet; only additive changes to .goreleaser.yaml and Makefile

Rollout notes: Tool is not invoked by any existing CI workflow. TG5 will wire it in a follow-up PR once GCS IAM is confirmed.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@srao-nv srao-nv requested review from a team as code owners June 24, 2026 15:32
@srao-nv srao-nv added the theme/ci-dx CI pipelines, developer experience, and build tooling label Jun 24, 2026
@github-actions

Copy link
Copy Markdown
Contributor

Welcome to AICR, @srao-nv! Thanks for your first pull request.

Before review, please ensure:

  • All commits are signed off per the DCO
  • CI checks pass (tests, lint, security scan)
  • The PR description explains the why behind your changes

A maintainer will review this soon.

@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.4%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.4%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/tools/testgrid-publish 77.13% (+77.13%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/tools/testgrid-publish/buildid.go 100.00% (+100.00%) 10 (+10) 10 (+10) 0 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/bundle.go 92.16% (+92.16%) 51 (+51) 47 (+47) 4 (+4) 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/coordinate.go 100.00% (+100.00%) 2 (+2) 2 (+2) 0 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/ctrf.go 94.44% (+94.44%) 54 (+54) 51 (+51) 3 (+3) 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/gcs.go 85.71% (+85.71%) 35 (+35) 30 (+30) 5 (+5) 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/main.go 55.24% (+55.24%) 105 (+105) 58 (+58) 47 (+47) 🌟
github.com/NVIDIA/aicr/tools/testgrid-publish/schema.go 100.00% (+100.00%) 1 (+1) 1 (+1) 0 🌟

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a new tools/testgrid-publish CLI that reads bundle metadata, resolves TestGrid coordinates, converts CTRF reports to deterministic JUnit XML, assembles started.json and finished.json, and uploads the artifacts to GCS in order. It also adds dry-run output, metadata schema helpers, bundle parsing and attestation loading, CTRF conversion, GCS upload helpers, unit tests, and build wiring in .goreleaser.yaml and Makefile.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#1409: Adds the pkg/recipe.CoordinateFor function and group/dashboard/tab resolution logic that tools/testgrid-publish/coordinate.go directly wraps and delegates to.

Suggested labels

area/cli, area/docs, area/ci

Suggested reviewers

  • njhensley
  • mchmarny
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning It covers the TG2 tool, but misses bundle-sourced timestamps, verified signer metadata, build-id semantics, and the required tools/** labeler/CODEOWNERS updates. Align started/finished timestamps with bundle data, implement verified signer extraction and monotonic idempotent build IDs, and add the required tools/** labeler/CODEOWNERS entries.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and clearly describes the new testgrid-publish tool added by this PR.
Description check ✅ Passed The description matches the changeset and summarizes the new tool, Makefile targets, and goreleaser wiring.
Out of Scope Changes check ✅ Passed The changes are all related to the new testgrid-publish tool and its supporting build and packaging wiring.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tg2-testgrid-publish

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/testgrid-publish/bundle_test.go`:
- Around line 28-37: The test table struct in bundle_test.go is misformatted and
needs standard Go formatting. Run gofmt on this file (and re-check with
golangci-lint) so the alignment of the tests struct fields is normalized; use
the tests table declaration as the location to verify the formatting fix.

In `@tools/testgrid-publish/bundle.go`:
- Around line 134-149: The loadPredicate() flow currently unmarshals predicate
without validating the statement envelope, so add a check on predicateType
before decoding the body. Update the statement struct in bundle.go to capture
predicateType alongside Predicate, and in loadPredicate() reject any value other
than attestation.PredicateTypeV1 with an invalid-request error before
json.Unmarshal of stmt.Predicate. Keep the existing predicate nil and decode
error handling intact.

In `@tools/testgrid-publish/ctrf.go`:
- Around line 96-110: The merged CTRF-to-JUnit conversion in ctrf.go is using
only t.Name for both sorting and testcase naming, which collapses distinct rows
that share a phase and test name. Update the logic around the tests slice sort
and jUnitTestCase construction to use a unique merged row identity that includes
the source suite/file or equivalent stable identifier from ctrf.TestResult, so
equal-name rows remain distinguishable and ordering stays deterministic. Keep
the phase prefix, but derive both the sort key and emitted Name from that unique
identity instead of t.Name alone.
- Around line 128-130: The CTRF status error assignment in ctrf.go has redundant
string(...) casts because t.Status is already a string, so update the jUnitError
Message and Text construction in the CTRF status handling path to use t.Status
directly and remove the unnecessary conversions to satisfy golangci-lint’s
unconvert check.

In `@tools/testgrid-publish/gcs_test.go`:
- Around line 72-82: The cancellation path in TestGcloudCopyCanceled is only
being smoke-tested, so the wrapped context error from gcloudCopy is not
verified. Update this test to assert the returned error when the canceled
context is passed into gcloudCopy, using the gcloudCopy and
TestGcloudCopyCanceled symbols to locate the branch that should return the
cancellation-related error instead of ignoring it.
- Around line 84-104: TestWriteGCS currently only asserts success, so it can
miss regressions in upload sequencing or destination paths. Update the test to
inspect the fake gcloud invocations made by writeGCS and verify the exact three
uploads happen in order: started.json, artifacts/junit.xml, and finished.json,
using the same object path patterns produced by writeGCS. Reference writeGCS and
the fake gcloud recorder in setupFakeGcloud so the contract is checked even if
the implementation changes.

In `@tools/testgrid-publish/gcs.go`:
- Around line 110-121: Prevent TestGrid publishes from overwriting an existing
build because gcloudCopy currently uses gcloud storage cp to a deterministic
groups/.../<build-id>/ path, which can replace already-published objects on
reruns. Update gcloudCopy to use a write path that enforces ifGenerationMatch=0
for object creation, and keep the retry/cancel handling in gcloudCopy intact.
Make sure the change applies to the copy behavior used for started.json,
artifacts/junit.xml, and finished.json so duplicate publishes fail instead of
clobbering existing data.

In `@tools/testgrid-publish/main_test.go`:
- Around line 101-109: In TestRunDryRun, the error handling around json.Marshal,
os.MkdirAll, and os.WriteFile is shadowing the existing err variable via short
declarations in the if initializers. Update the error checks to reuse the
already-declared err or rename the inner bindings so the test no longer triggers
govet’s shadow check, keeping the logic in main_test.go’s TestRunDryRun
unchanged.

In `@tools/testgrid-publish/main.go`:
- Around line 133-140: The MaterializeBundle call in main should not opt out of
OCI digest verification by setting AllowUnpinnedTag to true. Update the
VerifyOptions used in verifier.MaterializeBundle so tag-only --bundle values are
rejected and only digest-pinned refs are accepted; keep the immutable publish
contract enforced by removing the permissive tag handling and relying on CI to
supply pinned refs.
- Around line 95-104: The main pipeline currently starts with an unbounded
context, so registry and bundle materialization I/O in run() can proceed without
timeout control. Update main() to create a timeout-bounded parent context
instead of using context.Background(), and pass that context into run() so the
entire bundle/pull flow is covered. Keep the change aligned with run() and
runConfig usage so all I/O paths inherit the same bounded context.
- Around line 167-172: The fallback in loadPredicate should not synthesize
AttestedAt from time.Now() when the predicate is missing. Update the
predicate-loading path in main.go so that started.json, finished.json, and
build-id are derived only from bundle-provided data, and handle unsigned/local
bundles without introducing wall-clock timestamps. Keep the change localized
around loadPredicate and the predicate/attestation handling used by TG2
publication.
- Around line 145-147: The pull-failure handling in the main flow is re-wrapping
verifier errors as ErrCodeUnavailable, which loses the structured code already
returned by MaterializeBundle(). Update the error path around the pull step to
detect when the returned error is already a repo pkg/errors value with the
correct structured code and propagate it directly instead of wrapping it again.
Keep the existing pull context in the local flow, but avoid forcing all failures
through the ErrCodeUnavailable branch in main.
- Around line 262-280: The printDryRun helper in main.go always returns nil, so
its error return is dead code and should be removed. Update printDryRun to have
no error return, then adjust every caller of printDryRun to stop expecting or
handling an error and just invoke it directly. Use the printDryRun function name
and its call sites in the publish flow to locate and clean up the unused error
plumbing.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: afccd459-7a23-494d-8a95-a9ef5d01ab4d

📥 Commits

Reviewing files that changed from the base of the PR and between e878161 and e0f7670.

📒 Files selected for processing (16)
  • .goreleaser.yaml
  • Makefile
  • tools/testgrid-publish/buildid.go
  • tools/testgrid-publish/buildid_test.go
  • tools/testgrid-publish/bundle.go
  • tools/testgrid-publish/bundle_test.go
  • tools/testgrid-publish/coordinate.go
  • tools/testgrid-publish/coordinate_test.go
  • tools/testgrid-publish/ctrf.go
  • tools/testgrid-publish/ctrf_test.go
  • tools/testgrid-publish/gcs.go
  • tools/testgrid-publish/gcs_test.go
  • tools/testgrid-publish/main.go
  • tools/testgrid-publish/main_test.go
  • tools/testgrid-publish/schema.go
  • tools/testgrid-publish/schema_test.go

Comment thread tools/testgrid-publish/bundle_test.go Outdated
Comment thread tools/testgrid-publish/bundle.go Outdated
Comment on lines +96 to +110
// Sort tests by name within each phase for determinism.
tests := make([]ctrf.TestResult, len(r.Results.Tests))
copy(tests, r.Results.Tests)
sort.Slice(tests, func(i, j int) bool {
return tests[i].Name < tests[j].Name
})
// Derive Tests from the actual slice, not Summary.Tests, so the
// attribute is accurate even when the CTRF producer miscounts.
suite.Tests = len(tests)

for _, t := range tests {
dur := fmt.Sprintf("%.3f", float64(t.Duration)/1000.0) // ms → s
tc := jUnitTestCase{
Name: string(phase) + "/" + t.Name,
ClassName: string(phase),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Preserve a unique row key for merged phase reports.

The TG2 contract requires row naming to survive merged CTRF inputs. Here both the testcase name and the sort key collapse to phase + "/" + t.Name, so two results from the same phase with the same test name but different source files/suites become indistinguishable, and equal-name ordering is no longer well-defined. Build the emitted JUnit name and sort key from a unique merged row identity instead of t.Name alone.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/ctrf.go` around lines 96 - 110, The merged
CTRF-to-JUnit conversion in ctrf.go is using only t.Name for both sorting and
testcase naming, which collapses distinct rows that share a phase and test name.
Update the logic around the tests slice sort and jUnitTestCase construction to
use a unique merged row identity that includes the source suite/file or
equivalent stable identifier from ctrf.TestResult, so equal-name rows remain
distinguishable and ordering stays deterministic. Keep the phase prefix, but
derive both the sort key and emitted Name from that unique identity instead of
t.Name alone.

Comment thread tools/testgrid-publish/ctrf.go Outdated
Comment thread tools/testgrid-publish/gcs_test.go
Comment thread tools/testgrid-publish/main.go
Comment thread tools/testgrid-publish/main.go Outdated
Comment on lines +133 to +140
mat, err := verifier.MaterializeBundle(ctx,
verifier.VerifyOptions{
Input: cfg.bundleRef,
PlainHTTP: cfg.plainHTTP,
InsecureTLS: cfg.insecureTLS,
// Allow unpinned tags for CI convenience; digest-pinning is
// enforced at the GitHub Actions level (TG5 uses digest refs).
AllowUnpinnedTag: true,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Reject tag-only OCI refs here instead of opting out.

Line 140 disables verifier.MaterializeBundle's direct-OCI digest check, so the same --bundle value can resolve to different evidence over time. That breaks the PR's immutable publish contract and weakens provenance guarantees; let CI pass digest-pinned refs instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/main.go` around lines 133 - 140, The MaterializeBundle
call in main should not opt out of OCI digest verification by setting
AllowUnpinnedTag to true. Update the VerifyOptions used in
verifier.MaterializeBundle so tag-only --bundle values are rejected and only
digest-pinned refs are accepted; keep the immutable publish contract enforced by
removing the permissive tag handling and relying on CI to supply pinned refs.

Comment thread tools/testgrid-publish/main.go
Comment thread tools/testgrid-publish/main.go
Comment thread tools/testgrid-publish/main.go Outdated

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-persona review (Go correctness · Security · CI/CD · Test-quality · Operability), confirmed by senior meta-review

Solid, well-iterated tool — hermetic tests, bounded reads, fail-closed CTRF handling, deterministic output. Requesting changes on 2 merge-blockers + 3 majors before this is wired into TG5. The tool has no callers yet, so most latent bugs are major/minor — except two design decisions that future wiring will silently paper over.

Tier legend: 🛑 Blocker · 🔴 Major · 🟡 Minor · 🔵 Nitpick (inline below).

Top items

  • 🛑 No signature verification before publishing a verdict to a trust surface (AllowUnpinnedTag + no VerifySignature).
  • 🛑 Orphaned goreleaser build — compiles every release, ships nowhere.
  • 🔴 Signer columns permanently blank (readPointerFromAttestation is a hard error stub → 2 of 7 schema columns always "missing").
  • 🔴 Corrupt predicate masked as success (any parse error → time.Now() fallback).
  • 🔴 Upload-order invariant never tested (fake gcloud records nothing; a reordering bug passes green).

Considered and intentionally not flagged

  • gcloud arg/shell injection — clean (no shell; every arg prefixed with literal gs:///tmpdir).
  • Path traversal — clean (all filepath.Join segments are trusted constants).
  • mod_timestamp anchor refactor — valid YAML; &build-timestamp is defined earlier on the aicr build, so the alias is sound and reproducibility is unchanged.
  • Bounded reads, temp-dir/file-handle cleanup, error wrapping/codes — verified correct.
  • Dropped as noise: 2-min timeout sizing for 3 gcloud cold-starts (theoretical); make target omitting release build flags (expected dev-convenience divergence).

Review assembled from 5 reviewer personas and a senior meta-reviewer that independently re-verified each finding against source at 8535f2e (zero false positives).

Comment thread tools/testgrid-publish/main.go Outdated
Comment thread .goreleaser.yaml Outdated
Comment on lines +87 to +102
- id: testgrid-publish
binary: testgrid-publish
mod_timestamp: *build-timestamp
dir: tools/testgrid-publish
env:
- CGO_ENABLED=0
- GOFLAGS=-mod=vendor
flags:
- -trimpath
ldflags:
- -w -s -extldflags "-static"
goos:
- linux
goarch:
- amd64
- arm64

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛑 Blocker — this build compiles every release but ships nowhere.

The testgrid-publish build has no consumer: archives lists only id: aicr, kos covers only aicr/aicrd, and there is no dockers/extra_files entry for it. Unlike aicrd (archive-less because it ships via kos:), this binary has neither an archive nor an image — so goreleaser builds it on every release and never distributes it. Decide now: drop it from goreleaser and build via the new make testgrid-publish target inside the consuming workflow, or give it a real distribution path (archives id / extra_files).

🟡 Related (minor): sboms: (line ~113) uses artifacts: binary with no ids: filter, so if this build stays it gets an SBOM but no signing/provenance — half-in the supply-chain surface, which can read as "attested". Resolve together: either bring it under the attestation story or exclude it from SBOM with an ids filter + comment.

Comment on lines +297 to +299
func readPointerFromAttestation(_ string) (*attestation.Pointer, error) {
return nil, errors.New(errors.ErrCodeInternal, "signer extraction not yet implemented")
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Major — signer columns are permanently blank.

readPointerFromAttestation unconditionally returns an error, so extractSigner always returns ("", "") and the att.Signer dereference just above is unreachable dead code. Result: signer_identity/signer_issuer — 2 of the 7 "single source of truth" schema columns — render "missing" in the TestGrid UI for every build, including correctly-signed UAT bundles. That is the exact drift the schema.go comment warns against. Either implement the DSSE/Sigstore-cert parse, or remove the two columns + dead deref and emit a one-time slog.Warn so operators know why they're blank rather than assuming a bug. (Tracked under #1267.)

Comment on lines +167 to +172
pred, predErr := loadPredicate(dir)
if predErr != nil {
// Unsigned / local bundles may not have a full predicate; use now.
slog.Warn("could not load predicate, using current time", "error", predErr)
pred = &attestation.Predicate{AttestedAt: time.Now().UTC()}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Major — a corrupt predicate is silently masked as a successful publish.

Any loadPredicate error — including ErrCodeInvalidRequest for a truncated/corrupt statement.intoto.json, not just the not-found case — is downgraded to slog.Warn + a fabricated time.Now() timestamp. The comment only justifies the unsigned/local → no predicate case. This is the CLAUDE.md "fail-closed vs warn-and-continue" anti-pattern applied to a parse failure. Distinguish ErrCodeNotFound (benign → fall back) from a malformed predicate (→ return the error).

Comment on lines +84 to +105
func TestWriteGCS(t *testing.T) {
setupFakeGcloud(t, 0)

started := startedJSON{
Timestamp: 1749600000,
Metadata: map[string]string{metaKeyAICRVersion: "v0.1.0"},
}
finished := finishedJSON{
Timestamp: 1749600060,
Passed: true,
Result: "SUCCESS",
Metadata: started.Metadata,
}

err := writeGCS(context.Background(),
"aicr-testgrid-staging",
"groups/eks/h100-ubuntu/training/1749600000-abc12345",
started, finished, []byte("<testsuites/>"))
if err != nil {
t.Fatalf("writeGCS() unexpected error: %v", err)
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Major (test) — the upload-order invariant, the tool's core safety property, is never asserted.

The fake gcloud is #!/bin/sh (+ optional exit 1) — it copies and records nothing. TestWriteGCS only proves "three invocations exited 0"; it cannot catch a reordering of the uploads slice that would break TestGrid's finished.json-last completeness detection (the invariant gcs.go documents as mandatory). Make the fake record its dst argument to a temp log and assert the sequence ends started.jsonartifacts/junit.xmlfinished.json.

Comment on lines +19 to +46
func TestMetaKeys(t *testing.T) {
keys := MetaKeys()

// Verify length matches the declared constants.
const wantLen = 7
if len(keys) != wantLen {
t.Errorf("MetaKeys() length = %d, want %d", len(keys), wantLen)
}

// Verify all constants are present.
required := map[string]bool{
metaKeyAICRVersion: false,
metaKeyK8sVersion: false,
metaKeyK8sConstraint: false,
metaKeySignerIdentity: false,
metaKeySignerIssuer: false,
metaKeySourceClass: false,
metaKeyEvidenceDigest: false,
}
seen := make(map[string]bool)
for _, k := range keys {
if k == "" {
t.Error("MetaKeys() contains empty string")
}
if seen[k] {
t.Errorf("duplicate key %q in MetaKeys()", k)
}
seen[k] = true

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor (test) — this "contract" test is a self-consistency tautology.

TestMetaKeys checks MetaKeys() against the constants it's built from — it never verifies that the started.Metadata map actually emits exactly those keys, which is the stated purpose. Add a test asserting set-equality between the emitted started.json keys and MetaKeys(). (A true cross-repo contract with config-gen may be out of scope — at least re-document the limitation.)

Comment on lines +44 to +47
if int64(len(data)) > maxBytes {
return nil, errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("%s exceeds %d-byte size limit", path, maxBytes))
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor (test) — safety-critical branches are uncovered (grouped).

The branches a publish tool most wants tested are uncovered: this readBoundedFile oversize-rejection branch; convertCTRF's StatusPending mapping and the default: unknown-status fail-closed branch (ctrf.go); and TestGcloudCopyCanceled asserts nothing (_ = err). These are the realistic source of the 75.7%↔77.1% coverage drift. Add the oversize case, the pending + bogus-status cases, and either make the cancel test deterministic or drop it.

os.Exit(1)
}

ctx := context.Background()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor — pre-oras.Copy registry handshake runs under an unbounded context.

context.Background() flows into MaterializeBundle. oras.Copy itself is internally bounded, but the registry resolve/auth handshake before it is not — the realistic hang. Per CLAUDE.md, derive a bounded root here: ctx, cancel := context.WithTimeout(context.Background(), defaults.EvidenceBundlePushTimeout).

Comment thread tools/testgrid-publish/main.go Outdated
Comment on lines +284 to +288
func junitAllPassed(xml []byte) bool {
// Quick scan: if any <failure or <error element exists, the run failed.
s := string(xml)
return !strings.Contains(s, "<failure") && !strings.Contains(s, "<error")
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nitpick — junitAllPassed re-derives pass/fail by string-scanning the marshaled XML.

It scans for <failure/<error instead of using the exact suite.Failures/suite.Errors counts convertCTRF already computed. No false positive today (XML escapes names/messages), but brittle to any future formatting change. Have convertCTRF return the aggregate counts and compute allPassed from them.


// gcloudCopy uploads a single local file to GCS via `gcloud storage cp`.
// Authentication is handled by ADC / the gcloud CLI configuration.
func gcloudCopy(ctx context.Context, src, dst string) error {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔵 Nitpick — no gcloud preflight (PATH / auth).

Failure surfaces late as a generic exec error after pull/parse/stage. An exec.LookPath("gcloud") + lightweight auth probe up front (and in --dry-run) would make a green dry-run actually predict a working publish.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/testgrid-publish/bundle.go`:
- Around line 143-147: The validation in the bundle parsing flow still allows a
missing PredicateType to pass, which can let malformed statements reach
predicate decoding. Update the check in the bundle handling logic around
stmt.PredicateType so it fails closed unless the value is exactly
attestation.PredicateTypeV1, and treat an empty PredicateType as invalid rather
than special-casing it. Keep the existing invalid-request error path and make
sure the guard in this validation block enforces the required field
consistently.

In `@tools/testgrid-publish/schema_test.go`:
- Around line 154-158: The current test around MetaKeys() only checks that
declared keys appear in the dry-run output, but it does not verify the reverse
direction, so emitted metadata can still drift from the declared set. Update the
schema_test.go assertion near MetaKeys() and run() to also fail when any emitted
metaKey-style entry is not present in MetaKeys(), using the existing emitted
map/output parsing so the test enforces set equality rather than a one-way
subset check.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e4cf3bee-7405-4ec5-9290-8c00c3778328

📥 Commits

Reviewing files that changed from the base of the PR and between 8535f2e and 8feebd4.

📒 Files selected for processing (11)
  • .goreleaser.yaml
  • tools/testgrid-publish/bundle.go
  • tools/testgrid-publish/bundle_test.go
  • tools/testgrid-publish/coordinate.go
  • tools/testgrid-publish/ctrf.go
  • tools/testgrid-publish/ctrf_test.go
  • tools/testgrid-publish/gcs.go
  • tools/testgrid-publish/gcs_test.go
  • tools/testgrid-publish/main.go
  • tools/testgrid-publish/main_test.go
  • tools/testgrid-publish/schema_test.go
💤 Files with no reviewable changes (1)
  • .goreleaser.yaml

Comment on lines +143 to +147
if stmt.PredicateType != "" && stmt.PredicateType != attestation.PredicateTypeV1 {
return nil, errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("unsupported predicateType %q, expected %q",
stmt.PredicateType, attestation.PredicateTypeV1))
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Empty predicateType still bypasses validation — consider failing closed.

The new check rejects mismatched types, which addresses the bulk of the prior concern. But the stmt.PredicateType != "" guard lets a statement that omits predicateType entirely slip through to predicate decoding. Per the in-toto Statement spec, predicateType is a required field, so a legitimate AICR statement always sets it; allowing empty re-opens the original bypass for malformed/non-AICR statements whose payload happens to partially unmarshal into attestation.Predicate.

Suggested change
-	if stmt.PredicateType != "" && stmt.PredicateType != attestation.PredicateTypeV1 {
+	if stmt.PredicateType != attestation.PredicateTypeV1 {
 		return nil, errors.New(errors.ErrCodeInvalidRequest,
 			fmt.Sprintf("unsupported predicateType %q, expected %q",
 				stmt.PredicateType, attestation.PredicateTypeV1))
 	}

If empty predicateType is intentionally tolerated for some legacy/local bundle path, please confirm and document that rationale here.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if stmt.PredicateType != "" && stmt.PredicateType != attestation.PredicateTypeV1 {
return nil, errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("unsupported predicateType %q, expected %q",
stmt.PredicateType, attestation.PredicateTypeV1))
}
if stmt.PredicateType != attestation.PredicateTypeV1 {
return nil, errors.New(errors.ErrCodeInvalidRequest,
fmt.Sprintf("unsupported predicateType %q, expected %q",
stmt.PredicateType, attestation.PredicateTypeV1))
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/bundle.go` around lines 143 - 147, The validation in
the bundle parsing flow still allows a missing PredicateType to pass, which can
let malformed statements reach predicate decoding. Update the check in the
bundle handling logic around stmt.PredicateType so it fails closed unless the
value is exactly attestation.PredicateTypeV1, and treat an empty PredicateType
as invalid rather than special-casing it. Keep the existing invalid-request
error path and make sure the guard in this validation block enforces the
required field consistently.

Comment on lines +154 to +158
for _, k := range MetaKeys() {
if !emitted[k] {
t.Errorf("MetaKeys() key %q not found in dry-run output:\n%s", k, output)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Test verifies subset, not set-equality.

This addresses the prior tautology concern by asserting each MetaKeys() entry appears in the emitted metadata. However, it only checks the subset direction — it won't catch a metadata key that run() emits but MetaKeys() doesn't declare (the drift the prior comment asked to guard against). Consider also asserting no extra metaKey*-style keys are emitted beyond MetaKeys().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/schema_test.go` around lines 154 - 158, The current
test around MetaKeys() only checks that declared keys appear in the dry-run
output, but it does not verify the reverse direction, so emitted metadata can
still drift from the declared set. Update the schema_test.go assertion near
MetaKeys() and run() to also fail when any emitted metaKey-style entry is not
present in MetaKeys(), using the existing emitted map/output parsing so the test
enforces set equality rather than a one-way subset check.

@srao-nv srao-nv force-pushed the feat/tg2-testgrid-publish branch from 8feebd4 to b5db8a2 Compare June 25, 2026 19:05

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/testgrid-publish/buildid.go`:
- Around line 39-40: The build ID generator in buildID currently truncates the
digest too aggressively, which can cause collisions for different bundles
published in the same second. Update the build ID composition in
buildID/shortHex usage to use a longer digest suffix or the full digest instead
of the current 8-hex-character prefix, so distinct attestations cannot map to
the same <build-id>. Keep the change localized to the build ID formatting logic
and preserve the timestamp plus digest structure.

In `@tools/testgrid-publish/bundle_test.go`:
- Around line 179-254: TestLoadPredicate currently covers valid input and a
missing predicate, but it does not verify the envelope validation around
predicateType. Add two subtests in TestLoadPredicate to assert loadPredicate
rejects a statement when predicateType is missing and when predicateType is
present but does not match attestation.PredicateTypeV1, using the existing
loadPredicate and StatementFilename setup to locate the behavior.

In `@tools/testgrid-publish/bundle.go`:
- Around line 35-42: Wrap the raw filesystem failures in bundle loading with
structured AICR errors in the bundle helper that calls os.Open and io.ReadAll,
so parseCriteria and loadPredicate only see wrapped permission/read errors. Keep
already-structured pkg/errors values flowing unchanged upstream, but for bare
open/read failures add error code/context via errors.Wrap or
errors.WrapWithContext in the bundle read path around the existing
bundle-loading logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 52f9183a-290e-4df4-af70-fde35afa2709

📥 Commits

Reviewing files that changed from the base of the PR and between 8feebd4 and b5db8a2.

📒 Files selected for processing (16)
  • .goreleaser.yaml
  • Makefile
  • tools/testgrid-publish/buildid.go
  • tools/testgrid-publish/buildid_test.go
  • tools/testgrid-publish/bundle.go
  • tools/testgrid-publish/bundle_test.go
  • tools/testgrid-publish/coordinate.go
  • tools/testgrid-publish/coordinate_test.go
  • tools/testgrid-publish/ctrf.go
  • tools/testgrid-publish/ctrf_test.go
  • tools/testgrid-publish/gcs.go
  • tools/testgrid-publish/gcs_test.go
  • tools/testgrid-publish/main.go
  • tools/testgrid-publish/main_test.go
  • tools/testgrid-publish/schema.go
  • tools/testgrid-publish/schema_test.go

Comment on lines +39 to +40
shortDigest := shortHex(digest, 8)
return fmt.Sprintf("%d-%s", ts, shortDigest)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

8 hex chars are not enough to keep build IDs distinct.

Two different bundles attested in the same second and sharing the same last 32 digest bits will resolve to the same <build-id> and publish into the same GCS row. That violates the distinct-run requirement for this path; use a longer digest suffix or the full digest.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/buildid.go` around lines 39 - 40, The build ID
generator in buildID currently truncates the digest too aggressively, which can
cause collisions for different bundles published in the same second. Update the
build ID composition in buildID/shortHex usage to use a longer digest suffix or
the full digest instead of the current 8-hex-character prefix, so distinct
attestations cannot map to the same <build-id>. Keep the change localized to the
build ID formatting logic and preserve the timestamp plus digest structure.

Comment on lines +179 to +254
func TestLoadPredicate(t *testing.T) {
t.Run("valid statement", func(t *testing.T) {
dir := t.TempDir()
ts := time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC)
pred := attestation.Predicate{
AttestedAt: ts,
AICRVersion: "v0.12.3",
}
predBytes, _ := json.Marshal(pred)

stmt := map[string]any{
"_type": "https://in-toto.io/Statement/v0.1",
"predicateType": attestation.PredicateTypeV1,
"predicate": json.RawMessage(predBytes),
}
stmtBytes, _ := json.Marshal(stmt)
if err := os.WriteFile(
filepath.Join(dir, attestation.StatementFilename),
stmtBytes, 0o600,
); err != nil {
t.Fatal(err)
}

got, err := loadPredicate(dir)
if err != nil {
t.Fatalf("loadPredicate() unexpected error: %v", err)
}
if got.AICRVersion != "v0.12.3" {
t.Errorf("AICRVersion = %q, want %q", got.AICRVersion, "v0.12.3")
}
if !got.AttestedAt.Equal(ts) {
t.Errorf("AttestedAt = %v, want %v", got.AttestedAt, ts)
}
})

t.Run("missing statement returns error", func(t *testing.T) {
_, err := loadPredicate(t.TempDir())
if err == nil {
t.Fatal("expected error for missing statement.intoto.json")
}
})

t.Run("invalid json returns error", func(t *testing.T) {
dir := t.TempDir()
if err := os.WriteFile(
filepath.Join(dir, attestation.StatementFilename),
[]byte("not json"), 0o600,
); err != nil {
t.Fatal(err)
}
_, err := loadPredicate(dir)
if err == nil {
t.Fatal("expected error for invalid JSON")
}
})

t.Run("statement with no predicate field returns error", func(t *testing.T) {
dir := t.TempDir()
stmt := map[string]any{
"_type": "https://in-toto.io/Statement/v0.1",
"predicateType": attestation.PredicateTypeV1,
// no "predicate" field
}
stmtBytes, _ := json.Marshal(stmt)
if err := os.WriteFile(
filepath.Join(dir, attestation.StatementFilename),
stmtBytes, 0o600,
); err != nil {
t.Fatal(err)
}
_, err := loadPredicate(dir)
if err == nil {
t.Fatal("expected error when predicate field is missing")
}
})
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add tests for missing and mismatched predicateType.

TestLoadPredicate covers valid JSON and missing predicate, but it never locks in the envelope rule that malformed statements must be rejected when predicateType is absent or wrong. Add both cases here so that contract cannot regress silently.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/testgrid-publish/bundle_test.go` around lines 179 - 254,
TestLoadPredicate currently covers valid input and a missing predicate, but it
does not verify the envelope validation around predicateType. Add two subtests
in TestLoadPredicate to assert loadPredicate rejects a statement when
predicateType is missing and when predicateType is present but does not match
attestation.PredicateTypeV1, using the existing loadPredicate and
StatementFilename setup to locate the behavior.

Comment thread tools/testgrid-publish/bundle.go
srao-nv added 14 commits June 25, 2026 15:47
Implements tools/testgrid-publish for the AICR TestGrid publish path
(#1267).

Pulls an evidence OCI bundle (or accepts a pre-materialized --bundle-dir),
converts CTRF phase reports to jUnit XML, and writes TestGrid build files
(started.json / artifacts/junit.xml / finished.json) to GCS in the correct
order so the TestGrid updater discovers complete builds only.

GCS hierarchy:
  gs://<bucket>/groups/<service>/<accel-os>/<intent[-platform]>/<build-id>/

Coordinate mapping via CoordinateFor():
  Group     = service  (eks, gke, aks, …)
  Dashboard = accelerator-os  (h100-ubuntu, gb200-cos, …)
  Tab       = intent[-platform]  (training-kubeflow, inference, …)

Seven started.json metadata keys: aicr_version, k8s_version,
k8s_constraint, signer_identity, signer_issuer, source_class,
evidence_digest.

Test coverage: 77.1% (gate: 70%). 23 tests including integration-style
dry-run tests that exercise the full pipeline without OCI or GCS access.
- F1: populate k8s_version from pred.Fingerprint.K8sVersion.Value
- F2: remove dead "see attestation.go" stub comment, add TODO with issue ref
- F3: use t.Cleanup for os.Stdout restore in all dry-run tests
- F4: TestRunDryRunFailedTests now reads and asserts passed=false in output
- F5: use EvidenceBundlePushTimeout (2 min) instead of EvidenceBundleBuildTimeout
      (60 s, "local I/O only") for GCS network uploads
- F6: replace hand-rolled contains/containsStr with strings.Contains
- F7: fix buildid.go comment — monotonic for 10-digit timestamps (2001–2286),
      not "zero-padded"
- F8: use t.Setenv in setupFakeGcloud for race-safe PATH mutation

Coverage: 76.8% (gate: 70%)
- G1: add explicit ctrf.StatusPassed case + default branch that emits
      <error> for unrecognized CTRF statuses (fail closed)
- G2: replace unbounded os.ReadFile with readBoundedFile helper
      (MaxRecipePOSTBytes for recipe.yaml, MaxBundlePOSTBytes for
      statement.intoto.json and CTRF phase files) per CLAUDE.md anti-pattern
- G3: fix RecipeCriteria.K8sVersion comment ("major.minor" → full semver)
- G4: replace fixed [4096]byte pipe buffer with io.ReadAll in all 3
      dry-run test helpers
- G5: count suite.Skipped in-loop (consistent with Failures/Errors)
      instead of pre-populating from CTRF Summary
- G6: add slog.Warn when attestation has a non-nil pointer but nil Signer
- G7: remove unnecessary _ = bundleDir (parameter, not local variable)

Coverage: 76.1% (gate: 70%)
- #2/#14: normalize criteria fields (TrimSpace + ToLower) in parseCriteria
  so "EKS", " eks ", "Ubuntu" all map to the correct lowercase GCS path
  segment that config-gen registers
- #7: derive suite.Tests from len(tests) after sort, not Summary.Tests,
  so the <testsuite tests="N"> attribute is accurate even when the CTRF
  producer miscounts
- #10: reject bundles where all phases have zero test results; a crashed
  test runner that writes empty CTRF files would otherwise publish as
  SUCCESS with zero testcases

Coverage: 77.1% (gate: 70%)
reviewer-go:
- G1: replace stderrors.New with errors.New(ErrCodeMethodNotAllowed) in
      readPointerFromAttestation stub; remove now-unused stderrors import
- G2: add slog.Info("writing to GCS") before writeGCS so operators can
      see progress during slow uploads instead of waiting up to 2 min
- G3: rename local var buildID→bid to avoid shadowing the package-level
      buildID() function
- G4: sort metadata keys before printing in printDryRun for deterministic
      output (enables diff-based CI assertions)

reviewer-cicd:
- C3: add testgrid-publish build entry to .goreleaser.yaml (linux
      amd64/arm64, no SLSA hooks — internal ops tool)
- C4: add make testgrid-publish and make testgrid-publish-dryrun targets
      (BUNDLE_DIR=<path> for local smoke testing)
- Remove dead slog.Warn in extractSigner — stub always errors before
  the att.Signer==nil branch is reachable
- Copy started.Metadata into finished instead of sharing the map pointer
- ErrCodeMethodNotAllowed → ErrCodeInternal for the signer stub (HTTP
  405 vs 500 — stub is a server-side capability gap, not a method error)
- main_test.go: capture printDryRun error in separate var so io.ReadAll
  cannot overwrite it before the nil-check (dead assertion fix)
- bundle.go: use os.IsNotExist to distinguish file-not-found from
  structured errors (e.g. size-limit ErrCodeInvalidRequest) so
  parseCriteria and loadPredicate don't re-wrap already-coded errors
  as ErrCodeNotFound per CLAUDE.md "don't double-wrap" rule
- gcs.go: check ctx.Err() after cmd.Run() fails so mid-upload context
  cancellation surfaces as ErrCodeUnavailable instead of ErrCodeInternal
PR #1409 (feat(recipe): shared coordinate mapping) is now merged.
Replace the local coordinate.go implementation with a thin wrapper that
delegates to recipe.CoordinateFor(*recipe.Criteria).

RecipeCriteria is retained for the testgrid-specific K8sVersion and
K8sConstraint fields which have no equivalent in recipe.Criteria.
toRecipeCriteria() converts between the two for the coordinate call.

The /"-in-dimension rejection guard added by pkg/recipe is now applied
automatically; existing tests are unchanged since all test values are
slash-free.
- ctrf.go: remove unnecessary string() casts (unconvert)
- ctrf_test.go: apply De Morgan's law (staticcheck QF1001)
- main_test.go: rename inner err vars to avoid shadow (govet)
- main.go: remove error return from printDryRun (unparam)
- bundle_test.go, main.go: gofmt formatting
- goreleaser: remove testgrid-publish build entry — it compiled on
  every release but had no archive/distribution path; make testgrid-
  publish is the correct build path for CI consumers (TG5)
- main.go: fail closed on corrupt predicate — only fall back to
  time.Now() when statement.intoto.json is absent (ErrCodeNotFound),
  not when it fails to parse (ErrCodeInvalidRequest); add stderrors
  import for the stderrors.Is check
- main.go: add "(dev only)" to --insecure-tls flag to match --plain-http
- gcs_test.go: record and assert upload order in TestWriteGCS — fake
  gcloud now appends each GCS destination to GCLOUD_RECORD; test
  verifies started.json → artifacts/junit.xml → finished.json sequence
- gcs_test.go: TestGcloudCopyCanceled now asserts error is non-nil
B1: strengthen AllowUnpinnedTag comment — explicit TG5 rationale
    (digest-pinning enforced at workflow level, not here)

minors/nitpicks:
- main.go: bounded OCI pull context (EvidenceBundlePushTimeout, 2 min)
- main.go: --bundle-dir auto-resolves nested summary-bundle/ subdir so
  users pointing at the bundle parent get a clear path rather than
  "recipe.yaml not found"
- ctrf.go: convertCTRF returns ([]byte, bool, error) — allPassed derived
  from accumulated suite.Failures/Errors counts, not XML string scan;
  removes junitAllPassed entirely
- coordinate.go: comment notes validation enforced by CoordinateFor
- gcs.go: exec.LookPath("gcloud") preflight before staging/upload;
  improved idempotency comment on upload loop
- bundle_test.go: TestParseCriteriaOversizeFile covers readBoundedFile
  size-limit rejection branch
- ctrf_test.go: StatusPending test case covers pending→skipped mapping
- schema_test.go: TestMetaKeysMatchEmittedMetadata cross-checks
  MetaKeys() against keys actually emitted in a dry-run
- main_test.go: TestRunDryRunSummaryBundleAutoResolve + TestWriteGCSGcloudNotFound

Coverage: 76.6% (gate: 75%)
…ng verifier errors

- bundle.go: capture predicateType in loadPredicate and reject any value
  other than PredicateTypeV1 before decoding the predicate body
- main.go: use errors.PropagateOrWrap for MaterializeBundle errors so
  structured error codes from the verifier are not overwritten by
  ErrCodeUnavailable
…ment

- ctrf_test.go: add test case for unrecognized CTRF status — verifies
  the default branch emits <error> and allPassed=false (fail-closed)
- main.go: update readPointerFromAttestation doc comment to accurately
  describe current stub behavior and add a warning about the swallow
  pattern needing revisit when DSSE parsing lands (#1267)
@srao-nv srao-nv force-pushed the feat/tg2-testgrid-publish branch from b5db8a2 to 192457b Compare June 25, 2026 19:47

@njhensley njhensley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (head 192457b9) — addressed; approving ✅

Re-checked every prior finding against the new sources, verified the dependency APIs, and ran the package tests locally (pass, 77.1% coverage, go vet clean). Both blockers and all three majors are resolved or explicitly defended; what remains is two minor, low-risk items on a tool with no callers yet. Clearing my earlier CHANGES_REQUESTED.

Blockers

  • 🛑→✅ Orphaned goreleaser build — build block removed from .goreleaser.yaml; built via make testgrid-publish in the workflow instead. Also dissolves the SBOM-without-provenance concern.
  • 🛑→⚠️ No signature verificationAllowUnpinnedTag kept, now explicitly defended in-code with digest-pinning deferred to TG5 at the workflow level. Acceptable while unwired — please make sure TG5 actually enforces the pin.

Majors

  • 🔴→✅ Corrupt predicate — now fails closed; only a missing predicate falls back to time.Now(). Bonus predicateType validation added.
  • 🔴→✅ Upload-order invariant — fake gcloud records destinations and TestWriteGCS now asserts started→junit→finished. 👍
  • 🔴→⚠️ Signer columns blank — still a stub, but the doc is now honest and warns future implementers; deferred to #1267.

Minors/nitpicks fixed: unbounded pull context (now bounded), --bundle-dir summary-bundle auto-resolution, counts-based allPassed (no more XML string-scan), gcloud preflight (LookPath), --insecure-tls "(dev only)" label, real MetaKeys()↔started.json contract test, and coverage for oversize/pending/unknown-status/cancel paths.

Remaining (non-blocking follow-ups, inline below): F6 criteria enum validation and F12 --bundle-dir build-id collision — both minor and fine to defer.

Nice iteration — the test hardening in particular is exactly right.

// *recipe.Criteria required by pkg/recipe.CoordinateFor. Normalization
// (lowercase, trimSpace) has already been applied by parseCriteria.
// Validation (empty, "any", "/") is enforced by recipe.CoordinateFor itself.
func toRecipeCriteria(c RecipeCriteria) *recipe.Criteria {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Follow-up (non-blocking) — F6. Criteria are still raw-cast here; recipe.CoordinateFor only rejects empty/any//, so a hostile recipe.yaml (service: "foo bar", accelerator: "..") still resolves to a valid coordinate and publishes under an arbitrary prefix in the operator's bucket. Confirmed not an escape/RCE (the gs:// prefix + /-rejection neutralize traversal/flag injection), so fine to defer — but validating against the registry allowlist (ParseService/ValidateWithRegistry) before building the coordinate would close the namespace-pollution gap.

if len(digest) >= n {
return digest[len(digest)-n:]
}
if digest != "" {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Follow-up (non-blocking) — F12. For --bundle-dir, digest="local" so shortHex returns "local" and the build-id collapses to {now-unix}-local; two distinct local bundles validated in the same second collide and overwrite in GCS. Dev-path only, so deferring is fine — folding a content hash into the suffix when digest=="local" would make local runs collision-safe.

srao-nv added a commit that referenced this pull request Jun 25, 2026
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) is automatically published to the AICR TestGrid dashboard.

Trigger: workflow_run on "UAT GCP" / "UAT AWS" (success or failure —
  failing runs are valid TestGrid columns showing what broke).
  Also supports workflow_dispatch for manual backfills.

Pipeline:
  1. Download evidence pointer artifact from the triggering UAT run
  2. Extract OCI bundle ref + digest from pointer.yaml
  3. Authenticate to GCP via WIF (publish SA, groups/ prefix write-only)
  4. Build testgrid-publish from source
  5. Pull OCI bundle → parse predicate → write GCS in correct order

Security:
  - No hardcoded project IDs or account emails in source; all config
    comes from repository variables (vars.GCP_PROJECT_NUMBER,
    vars.GCP_PROJECT_ID)
  - All actions pinned to commit SHAs
  - bundle_ref passed through env: not inline ${{ }} to prevent
    script injection
  - Minimal permissions: contents:read + id-token:write only

Auth: WIF pool aicr-testgrid-<env>-github (aicr-testgrid Terraform).
Workflow filename starts with "testgrid-publish" to satisfy the
WIF attribute_condition in wif-publish.tf.

Defaults to prod (gs://aicr-testgrid). Override to staging via
workflow_dispatch(environment=staging) for testing.

Pending before this can merge:
  - Apply prod Terraform in aicr-testgrid repo
  - Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
  - TG2 (#1447) must be merged first
srao-nv added a commit that referenced this pull request Jun 25, 2026
- goreleaser: add testgrid-publish build (linux amd64+arm64, -trimpath,
  static) + archive entry so it ships as a release asset alongside aicr
- TG5 workflow: download pre-built binary from latest release via
  gh release download; fall back to go build for pre-release commits
- Eliminates Go compile step from the hot path of every UAT publish run

Depends on TG2 (#1447) — tools/testgrid-publish/ must exist at merge.
srao-nv added a commit that referenced this pull request Jun 25, 2026
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) automatically publishes to the AICR TestGrid dashboard.

Trigger: workflow_run on "UAT - GCP" / "UAT - AWS" (success or failure).
Also supports workflow_dispatch (main branch only) for backfills.

Pipeline:
  1. Validate producer run is from this repository (pre-credential guard)
  2. Download evidence pointer artifact; distinguish network errors from
     benign "conformance did not complete" skips
  3. Extract OCI bundle ref + sha256 digest from pointer.yaml (both validated)
  4. Authenticate to GCP via WIF (publish SA, objectCreator on groups/ only)
  5. Build testgrid-publish from source (vendor — avoids integrity risk of
     downloading a release artifact into a privileged WIF context)
  6. Publish: pull OCI bundle → parse predicate → write GCS in order

Also adds testgrid-publish build entry to .goreleaser.yaml (linux
amd64+arm64) so the binary ships as a release artifact for local use.
Workflow always builds from source; goreleaser is for distribution only.

Security hardening:
  - Top-level permissions: contents: read (least privilege default)
  - Job-level permissions: contents:read + actions:read + id-token:write
  - No project IDs, numbers, or SA emails in source — all from vars.*
  - All actions pinned to SHA
  - All user-controlled values through env: (no inline ${{ }} in run:)
  - set -euo pipefail in every shell script
  - workflow_dispatch restricted to refs/heads/main
  - concurrency block (one publish per triggering run, no cancel)
  - timeout-minutes: 15

Environment-aware: defaults to prod (gs://aicr-testgrid). Override to
staging via workflow_dispatch(environment=staging).

Pending before this can merge:
  - TG2 (#1447) must be merged first (tools/testgrid-publish must exist)
  - Apply prod Terraform in aicr-testgrid (creates aicr-testgrid-prod-github
    WIF pool and aicr-testgrid-prod-publish SA)
  - Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
@mchmarny mchmarny enabled auto-merge (squash) June 26, 2026 12:52
@mchmarny mchmarny merged commit ffe7522 into main Jun 26, 2026
33 checks passed
@mchmarny mchmarny deleted the feat/tg2-testgrid-publish branch June 26, 2026 13:03
srao-nv added a commit that referenced this pull request Jun 26, 2026
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) automatically publishes to the AICR TestGrid dashboard.

Trigger: workflow_run on "UAT - GCP" / "UAT - AWS" (success or failure).
Also supports workflow_dispatch (main branch only) for backfills.

Pipeline:
  1. Validate producer run is from this repository (pre-credential guard)
  2. Download evidence pointer artifact; distinguish network errors from
     benign "conformance did not complete" skips
  3. Extract OCI bundle ref + sha256 digest from pointer.yaml; validate
     both fields and enforce ghcr.io/nvidia/...@sha256:<64hex> format
     before writing to GITHUB_OUTPUT (guards against newline injection)
  4. Authenticate to GCP via WIF (publish SA, objectCreator on groups/ only)
  5. Login to GHCR (packages: read) so testgrid-publish can pull the bundle
  6. Build testgrid-publish from source (vendor — no integrity risk from
     downloading a release artifact into a privileged WIF context)
  7. Publish: pull OCI bundle → parse predicate → write GCS in order

Security hardening:
  - Top-level permissions: contents: read (least privilege default)
  - Job-level: contents:read + actions:read + packages:read + id-token:write
  - No project IDs/numbers in source — all from vars.GCP_PROJECT_NUMBER
    and vars.GCP_PROJECT_ID (repo variables, not secrets)
  - All actions pinned to SHA
  - All user-controlled values through env: (no inline ${{ }} in run:)
  - set -euo pipefail in every shell script
  - workflow_dispatch restricted to refs/heads/main
  - workflow_dispatch inputs use type: choice / type: string to prevent
    typos silently deriving non-existent WIF pools
  - BUNDLE_REF validated against ghcr.io/nvidia/...@sha256:<64hex> regex
  - concurrency block (one publish per triggering run, no cancel)
  - timeout-minutes: 15
  - Release branches intentionally allowed (qualify release → publish)

Environment-aware: defaults to prod (gs://aicr-testgrid). Override to
staging via workflow_dispatch(environment=staging).

Pending before this can merge:
  - TG2 (#1447) must be merged first (tools/testgrid-publish must exist)
  - Apply prod Terraform in aicr-testgrid (creates aicr-testgrid-prod-github
    WIF pool and aicr-testgrid-prod-publish SA)
  - Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
srao-nv added a commit that referenced this pull request Jul 2, 2026
Wires tools/testgrid-publish into CI so every completed UAT run
(GCP or AWS) automatically publishes to the AICR TestGrid dashboard.

Trigger: workflow_run on "UAT - GCP" / "UAT - AWS" (success or failure).
Also supports workflow_dispatch (main branch only) for backfills.

Pipeline:
  1. Validate producer run is from this repository (pre-credential guard)
  2. Download evidence pointer artifact; distinguish network errors from
     benign "conformance did not complete" skips
  3. Extract OCI bundle ref + sha256 digest from pointer.yaml; validate
     both fields and enforce ghcr.io/nvidia/...@sha256:<64hex> format
     before writing to GITHUB_OUTPUT (guards against newline injection)
  4. Authenticate to GCP via WIF (publish SA, objectCreator on groups/ only)
  5. Login to GHCR (packages: read) so testgrid-publish can pull the bundle
  6. Build testgrid-publish from source (vendor — no integrity risk from
     downloading a release artifact into a privileged WIF context)
  7. Publish: pull OCI bundle → parse predicate → write GCS in order

Security hardening:
  - Top-level permissions: contents: read (least privilege default)
  - Job-level: contents:read + actions:read + packages:read + id-token:write
  - No project IDs/numbers in source — all from vars.GCP_PROJECT_NUMBER
    and vars.GCP_PROJECT_ID (repo variables, not secrets)
  - All actions pinned to SHA
  - All user-controlled values through env: (no inline ${{ }} in run:)
  - set -euo pipefail in every shell script
  - workflow_dispatch restricted to refs/heads/main
  - workflow_dispatch inputs use type: choice / type: string to prevent
    typos silently deriving non-existent WIF pools
  - BUNDLE_REF validated against ghcr.io/nvidia/...@sha256:<64hex> regex
  - concurrency block (one publish per triggering run, no cancel)
  - timeout-minutes: 15
  - Release branches intentionally allowed (qualify release → publish)

Environment-aware: defaults to prod (gs://aicr-testgrid). Override to
staging via workflow_dispatch(environment=staging).

Pending before this can merge:
  - TG2 (#1447) must be merged first (tools/testgrid-publish must exist)
  - Apply prod Terraform in aicr-testgrid (creates aicr-testgrid-prod-github
    WIF pool and aicr-testgrid-prod-publish SA)
  - Set vars.GCP_PROJECT_NUMBER and vars.GCP_PROJECT_ID in repo settings
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL theme/ci-dx CI pipelines, developer experience, and build tooling

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TG2 — Publish path (tools/testgrid-publish: OCI bundle → CTRF → jUnit → GCS feed)

3 participants