Skip to content

ci(evidence): warning-only recipe-evidence drift gate for PRs#1065

Merged
mchmarny merged 10 commits into
NVIDIA:mainfrom
njhensley:ci/recipe-evidence-gate
May 28, 2026
Merged

ci(evidence): warning-only recipe-evidence drift gate for PRs#1065
mchmarny merged 10 commits into
NVIDIA:mainfrom
njhensley:ci/recipe-evidence-gate

Conversation

@njhensley

@njhensley njhensley commented May 27, 2026

Copy link
Copy Markdown
Member

Summary

Adds a warning-only PR gate that surfaces stale or missing recipe-evidence bundles. For each leaf overlay touched by the diff, the gate computes the current canonical digest, compares it to the signed predicate.recipe.digest (via aicr evidence verify), and posts a sticky comment.

Motivation / Context

PR #1055 landed aicr evidence digest, giving us a cheap local way to detect drift between a signed evidence pointer and the current recipe. This PR wires it into CI so contributors are nudged — not blocked — to refresh evidence when they touch a recipe.

The gate is intentionally warning-only because regenerating evidence requires a cluster matching the recipe's criteria, and most contributors don't have that hardware on hand. Surfacing the drift in the PR comment makes it visible for reviewers and gives the maintainer-with-hardware a chance to refresh before merge.

Fixes: #751
Related: ADR-007, follow-up to #1055

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Build/CI/tooling

Component(s) Affected

  • Other: GitHub Actions workflows + helper shell script (.github/)

Implementation Notes

Three files:

  • .github/scripts/recipe-evidence-check.sh — shared gate logic. Discovers leaf overlays touched by the BASE_SHA...HEAD_SHA diff, runs aicr evidence digest + aicr evidence verify against each, builds a Markdown report. Runnable locally.
  • .github/workflows/recipe-evidence.yamlpull_request trigger, read-only token, builds aicr, invokes the script, uploads the report as an artifact.
  • .github/workflows/recipe-evidence-comment.yamlworkflow_run trigger, downloads the artifact, posts/updates the sticky PR comment under the standard write-scoped token. PR association is re-derived from workflow_run.head_sha (not the artifact, which is fork-attacker-controllable).

Why the two-workflow split: standard GitHub pattern (matches the existing fern-docs-preview-build + fern-docs-preview-comment). Keeps the PR-comment write token out of the untrusted PR-branch job.

Leaf discovery rules:

  1. Overlay file itself changed.
  2. Ancestor overlay in the spec.base chain changed.
  3. A component valuesFile referenced by this overlay (or any ancestor) changed.

Plus a broad-impact path: if recipes/registry.yaml or recipes/overlays/base.yaml changed, every leaf is potentially affected and the report flags the breadth.

Local invocation (same code paths CI uses):

make build
AICR=$(pwd)/bin/aicr \
  BASE_SHA=$(git merge-base origin/main HEAD) \
  HEAD_SHA=$(git rev-parse HEAD) \
  REPORT_OUT=/tmp/report.md \
  REPO_URL=https://github.com/NVIDIA/aicr \
  .github/scripts/recipe-evidence-check.sh
cat /tmp/report.md

Known limitations (tracked as follow-ups in the script header):

  • Pointer file name is computed as basename overlay .yaml, but aicr validate --emit-attestation uses the criteria-derived slug from RecipeNameFor. Non-canonical overlay names will be reported as missing-pointer until they're aligned.
  • Discovery walks spec.base ancestors but not spec.mixins; mixin/check/component edits outside literal valuesFile refs aren't promoted. Parity with the existing kwok-recipes Tier 2 classifier.
  • aicr evidence verify fetches OCI artifacts from PR-controlled URLs — an allow-list of trusted registries is a follow-up.

Example reports

These cover the three most representative response states. All render under the bot identity as a sticky comment (marker <!-- recipe-evidence-gate -->).

Clean — no recipes touched

Recipe evidence check

No leaf overlays affected by this PR.

This gate is warning-only and never blocks merge.

Single recipe, evidence on file, digest drifted (the marquee case)

Recipe evidence check

Affected leaf overlays: 1

Recipe Pointer Verify Digest match
h100-eks-ubuntu-training ✅ present ✅ passed ⚠️ stale (a3f4c1d2e890… vs current b7e2d0a6f15c…)

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

Broad-impact change (registry.yaml or base.yaml touched)

Recipe evidence check

Broad impact: recipes/registry.yaml or recipes/overlays/base.yaml changed; every leaf recipe is potentially affected. The list below covers all of them — each one would ideally have refreshed evidence before merge.

Affected leaf overlays: 42

Recipe Pointer Verify Digest match
gb200-eks-inference ⚠️ missing
gb200-eks-training ✅ present ✅ passed ⚠️ stale (a3f4c1d2e890… vs current b7e2d0a6f15c…)
(38 rows elided in this preview)
h100-oke-ubuntu-training ⚠️ missing

Beyond MAX_ROWS (default 80), a truncation row replaces the rest: … +N more (truncated; raise MAX_ROWS or split the PR).

Testing

Ran the script locally against two SHA ranges on this branch:

Range Affected leaves Output
merge-base..HEAD (current branch — no overlay edits) 0 "No leaf overlays affected." ✓
8b493978^..8b493978 (PR #977 — single leaf) 1 (gb200-eks-ubuntu-inference-dynamo) missing-pointer row + refresh instructions ✓

Pointer-present + verify branch wasn't exercised live — recipes/evidence/ doesn't exist in the tree yet. Path traced by inspection in recipe-evidence-check.sh (lines 205-246).

make qualify: not run (YAML/shell-only change, no Go). yamllint and actionlint run in CI.

Risk Assessment

  • Low — Additive CI workflows + helper script. Gate is warning-only: pull-requests: write is on the comment workflow only, posted body is verbatim script output, no failing exit code blocks the merge queue.

Rollout notes: Gate activates automatically on PRs touching recipes/**. No pre-existing pointers means most early PRs will show ⚠️ missing rows — that's expected and serves as a backlog of what to generate.

Checklist

  • Tests pass locally (script verified against two ranges)
  • Linter passes — N/A locally (no Go); CI runs yamllint + actionlint
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — script self-tested via local runs
  • I updated docs if user-facing behavior changed — gate trailer points to existing ADR-007; contributor-facing nudge lives in the report itself
  • Changes follow existing patterns in the codebase (two-workflow split mirrors fern-docs-preview-*)
  • Commits are cryptographically signed (git commit -S)

njhensley added 2 commits May 27, 2026 09:30
Adds the PR-side recipe-evidence gate:

- recipe-evidence.yaml — `pull_request` workflow that runs in the
  untrusted PR context, computes the current recipe digest via
  `aicr evidence digest`, compares it to the signed predicate fetched
  through `aicr evidence verify`, and uploads a result artifact.
- recipe-evidence-comment.yaml — `workflow_run` companion that
  consumes the artifact under the elevated default-token context and
  posts/updates the drift summary on the PR.

The split (verify in `pull_request`, comment in `workflow_run`) is the
standard GitHub pattern for keeping write-scoped tokens out of the
untrusted PR workflow.
Moves the leaf-discovery + per-recipe evidence check pipeline out of
recipe-evidence.yaml and into .github/scripts/recipe-evidence-check.sh.

- Verify workflow shrinks to: checkout, build aicr, run script, upload
  artifact. The 280-line inline shell block in YAML is now a real file
  with shellcheck-able structure.
- Same script runs locally for contributor self-preview — the header
  documents the AICR/BASE_SHA/HEAD_SHA/REPORT_OUT/REPO_URL contract,
  and `make build && bash .github/scripts/recipe-evidence-check.sh`
  reproduces what CI produces.
- Future work (registry allow-list for `aicr evidence verify`,
  mixin-aware discovery) edits the script, not the workflow YAML.

Two-workflow split (verify on `pull_request` → comment on
`workflow_run`) is preserved — still the standard pattern for keeping
write-scoped tokens out of the untrusted PR job.
@njhensley njhensley requested a review from a team as a code owner May 27, 2026 20:22
@njhensley njhensley self-assigned this May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cef44511-0c53-4d5b-b2f7-8951c0ad27d5

📥 Commits

Reviewing files that changed from the base of the PR and between 46f78df and 98f32be.

📒 Files selected for processing (2)
  • .github/scripts/recipe-evidence-check.sh
  • .github/workflows/recipe-evidence-comment.yaml

📝 Walkthrough

Walkthrough

This PR adds a non-blocking evidence verification gate for recipe overlays. It introduces a Bash script that discovers affected leaf overlays from the PR diff and verifies their signed-digest evidence with the aicr CLI, a "Recipe Evidence: Verify" workflow that builds aicr, runs the script, and uploads a report artifact, and a "Recipe Evidence: Comment" workflow that downloads the report and creates/updates a sticky PR comment with the Markdown results.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/aicr#1055: The evidence gate script calls aicr evidence digest -r <overlay> to compute and compare digests against verified predicates, which depends on the aicr evidence digest subcommand.

Suggested labels

size/L

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Linked Issues check ❓ Inconclusive The PR substantially addresses issue #751 by implementing the warning-only CI gate, discovery rules for affected overlays, and sticky PR comment posting as specified. However, some acceptance criteria like PR template and CONTRIBUTING.md updates are not implemented. Clarify whether the PR template and documentation updates are deferred to follow-up issues or should be included; if deferred, track them as separate deliverables.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'ci(evidence): warning-only recipe-evidence drift gate for PRs' clearly and concisely describes the main change: adding a CI gate for recipe evidence verification.
Description check ✅ Passed The description thoroughly explains the PR's purpose, implementation, motivation, testing, and risk assessment, all directly related to the changeset.
Out of Scope Changes check ✅ Passed All three files added (recipe-evidence-check.sh, recipe-evidence.yaml, recipe-evidence-comment.yaml) are directly scoped to implementing the CI evidence gate specified in issue #751.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/scripts/recipe-evidence-check.sh:
- Around line 96-105: The while loop that walks spec.base can hang on cyclic
chains because it has no termination guard; update the traversal (the loop using
current, parent, changed_files and setting include) to detect cycles by tracking
visited nodes (e.g., a visited set of parent names) or by enforcing a max-depth
counter and breaking if exceeded, and apply the same defensive check to the
other ancestor walk later in the script (the second while that similarly updates
current/parent/include) so both traversals cannot loop indefinitely.

In @.github/workflows/recipe-evidence-comment.yaml:
- Around line 83-90: The current logic uses head -1 to pick pr_number from pulls
which can silently choose the wrong PR; modify the selection to collect all
matching PR numbers (from the jq expression that filters by HEAD_OWNER), count
them, and only set pr_number when there is exactly one match; if count != 1,
emit the existing warning (using HEAD_SHA and HEAD_OWNER), set "skip=true" to
GITHUB_OUTPUT and exit 0. Update references to pr_number/pr matches accordingly
so subsequent posting only happens when a single PR was resolved.

In @.github/workflows/recipe-evidence.yaml:
- Around line 61-91: The workflow recipe-evidence.yaml must explicitly install
and pin yq and jq before running recipe-evidence-check.sh: add steps prior to
the "Run evidence check" step that install a specific yq and jq release (or call
a pinned composite action) so the script's invocations of yq and jq are
deterministic; ensure the step names are clear (e.g., "Install yq" and "Install
jq" or a single "Setup yq and jq" step), verify the tools are on PATH, and fail
the job if installation fails so recipe-evidence-check.sh won't silently proceed
without yq/jq available.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 61efe81e-c0aa-4407-9cda-ceb082499bd6

📥 Commits

Reviewing files that changed from the base of the PR and between 7adc586 and 55a1a9b.

📒 Files selected for processing (3)
  • .github/scripts/recipe-evidence-check.sh
  • .github/workflows/recipe-evidence-comment.yaml
  • .github/workflows/recipe-evidence.yaml

Comment thread .github/scripts/recipe-evidence-check.sh
Comment thread .github/workflows/recipe-evidence-comment.yaml Outdated
Comment thread .github/workflows/recipe-evidence.yaml
@njhensley njhensley linked an issue May 27, 2026 that may be closed by this pull request

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shape of the two-workflow split mirrors fern-docs-preview-* cleanly and the trusted-source PR re-derivation (commits//pulls + head-repo owner cross-check) is the right defense against the workflow_run trust boundary. Known limitations are honestly enumerated in the script header.

Four non-blocking inline comments: a substring-vs-line-equality issue in the three grep -qF rules (false positives on .bak/.orig/etc.), missing wall-clock timeout on aicr evidence verify/digest (PR-controlled pointer URL can hang to the job cap), and two nits — double-marker on truncation and 2>/dev/null swallowing stderr in the digest/verify warning paths.

Gate is warning-only and CI is green; nothing here blocks merge.

Comment thread .github/scripts/recipe-evidence-check.sh Outdated
Comment thread .github/scripts/recipe-evidence-check.sh Outdated
Comment thread .github/workflows/recipe-evidence-comment.yaml Outdated
Comment thread .github/scripts/recipe-evidence-check.sh Outdated
@mchmarny mchmarny assigned mchmarny and unassigned njhensley May 28, 2026
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Actionable comments posted: 0

@njhensley njhensley self-assigned this May 28, 2026
@mchmarny mchmarny enabled auto-merge (squash) May 28, 2026 20:03
@mchmarny mchmarny assigned njhensley and unassigned njhensley and mchmarny May 28, 2026
@mchmarny mchmarny merged commit 652f125 into NVIDIA:main May 28, 2026
31 checks passed
@njhensley njhensley deleted the ci/recipe-evidence-gate branch May 28, 2026 20:21
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 3, 2026
…S inference

Commit signed recipe-evidence v1 pointers generated from live conformance
validation on two H100 inference clusters, to exercise the warning-only
recipe-evidence drift gate (NVIDIA#1065).

- h100-gke-cos-inference-dynamo  → ghcr.io/nvidia/aicr-evidence@sha256:6a59465f… (Rekor 1706764289)
- h100-eks-ubuntu-inference-dynamo → ghcr.io/nvidia/aicr-evidence@sha256:da9d8838… (Rekor 1706788485)

Both bundles built via 'aicr validate --phase conformance --emit-attestation'
on aicr-demo5 (GKE/COS) and aicr3 (EKS/Ubuntu), signed + pushed via
'aicr evidence publish' (Sigstore keyless).

Refs NVIDIA#1160
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Recipe contribution workflow with CI evidence gate

2 participants