
feat(evidence): regenerate signed demo evidence under aicr.run #1509

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/regen-demo-evidence-aicr-run
Jun 27, 2026

Conversation


@yuanchen8911 yuanchen8911 commented Jun 27, 2026

Contributor

Summary

Regenerate verifiable, signed demo evidence for the two training recipes under the canonical aicr.run predicate type, pushed to the NVIDIA-org registry. Follow-up to #1507, which removed the stale pre-migration pointers that could no longer verify.

Motivation / Context

Addresses @mchmarny's review note on #1507 (comment), which asked whether the evidence could be regenerated under a canonical NVIDIA org rather than landing with community: [] plus a follow-up.

#1499 migrated the artifact domain to aicr.run, which bumped PredicateTypeV1 to https://aicr.run/recipe-evidence/v1. The two committed demo evidence pointers had only their predicateType string edited in place, while still pinning the pre-migration, immutable signed OCI bundles — so aicr evidence verify failed on them. #1507 removed them (the correct immediate fix) and left regeneration as a follow-up. This PR is that regeneration.
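The failure mode above can be sketched in a few lines of Go. This is only an illustration: it assumes the signed bundle carries an in-toto style statement with a `predicateType` field, and `checkPredicateType` is a hypothetical helper, not the aicr implementation. The pre-migration predicate type is not quoted in this PR, so the sketch uses a placeholder for it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement holds the one field of an in-toto style statement this sketch needs.
type statement struct {
	PredicateType string `json:"predicateType"`
}

// checkPredicateType compares the predicate type recorded in a pointer with
// the one inside the signed statement. Hypothetical helper, not the aicr API.
func checkPredicateType(pointerType string, signedStatement []byte) error {
	var st statement
	if err := json.Unmarshal(signedStatement, &st); err != nil {
		return fmt.Errorf("predicate-parse: %w", err)
	}
	if st.PredicateType != pointerType {
		return fmt.Errorf("predicate type mismatch: pointer says %q, signed statement says %q",
			pointerType, st.PredicateType)
	}
	return nil
}

func main() {
	const current = "https://aicr.run/recipe-evidence/v1"
	// Placeholder for the pre-migration value, which this PR does not quote.
	const old = "https://example.invalid/recipe-evidence/v1"

	stale := []byte(`{"predicateType":"` + old + `"}`)
	fresh := []byte(`{"predicateType":"` + current + `"}`)

	// Editing only the pointer cannot change what was signed.
	fmt.Println("edited pointer vs old bundle:", checkPredicateType(current, stale))
	fmt.Println("regenerated pointer and bundle:", checkPredicateType(current, fresh))
}
```

Because the signed bundle is immutable, the only way to make both sides agree is to produce and sign a new bundle, which is what this PR does.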

Each bundle was produced by aicr validate --emit-attestation against a live cluster, then signed keyless (Sigstore) and pushed via aicr evidence publish --push ghcr.io/nvidia/aicr-evidence:

| Recipe | Cluster | NCCL result | OCI tag | Rekor index |
| --- | --- | --- | --- | --- |
| h100-gke-cos-training | GKE a3-megagpu-8g (H100) | busbw 338 GB/s ≥ 250 | …:h100-gke-cos-training-3619d2b8a5ba | 1981934662 |
| gb200-eks-ubuntu-training | EKS p6e-gb200 (GB200) | nvls 841 GB/s ≥ 500 | …:gb200-eks-ubuntu-training-788a1bf8955a | 1982171931 |

Both carry predicateType: https://aicr.run/recipe-evidence/v1 and verify clean (aicr evidence verify → exit 0: signature-verify, predicate-parse, and manifest-hash all pass).

The signer is unchanged from the original demo evidence (same 7c4c0edc… community source slug), so this PR also restores the community allowlist entry and the allowlist TestClassify "community by slug" case — both had been removed in #1507 when the evidence was deleted.
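For readers unfamiliar with the allowlist, a rough sketch of what "community by slug" classification might look like follows. The `Entry`, `Allowlist`, and `Classify` shapes are assumptions for illustration and are not copied from pkg/evidence/allowlist; only the `ClassCommunity` name and the issuer-plus-source-slug entry come from this PR. The issuer URL is assumed to be the Sigstore GitHub OAuth issuer, and the full source slug is the one reported by the recipe evidence check.

```go
package main

import "fmt"

// Class is the trust class assigned to a signer. ClassCommunity is the name
// used in the PR's test; the string value here is illustrative.
type Class string

const ClassCommunity Class = "community"

// Entry mirrors the two fields the allowlist entry is described as carrying.
// Hypothetical shape, not the real allowlist.yaml schema.
type Entry struct {
	Issuer string
	Source string
}

// Allowlist is a hypothetical in-memory form of recipes/evidence/allowlist.yaml.
type Allowlist struct {
	Community []Entry
}

// Classify reports ClassCommunity when both issuer and source slug match an entry.
func (a Allowlist) Classify(issuer, source string) (Class, bool) {
	for _, e := range a.Community {
		if e.Issuer == issuer && e.Source == source {
			return ClassCommunity, true
		}
	}
	return "", false
}

func main() {
	const issuer = "https://github.com/login/oauth" // assumed issuer URL
	const source = "7c4c0edc8c765a95a0f3afdb3bbb8e91"

	restored := Allowlist{Community: []Entry{{Issuer: issuer, Source: source}}}
	empty := Allowlist{} // the community: [] state left by #1507

	class, ok := restored.Classify(issuer, source)
	fmt.Println("restored allowlist:", class, ok)

	class, ok = empty.Classify(issuer, source)
	fmt.Println("empty allowlist:", class, ok)
}
```

With an empty community list the same signer is not classified, which is why the entry and the test case have to come back together with the pointers.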

Fixes: N/A (regeneration follow-up to #1507)
Related: #1507, #1499

Type of Change

Component(s) Affected

  • Other: recipes/evidence/ (signed pointers + signer allowlist), pkg/evidence/allowlist (test)

Implementation Notes

  • Both clusters' GPU nodes were fully occupied by a standing slinky-slurm NodeSet; freeing it (scale to 0 → validate → restore to 2) let the NCCL all-reduce workers schedule. No recipe/IMEX changes were needed — the earlier performance failures were pure capacity contention.
  • The snapshots were captured with the released agent image (pre-migration) and restamped aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2; the #1499 bump (feat: migrate API domain to aicr.run, bump v1alpha2/v1beta2) is signal-only (no schema change), so the restamp is equivalent to a post-migration agent's output.
  • Pointers live under the same recipes/evidence/<recipe>/7c4c0edc…/ dirs as the originals (signer identity unchanged); only the digest-named filenames differ.
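The pointer layout can be sketched as a small path builder. This is a hedged illustration: `pointerPath` is a hypothetical helper, and it assumes the bundle digest arrives as `sha256:<hex>` and is written to disk as `sha256-<hex>.yaml`, which matches the pointer names shown by the recipe evidence check but is not confirmed against the aicr source.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pointerPath builds recipes/evidence/<slug>/<source>/<digest>.yaml.
// Assumption: the digest is given as "<algo>:<hex>" and the filename
// replaces the colon with a dash. Hypothetical helper, not the aicr API.
func pointerPath(slug, source, digest string) (string, error) {
	algo, hex, ok := strings.Cut(digest, ":")
	if !ok || algo == "" || hex == "" {
		return "", fmt.Errorf("digest %q is not in <algo>:<hex> form", digest)
	}
	return path.Join("recipes", "evidence", slug, source, algo+"-"+hex+".yaml"), nil
}

func main() {
	// Source slug and digest as reported by the recipe evidence check.
	p, err := pointerPath(
		"h100-gke-cos-training",
		"7c4c0edc8c765a95a0f3afdb3bbb8e91",
		"sha256:f2573e7f2496cc895e6a780604645f7c24ed4d7e0edf4c4845c0d341a3a6326e",
	)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p)
}
```

Since the source directory is derived from the signer and the filename from the bundle digest, regenerating evidence with the same signer changes only the filename.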

Testing

go test ./pkg/evidence/allowlist/...                                  # ok
aicr evidence verify recipes/evidence/h100-gke-cos-training/...       # exit 0
aicr evidence verify recipes/evidence/gb200-eks-ubuntu-training/...   # exit 0

Both committed pointers verify end-to-end against their pushed OCI bundles (signature + predicate + manifest hash). Scope is evidence pointers + allowlist + one test; no production Go logic changed.

Risk Assessment

  • Low — Restores demo/sample evidence; no production code paths. The pointer-contract gate validates the committed pointers against the allowlist in CI.

Rollout notes: Demo evidence is restored to a verifiable state. No runtime/API impact.

Checklist

  • Tests pass locally (allowlist package)
  • I did not skip/disable tests to make CI green
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…A org)

Follow-up to NVIDIA#1507 (which removed the stale, unverifiable demo evidence
pointers after the aicr.run domain migration). This restores verifiable
demo evidence for both training recipes, regenerated and signed under the
canonical predicate type and pushed to the NVIDIA-org registry.

Each pointer was produced by 'aicr validate --emit-attestation' against a
live cluster, then signed keyless (Sigstore) and pushed via
'aicr evidence publish --push ghcr.io/nvidia/aicr-evidence':

- h100-gke-cos-training: validated on a GKE a3-megagpu-8g (H100) cluster;
  all phases pass (NCCL all-reduce busbw 338 GB/s >= 250 floor).
  OCI ghcr.io/nvidia/aicr-evidence:h100-gke-cos-training-3619d2b8a5ba,
  Rekor log index 1981934662.
- gb200-eks-ubuntu-training: validated on a p6e-gb200 EKS cluster; all
  phases pass (NCCL nvls 841 GB/s >= 500 floor).
  OCI ghcr.io/nvidia/aicr-evidence:gb200-eks-ubuntu-training-788a1bf8955a,
  Rekor log index 1982171931.

Both carry predicateType https://aicr.run/recipe-evidence/v1 and verify
clean ('aicr evidence verify' -> exit 0: signature, predicate, and
manifest-hash all pass). The signer is unchanged from the original demo
evidence, so the community allowlist entry (source 7c4c0edc...) and the
allowlist TestClassify 'community by slug' case are restored alongside the
pointers (both had been removed in NVIDIA#1507 when the evidence was deleted).

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 requested review from a team as code owners June 27, 2026 14:53
@yuanchen8911 yuanchen8911 added the theme/supply-chain (SLSA, SBOM, Sigstore, and provenance verification) and theme/validation (Constraint evaluation, health checks, and conformance evidence) labels Jun 27, 2026
@yuanchen8911 yuanchen8911 changed the title from "feat(evidence): regenerate signed demo evidence under aicr.run (NVIDIA org)" to "feat(evidence): regenerate signed demo evidence under aicr.run" Jun 27, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 27, 2026 14:53
@github-actions

Contributor

Recipe evidence check

Protected recipes

Recipes with committed evidence (recipes/evidence/<slug>/<source>/<digest>.yaml) that this PR affects: 2

| Recipe | Source | Pointer | Verify | Digest match |
| --- | --- | --- | --- | --- |
| gb200-eks-ubuntu-training | 7c4c0edc8c765a95a0f3afdb3bbb8e91 | sha256-93fac974407a873d5b6a52a72bafcaa18b019190545a23d03031680d6aabd2bc | ❌ invalid: registry-forbidden (HTTP 401): registry not accessible (make the fork's aicr-evidence package public, or provide registry credentials) | ⚠️ skipped (no signed digest) |
| h100-gke-cos-training | 7c4c0edc8c765a95a0f3afdb3bbb8e91 | sha256-f2573e7f2496cc895e6a780604645f7c24ed4d7e0edf4c4845c0d341a3a6326e | ❌ invalid: registry-forbidden (HTTP 401): registry not accessible (make the fork's aicr-evidence package public, or provide registry credentials) | ⚠️ skipped (no signed digest) |

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
# Copy to the per-source path printed in the emit 'copyTo' hint:
#   recipes/evidence/<slug>/<source>/<bundle-digest>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026


📝 Walkthrough


The community allowlist in recipes/evidence/allowlist.yaml is updated from an empty list to a single entry specifying a GitHub OAuth issuer and a source slug. Two attestation YAML files for gb200-eks-ubuntu-training and h100-gke-cos-training recipes are populated with schema version, recipe name, and attestation details including signer identity, bundle digests, OCI references, and Rekor log indices. The TestClassify test case is updated to assert successful ClassCommunity classification for the matching signer identity.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/aicr#1507: Inverse change to the same community allowlist and TestClassify test — removed the community entry and set wantOK: false for the same GitHub OAuth/slug identity.

Suggested labels

area/tests, size/M

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: regenerating signed demo evidence under aicr.run. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The description clearly matches the evidence regeneration, allowlist update, and test change in this PR. |

@yuanchen8911 yuanchen8911 merged commit 9be382b into NVIDIA:main Jun 27, 2026
177 of 180 checks passed

Labels

area/recipes, size/S, theme/supply-chain (SLSA, SBOM, Sigstore, and provenance verification), theme/validation (Constraint evaluation, health checks, and conformance evidence)


2 participants