feat(evidence): regenerate signed demo evidence under aicr.run #1509
…A org)

Follow-up to NVIDIA#1507 (which removed the stale, unverifiable demo evidence pointers after the aicr.run domain migration). This restores verifiable demo evidence for both training recipes, regenerated and signed under the canonical predicate type and pushed to the NVIDIA-org registry.

Each pointer was produced by 'aicr validate --emit-attestation' against a live cluster, then signed keyless (Sigstore) and pushed via 'aicr evidence publish --push ghcr.io/nvidia/aicr-evidence':

- h100-gke-cos-training: validated on a GKE a3-megagpu-8g (H100) cluster; all phases pass (NCCL all-reduce busbw 338 GB/s >= 250 floor). OCI ghcr.io/nvidia/aicr-evidence:h100-gke-cos-training-3619d2b8a5ba, Rekor log index 1981934662.
- gb200-eks-ubuntu-training: validated on a p6e-gb200 EKS cluster; all phases pass (NCCL nvls 841 GB/s >= 500 floor). OCI ghcr.io/nvidia/aicr-evidence:gb200-eks-ubuntu-training-788a1bf8955a, Rekor log index 1982171931.

Both carry predicateType https://aicr.run/recipe-evidence/v1 and verify clean ('aicr evidence verify' -> exit 0: signature, predicate, and manifest-hash all pass).

The signer is unchanged from the original demo evidence, so the community allowlist entry (source 7c4c0edc...) and the allowlist TestClassify 'community by slug' case are restored alongside the pointers (both had been removed in NVIDIA#1507 when the evidence was deleted).

Signed-off-by: Yuan Chen <[email protected]>
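The pass criteria quoted in the commit message are floor comparisons (338 GB/s against a 250 floor, 841 GB/s against a 500 floor). A minimal sketch of that comparison follows. It assumes a hypothetical helper named `meets_floor`, which is not part of the aicr CLI and does not claim to mirror how the validator evaluates NCCL results.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of aicr: exits 0 when the measured bandwidth
# is at or above the floor, non-zero otherwise. awk is used so that
# fractional values compare numerically.
meets_floor() {
  local measured="$1" floor="$2"
  awk -v m="$measured" -v f="$floor" 'BEGIN { exit !((m + 0) >= (f + 0)) }'
}

# Figures taken from the commit message (GB/s).
meets_floor 338 250 && echo "h100-gke-cos-training: at or above floor"
meets_floor 841 500 && echo "gb200-eks-ubuntu-training: at or above floor"
```

A value below the floor makes the function return non-zero, so it can gate a script with `meets_floor "$bw" "$floor" || exit 1`.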
Recipe evidence check

Protected recipes
Recipes with committed evidence (

How to refresh evidence
Run on a cluster matching the recipe's

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
# Copy to the per-source path printed in the emit 'copyTo' hint:
# recipes/evidence/<slug>/<source>/<bundle-digest>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.
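The destination layout named in the comment above can be illustrated with a small helper. This is only a sketch: `evidence_pointer_path` and the placeholder arguments are hypothetical, and the authoritative path remains whatever the emit 'copyTo' hint prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the layout
#   recipes/evidence/<slug>/<source>/<bundle-digest>.yaml
# It only formats a path; it does not talk to aicr or the registry.
evidence_pointer_path() {
  local slug="$1" source_id="$2" bundle_digest="$3"
  printf 'recipes/evidence/%s/%s/%s.yaml\n' "$slug" "$source_id" "$bundle_digest"
}

# Placeholder source and digest values; substitute the ones from your own run.
evidence_pointer_path "h100-gke-cos-training" "example-source" "example-digest"
```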
Summary
Regenerate verifiable, signed demo evidence for the two training recipes under the canonical aicr.run predicate type, pushed to the NVIDIA-org registry. Follow-up to #1507, which removed the stale pre-migration pointers that could no longer verify.

Motivation / Context
Addresses @mchmarny's review note on #1507 (#1507 (comment)) asking whether the evidence could be regenerated under a canonical NVIDIA org rather than landing with community: [] + a follow-up. This PR is that regeneration.

#1499 migrated the artifact domain to aicr.run, which bumped PredicateTypeV1 to https://aicr.run/recipe-evidence/v1. The two committed demo evidence pointers had only their predicateType string edited in place while still pinning the pre-migration, immutable signed OCI bundles, so aicr evidence verify failed on them. #1507 removed them (the correct immediate fix) and left regeneration as a follow-up.

Each bundle was produced by aicr validate --emit-attestation against a live cluster, then signed keyless (Sigstore) and pushed via aicr evidence publish --push ghcr.io/nvidia/aicr-evidence:

- h100-gke-cos-training: …:h100-gke-cos-training-3619d2b8a5ba
- gb200-eks-ubuntu-training: …:gb200-eks-ubuntu-training-788a1bf8955a

Both carry predicateType: https://aicr.run/recipe-evidence/v1 and verify clean (aicr evidence verify → exit 0: signature-verify, predicate-parse, and manifest-hash all pass).

The signer is unchanged from the original demo evidence (same 7c4c0edc… community source slug), so this PR also restores the community allowlist entry and the allowlist TestClassify "community by slug" case. Both had been removed in #1507 when the evidence was deleted.

Fixes: N/A (regeneration follow-up to #1507)
Related: #1507, #1499
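The two pushed tags appear to follow a recipe-slug-plus-short-suffix pattern. Under that assumption (inferred only from the two tags in this PR, not from aicr documentation), a reference can be split with plain parameter expansion. `split_evidence_ref` is a hypothetical name used for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: splits an OCI reference of the assumed form
#   <repo>:<slug>-<suffix>
# into "repo slug suffix". The naming scheme is an inference, not a spec.
split_evidence_ref() {
  local ref="$1"
  local repo="${ref%:*}"     # everything before the last ':'
  local tag="${ref##*:}"     # everything after the last ':'
  local slug="${tag%-*}"     # tag without its final '-'-separated field
  local suffix="${tag##*-}"  # final '-'-separated field
  printf '%s %s %s\n' "$repo" "$slug" "$suffix"
}

# Full reference as given in the commit message.
split_evidence_ref "ghcr.io/nvidia/aicr-evidence:h100-gke-cos-training-3619d2b8a5ba"
```

The slug recovered this way matches the recipe name used in the recipes/evidence/ directory layout, which makes it easy to cross-check a pointer against its tag by eye.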
Type of Change
Component(s) Affected
recipes/evidence/ (signed pointers + signer allowlist), pkg/evidence/allowlist (test)

Implementation Notes
- … slinky-slurm NodeSet; freeing it (scale to 0 → validate → restore to 2) let the NCCL all-reduce workers schedule. No recipe/IMEX changes were needed; the earlier performance failures were pure capacity contention.
- … aicr.nvidia.com/v1alpha1 → aicr.run/v1alpha2; the #1499 bump (feat: migrate API domain to aicr.run, bump v1alpha2/v1beta2) is signal-only (no schema change), so the restamp is equivalent to a post-migration agent's output.
- … recipes/evidence/<recipe>/7c4c0edc…/ dirs as the originals (signer identity unchanged); only the digest-named filenames differ.

Testing
Both committed pointers verify end-to-end against their pushed OCI bundles (signature + predicate + manifest hash). Scope is evidence pointers + allowlist + one test; no production Go logic changed.
Risk Assessment
Rollout notes: Demo evidence is restored to a verifiable state. No runtime/API impact.