fix(evidence,validator): correct catalog apiVersion examples, drop unverifiable demo evidence (#1499 follow-up)#1507
Conversation
Recipe evidence checkProtected recipesRecipes with committed evidence (
How to refresh evidenceRun on a cluster matching the recipe's aicr snapshot -o snapshot.yaml
aicr validate \
-r recipes/overlays/<slug>.yaml \
-s snapshot.yaml \
--emit-attestation ./out \
--push ghcr.io/<your-fork>/aicr-evidence
# Copy to the per-source path printed in the emit 'copyTo' hint:
# recipes/evidence/<slug>/<source>/<bundle-digest>.yamlThis gate is warning-only and never blocks merge. See ADR-007 for the trust model. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (10)
💤 Files with no reviewable changes (2)
📝 WalkthroughWalkthroughThe PR updates ValidatorCatalog apiVersion values in the design doc, integrator guide, validators README, and provider test fixtures to Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/evidence/allowlist/allowlist_test.go`:
- Around line 81-87: The allowlist test case in allowlist_test.go still uses the
removed contributor’s email in the fixture despite community being empty, so
replace the identity in that test with a synthetic value. Update the affected
case in the allowlist test table near the community-issuer scenario so it no
longer references the prior demo signer’s personal/stale data, while keeping the
test’s behavior and expectations unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 68b718cd-2d36-4010-8c5d-c42119793dc8
📒 Files selected for processing (7)
docs/design/002-validatorv2-adr.mddocs/integrator/validator-extension.mdpkg/evidence/allowlist/allowlist_test.gorecipes/evidence/allowlist.yamlrecipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yamlrecipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yamlrecipes/validators/README.md
💤 Files with no reviewable changes (2)
- recipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yaml
- recipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yaml
a902ae8 to
168385a
Compare
… unverifiable demo evidence (NVIDIA#1499 follow-up) Follow-up to NVIDIA#1499 (aicr.run domain migration). That PR's sweep introduced two defects: F3 (docs): ValidatorCatalog examples were swept aicr.nvidia.com/v1 -> aicr.run/v1, but the validator catalog is a separate schema that is NOT part of the domain migration. The loader (pkg/validator/catalog) fail-closes on anything other than validator.nvidia.com/v1alpha1, so a copied example errors at load. Fix all four user-facing occurrences, and align the ValidatorCatalog fixtures in pkg/recipe/provider_test.go to match. F2 (evidence): the two committed demo evidence pointers had predicateType edited in place to aicr.run, but still pin the pre-migration, immutable OCI artifacts whose signed in-toto statements carry the old predicate type. 'aicr evidence verify' materializes the bundle then fails with 'unexpected predicateType https://aicr.nvidia.com/recipe-evidence/v1' (verify.go / signature.go compare the fetched statement). A YAML-only fix is impossible in either direction; the only true fix is regenerate+republish the signed bundles. Since that needs signing keys + OCI push, remove the two stale pointers (they also pinned a personal fork org, ghcr.io/yuanchen8911) and the now-orphaned community allowlist anchor, and switch the two docs that showed the deleted pointer as the preferred 'aicr evidence verify' example to the generic recipes/evidence/<recipe>/<src>/<digest>.yaml form. Regenerating proper signed demo evidence under the NVIDIA org is a separate follow-up. Updates the allowlist TestClassify fixture case accordingly. Signed-off-by: Yuan Chen <[email protected]>
168385a to
2098984
Compare
There was a problem hiding this comment.
Verified both fixes against main. This appears to be a domain revert: this PR does not reset aicr.run back to aicr.nvidia.com. Every added line is either validator.nvidia.com/v1alpha1 or a generic pointer placeholder.
F3 (validator apiVersion) — correct, approve. validator.nvidia.com/v1alpha1 is a Kubernetes-style API group, not the AICR service domain, and it's the value the catalog loader fail-closes on (pkg/validator/v1/catalog.go:21, enforced at pkg/validator/catalog/catalog.go:249). #1499 only mis-swept the doc/test examples to aicr.run/v1 — the CatalogAPIVersion constant itself was never changed — so copy-pasting the current docs hits unsupported apiVersion "aicr.run/v1" at load. This just realigns the examples to the unchanged code. Good catch.
F2 (evidence removal) — diagnosis correct, one question on the remedy (non-blocking). The signed OCI statements are immutable and predate the migration, so they can't be reconciled with a YAML edit in either direction. Deletion is the right immediate fix; see the inline note on whether we can regenerate under a canonical NVIDIA org in this PR rather than landing with community: [] + a follow-up.
Not blocking merge. CI is green and the recipe-evidence gate is warning-only.
| # the slug below. None onboarded yet — the prior entry anchored the | ||
| # demo evidence removed in this PR (see #1499 follow-up); re-add alongside | ||
| # regenerated, signed pointers. | ||
| community: [] |
There was a problem hiding this comment.
Deletion is the correct immediate fix — these can't verify post-migration in either direction (signed OCI statement is immutable and carries the pre-migration predicate type; reverting the YAML to the old domain fails pointer.go's PredicateTypeV1 check instead).
But the predicate type itself already migrated to aicr.run (PredicateTypeV1 = "https://" + header.Domain + "/recipe-evidence/v1"), so regenerating + re-signing produces statements that verify cleanly. Before merging with community: []: can we regenerate h100-gke-cos-training / gb200-eks-ubuntu-training under a canonical NVIDIA org (ghcr.io/NVIDIA/...) in this same PR, rather than landing with zero demo evidence and a follow-up? If that's blocked on signing-key / OCI-push access, delete-now + tracked follow-up is acceptable — just want to confirm we're not leaving the evidence surface empty longer than necessary. Either way, dropping the personal-fork pointer (ghcr.io/yuanchen8911/...) is right.
There was a problem hiding this comment.
Done in the follow-up #1509 (it landed as a separate PR only because #1507 merged before regeneration finished).
Both recipes were re-validated on live clusters, signed keyless, and pushed to ghcr.io/nvidia/aicr-evidence:
- h100-gke-cos-training — ghcr.io/nvidia/aicr-evidence:h100-gke-cos-training-3619d2b8a5ba (Rekor 1981934662), NCCL busbw 338 GB/s
- gb200-eks-ubuntu-training — ghcr.io/nvidia/aicr-evidence:gb200-eks-ubuntu-training-788a1bf8955a (Rekor 1982171931), NCCL nvls 841 GB/s
Both carry predicateType https://aicr.run/recipe-evidence/v1 and verify clean (aicr evidence verify -> exit 0: signature, predicate, manifest-hash). The signer is unchanged, so #1509 also restores the community allowlist entry (source 7c4c0edc...) removed here.
Note: the earlier NCCL performance failures were pure GPU-capacity contention (a standing slinky-slurm NodeSet held all GPUs/IMEX), not an IMEX/DRA problem — freeing it let the all-reduce workers schedule.
Summary
Follow-up to #1499 (aicr.run domain migration). Fixes two defects that PR's domain sweep introduced: ValidatorCatalog doc examples pointed at an invalid apiVersion, and two committed demo evidence pointers became unverifiable.
Motivation / Context
#1499 mechanically swept
aicr.nvidia.com→aicr.run. A cross-review surfaced that the sweep touched two surfaces it shouldn't have:F3 — ValidatorCatalog examples (invalid apiVersion). The validator catalog is a separate schema and is explicitly NOT part of the artifact-domain migration. Its loader fail-closes on anything other than
validator.nvidia.com/v1alpha1(pkg/validator/v1/catalog.goCatalogAPIVersion; rejected inpkg/validator/catalog/catalog.go). #1499 swept four example blocksaicr.nvidia.com/v1→aicr.run/v1— still invalid — so an integrator copying any of them hitsunsupported apiVersion "aicr.run/v1"at load. Fixed all four tovalidator.nvidia.com/v1alpha1:docs/integrator/validator-extension.md:52,:150recipes/validators/README.md:8docs/design/002-validatorv2-adr.md:70Also aligned the three
ValidatorCatalogtest fixtures inpkg/recipe/provider_test.gotovalidator.nvidia.com/v1alpha1(they passed before only because the layered provider merges external catalogs under the embedded header; this keeps test data honest).F2 — unverifiable demo evidence pointers. #1499 edited
predicateTypein place tohttps://aicr.run/recipe-evidence/v1in the two committed demo pointers, butbundle.digest/ocistill pin the pre-migration, immutable OCI artifacts whose signed in-toto statements carry the old predicate type.aicr evidence verifymaterializes the bundle, then fails withunexpected predicateType https://aicr.nvidia.com/recipe-evidence/v1(pkg/evidence/verifier/verify.go/signature.gocompare the fetched statement, not the YAML). A YAML-only fix is impossible in both directions: legacy failsLoadAndValidatePointer(pointer.go), andaicr.runfails the pulled-bundle check. The only true fix — regenerate + re-sign + republish the bundles — needs signing keys and OCI push, so this PR instead removes the two stale pointers (they also pinned a personal fork org,ghcr.io/yuanchen8911/aicr-evidence, not a canonical NVIDIA source) and the now-orphanedcommunityallowlist anchor that existed solely to back them.Two user docs (
docs/user/cli-reference.md,docs/user/artifact-verification.md) showed the now-deleted concrete pointer path as the preferredaicr evidence verifyexample; switched both to the genericrecipes/evidence/<recipe>/<src>/<digest>.yamlform.Regenerating proper signed demo evidence under the NVIDIA org is a separate follow-up (it is the only path that restores verifiable evidence; editing strings cannot).
Fixes: N/A (follow-up to #1499)
Related: #1499
Type of Change
Component(s) Affected
pkg/validator)docs/,examples/)pkg/evidence/allowlist,recipes/evidence/Implementation Notes
communitysource slug7c4c0edc…, so thecommunityallowlist entry is removed too (→community: []); the blockingevidence-pointer-contractgate enforces pointer→allowlist (not the reverse), so an empty community list with no community pointers is consistent.pkg/evidence/allowlist/allowlist_test.goTestClassifyloads the real committedallowlist.yaml; its "community by slug" case is updated to assert that a community-issuer signer is now admitted as "reported" only (no community entries). This is a fixture update to match the intentional allowlist change, not a disabled test.https://get.aicr.runresolves and serves the install script (HTTP 200). Thehttps://aicr.run/...predicate/buildType URIs are in-toto/SLSA type identifiers (compared as strings, never dereferenced), so the apex not resolving is expected and not a defect.Testing
Scope is docs + sample-evidence removal + one test fixture; no production Go logic changed. Targeted package tests and lint pass locally; full
make qualify(e2e/scan/cluster) runs in CI.Risk Assessment
Rollout notes: Demo evidence for
h100-gke-cos-training/gb200-eks-ubuntu-trainingis removed until regenerated under the NVIDIA org (tracked follow-up). No runtime/API impact.Checklist
-race-equivalent unit runs)golangci-linton the changed package: 0 issues)git commit -S)