Skip to content

fix(evidence,validator): correct catalog apiVersion examples, drop unverifiable demo evidence (#1499 follow-up)#1507

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/1499-evidence-validator-followup
Jun 27, 2026
Merged

fix(evidence,validator): correct catalog apiVersion examples, drop unverifiable demo evidence (#1499 follow-up)#1507
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/1499-evidence-validator-followup

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Follow-up to #1499 (aicr.run domain migration). Fixes two defects that PR's domain sweep introduced: ValidatorCatalog doc examples pointed at an invalid apiVersion, and two committed demo evidence pointers became unverifiable.

Motivation / Context

#1499 mechanically swept aicr.nvidia.comaicr.run. A cross-review surfaced that the sweep touched two surfaces it shouldn't have:

F3 — ValidatorCatalog examples (invalid apiVersion). The validator catalog is a separate schema and is explicitly NOT part of the artifact-domain migration. Its loader fail-closes on anything other than validator.nvidia.com/v1alpha1 (pkg/validator/v1/catalog.go CatalogAPIVersion; rejected in pkg/validator/catalog/catalog.go). #1499 swept four example blocks aicr.nvidia.com/v1aicr.run/v1 — still invalid — so an integrator copying any of them hits unsupported apiVersion "aicr.run/v1" at load. Fixed all four to validator.nvidia.com/v1alpha1:

  • docs/integrator/validator-extension.md:52, :150
  • recipes/validators/README.md:8
  • docs/design/002-validatorv2-adr.md:70

Also aligned the three ValidatorCatalog test fixtures in pkg/recipe/provider_test.go to validator.nvidia.com/v1alpha1 (they passed before only because the layered provider merges external catalogs under the embedded header; this keeps test data honest).

F2 — unverifiable demo evidence pointers. #1499 edited predicateType in place to https://aicr.run/recipe-evidence/v1 in the two committed demo pointers, but bundle.digest/oci still pin the pre-migration, immutable OCI artifacts whose signed in-toto statements carry the old predicate type. aicr evidence verify materializes the bundle, then fails with unexpected predicateType https://aicr.nvidia.com/recipe-evidence/v1 (pkg/evidence/verifier/verify.go / signature.go compare the fetched statement, not the YAML). A YAML-only fix is impossible in both directions: legacy fails LoadAndValidatePointer (pointer.go), and aicr.run fails the pulled-bundle check. The only true fix — regenerate + re-sign + republish the bundles — needs signing keys and OCI push, so this PR instead removes the two stale pointers (they also pinned a personal fork org, ghcr.io/yuanchen8911/aicr-evidence, not a canonical NVIDIA source) and the now-orphaned community allowlist anchor that existed solely to back them.

Two user docs (docs/user/cli-reference.md, docs/user/artifact-verification.md) showed the now-deleted concrete pointer path as the preferred aicr evidence verify example; switched both to the generic recipes/evidence/<recipe>/<src>/<digest>.yaml form.

Regenerating proper signed demo evidence under the NVIDIA org is a separate follow-up (it is the only path that restores verifiable evidence; editing strings cannot).

Fixes: N/A (follow-up to #1499)
Related: #1499

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update

Component(s) Affected

  • Validator (pkg/validator)
  • Docs/examples (docs/, examples/)
  • Other: pkg/evidence/allowlist, recipes/evidence/

Implementation Notes

  • The two evidence pointers were the only ones anchored by the community source slug 7c4c0edc…, so the community allowlist entry is removed too (→ community: []); the blocking evidence-pointer-contract gate enforces pointer→allowlist (not the reverse), so an empty community list with no community pointers is consistent.
  • pkg/evidence/allowlist/allowlist_test.go TestClassify loads the real committed allowlist.yaml; its "community by slug" case is updated to assert that a community-issuer signer is now admitted as "reported" only (no community entries). This is a fixture update to match the intentional allowlist change, not a disabled test.
  • Verified outside the repo while reviewing: https://get.aicr.run resolves and serves the install script (HTTP 200). The https://aicr.run/... predicate/buildType URIs are in-toto/SLSA type identifiers (compared as strings, never dereferenced), so the apex not resolving is expected and not a defect.

Testing

go test ./pkg/evidence/allowlist/... ./pkg/evidence/verifier/... ./pkg/validator/catalog/... ./tools/evidence-project/...   # ok
golangci-lint run -c .golangci.yaml ./pkg/evidence/allowlist/...                                                          # 0 issues

Scope is docs + sample-evidence removal + one test fixture; no production Go logic changed. Targeted package tests and lint pass locally; full make qualify (e2e/scan/cluster) runs in CI.

Risk Assessment

  • Low — Doc corrections plus removal of unverifiable sample evidence; no production code paths affected. Easy to revert.

Rollout notes: Demo evidence for h100-gke-cos-training / gb200-eks-ubuntu-training is removed until regenerated under the NVIDIA org (tracked follow-up). No runtime/API impact.

Checklist

  • Tests pass locally (targeted packages with -race-equivalent unit runs)
  • Linter passes (golangci-lint on the changed package: 0 issues)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911 yuanchen8911 requested review from a team as code owners June 27, 2026 01:02
@yuanchen8911 yuanchen8911 added theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification theme/validation Constraint evaluation, health checks, and conformance evidence labels Jun 27, 2026
@github-actions

Copy link
Copy Markdown
Contributor

@github-actions

Copy link
Copy Markdown
Contributor

Recipe evidence check

Protected recipes

Recipes with committed evidence (recipes/evidence/<slug>/<source>/<digest>.yaml) that this PR affects: 2

Recipe Source Pointer Verify Digest match
gb200-eks-ubuntu-training ⚠️ evidence pointer removed
h100-gke-cos-training ⚠️ evidence pointer removed

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
# Copy to the per-source path printed in the emit 'copyTo' hint:
#   recipes/evidence/<slug>/<source>/<bundle-digest>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 335db358-2a68-4dee-b928-9895d3bfded3

📥 Commits

Reviewing files that changed from the base of the PR and between 168385a and 2098984.

📒 Files selected for processing (10)
  • docs/design/002-validatorv2-adr.md
  • docs/integrator/validator-extension.md
  • docs/user/artifact-verification.md
  • docs/user/cli-reference.md
  • pkg/evidence/allowlist/allowlist_test.go
  • pkg/recipe/provider_test.go
  • recipes/evidence/allowlist.yaml
  • recipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yaml
  • recipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yaml
  • recipes/validators/README.md
💤 Files with no reviewable changes (2)
  • recipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yaml
  • recipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yaml

📝 Walkthrough

Walkthrough

The PR updates ValidatorCatalog apiVersion values in the design doc, integrator guide, validators README, and provider test fixtures to validator.nvidia.com/v1alpha1. It also changes evidence verification examples to a generic pointer path, empties the community allowlist, updates the allowlist test for the no-community-entry case, and removes two evidence YAML files.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

area/validator, area/tests

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main fixes: validator apiVersion examples and removal of unverifiable demo evidence.
Description check ✅ Passed The description directly matches the changeset and explains the validator docs fixes and evidence-pointer removals.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/evidence/allowlist/allowlist_test.go`:
- Around line 81-87: The allowlist test case in allowlist_test.go still uses the
removed contributor’s email in the fixture despite community being empty, so
replace the identity in that test with a synthetic value. Update the affected
case in the allowlist test table near the community-issuer scenario so it no
longer references the prior demo signer’s personal/stale data, while keeping the
test’s behavior and expectations unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 68b718cd-2d36-4010-8c5d-c42119793dc8

📥 Commits

Reviewing files that changed from the base of the PR and between 551bc44 and a902ae8.

📒 Files selected for processing (7)
  • docs/design/002-validatorv2-adr.md
  • docs/integrator/validator-extension.md
  • pkg/evidence/allowlist/allowlist_test.go
  • recipes/evidence/allowlist.yaml
  • recipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yaml
  • recipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yaml
  • recipes/validators/README.md
💤 Files with no reviewable changes (2)
  • recipes/evidence/gb200-eks-ubuntu-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-8274b6a1da24aa9782dc12162bf6a38265c30a852585ca64cfad5718efbbdec3.yaml
  • recipes/evidence/h100-gke-cos-training/7c4c0edc8c765a95a0f3afdb3bbb8e91/sha256-33d4cf36622ead990c43a596f6f53c62b87d9fa4708f59b7e3f356f215e54317.yaml

Comment thread pkg/evidence/allowlist/allowlist_test.go
@yuanchen8911 yuanchen8911 force-pushed the fix/1499-evidence-validator-followup branch from a902ae8 to 168385a Compare June 27, 2026 01:42
@yuanchen8911 yuanchen8911 requested a review from mchmarny June 27, 2026 02:16
… unverifiable demo evidence (NVIDIA#1499 follow-up)

Follow-up to NVIDIA#1499 (aicr.run domain migration). That PR's sweep introduced
two defects:

F3 (docs): ValidatorCatalog examples were swept aicr.nvidia.com/v1 ->
aicr.run/v1, but the validator catalog is a separate schema that is NOT
part of the domain migration. The loader (pkg/validator/catalog) fail-closes
on anything other than validator.nvidia.com/v1alpha1, so a copied example
errors at load. Fix all four user-facing occurrences, and align the
ValidatorCatalog fixtures in pkg/recipe/provider_test.go to match.

F2 (evidence): the two committed demo evidence pointers had predicateType
edited in place to aicr.run, but still pin the pre-migration, immutable OCI
artifacts whose signed in-toto statements carry the old predicate type.
'aicr evidence verify' materializes the bundle then fails with
'unexpected predicateType https://aicr.nvidia.com/recipe-evidence/v1'
(verify.go / signature.go compare the fetched statement). A YAML-only fix is
impossible in either direction; the only true fix is regenerate+republish
the signed bundles. Since that needs signing keys + OCI push, remove the two
stale pointers (they also pinned a personal fork org, ghcr.io/yuanchen8911)
and the now-orphaned community allowlist anchor, and switch the two docs that
showed the deleted pointer as the preferred 'aicr evidence verify' example to
the generic recipes/evidence/<recipe>/<src>/<digest>.yaml form. Regenerating
proper signed demo evidence under the NVIDIA org is a separate follow-up.

Updates the allowlist TestClassify fixture case accordingly.

Signed-off-by: Yuan Chen <[email protected]>
@yuanchen8911 yuanchen8911 force-pushed the fix/1499-evidence-validator-followup branch from 168385a to 2098984 Compare June 27, 2026 02:19

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified both fixes against main. This appears to be a domain revert: this PR does not reset aicr.run back to aicr.nvidia.com. Every added line is either validator.nvidia.com/v1alpha1 or a generic pointer placeholder.

F3 (validator apiVersion) — correct, approve. validator.nvidia.com/v1alpha1 is a Kubernetes-style API group, not the AICR service domain, and it's the value the catalog loader fail-closes on (pkg/validator/v1/catalog.go:21, enforced at pkg/validator/catalog/catalog.go:249). #1499 only mis-swept the doc/test examples to aicr.run/v1 — the CatalogAPIVersion constant itself was never changed — so copy-pasting the current docs hits unsupported apiVersion "aicr.run/v1" at load. This just realigns the examples to the unchanged code. Good catch.

F2 (evidence removal) — diagnosis correct, one question on the remedy (non-blocking). The signed OCI statements are immutable and predate the migration, so they can't be reconciled with a YAML edit in either direction. Deletion is the right immediate fix; see the inline note on whether we can regenerate under a canonical NVIDIA org in this PR rather than landing with community: [] + a follow-up.

Not blocking merge. CI is green and the recipe-evidence gate is warning-only.

# the slug below. None onboarded yet — the prior entry anchored the
# demo evidence removed in this PR (see #1499 follow-up); re-add alongside
# regenerated, signed pointers.
community: []

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deletion is the correct immediate fix — these can't verify post-migration in either direction (signed OCI statement is immutable and carries the pre-migration predicate type; reverting the YAML to the old domain fails pointer.go's PredicateTypeV1 check instead).

But the predicate type itself already migrated to aicr.run (PredicateTypeV1 = "https://" + header.Domain + "/recipe-evidence/v1"), so regenerating + re-signing produces statements that verify cleanly. Before merging with community: []: can we regenerate h100-gke-cos-training / gb200-eks-ubuntu-training under a canonical NVIDIA org (ghcr.io/NVIDIA/...) in this same PR, rather than landing with zero demo evidence and a follow-up? If that's blocked on signing-key / OCI-push access, delete-now + tracked follow-up is acceptable — just want to confirm we're not leaving the evidence surface empty longer than necessary. Either way, dropping the personal-fork pointer (ghcr.io/yuanchen8911/...) is right.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the follow-up #1509 (it landed as a separate PR only because #1507 merged before regeneration finished).

Both recipes were re-validated on live clusters, signed keyless, and pushed to ghcr.io/nvidia/aicr-evidence:

  • h100-gke-cos-training — ghcr.io/nvidia/aicr-evidence:h100-gke-cos-training-3619d2b8a5ba (Rekor 1981934662), NCCL busbw 338 GB/s
  • gb200-eks-ubuntu-training — ghcr.io/nvidia/aicr-evidence:gb200-eks-ubuntu-training-788a1bf8955a (Rekor 1982171931), NCCL nvls 841 GB/s

Both carry predicateType https://aicr.run/recipe-evidence/v1 and verify clean (aicr evidence verify -> exit 0: signature, predicate, manifest-hash). The signer is unchanged, so #1509 also restores the community allowlist entry (source 7c4c0edc...) removed here.

Note: the earlier NCCL performance failures were pure GPU-capacity contention (a standing slinky-slurm NodeSet held all GPUs/IMEX), not an IMEX/DRA problem — freeing it let the all-reduce workers schedule.

@mchmarny mchmarny merged commit 1391d84 into NVIDIA:main Jun 27, 2026
160 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docs area/recipes size/M theme/supply-chain SLSA, SBOM, Sigstore, and provenance verification theme/validation Constraint evaluation, health checks, and conformance evidence

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants