fix(ci): safe manifest publishing #586

Merged
mchmarny merged 4 commits into NVIDIA:main from njhensley:fix/safe-manifest-publishing
Apr 15, 2026

@njhensley (Member)

Summary

Replace docker manifest create --amend with docker buildx imagetools create for mutable aliases (:latest, :edge) and add post-publish verification that all expected platforms are present.

Motivation / Context

--amend preserves existing manifest-list state instead of replacing it cleanly, which can leak stale or unknown/unknown descriptors into mutable aliases. This caused the :latest tag to either not exist or contain a corrupted manifest list, resulting in ErrImagePull for all deployment-phase validator checks.

Fixes: #525
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: CI workflows (on-tag.yaml, on-push.yaml)

Implementation Notes

  • docker buildx imagetools create atomically replaces the manifest list from the specified source images — no --amend semantics, no stale state carried forward.
  • :edge alias now sources from sha-<commit>-{arch} images instead of edge-{arch}, ensuring both tags derive from the same build artifacts deterministically.
  • Verification step fails the pipeline if linux/amd64 or linux/arm64 is missing, or if any unknown/unknown descriptor is present. Because buildx provenance attestations appear as unknown/unknown entries in the manifest list, this also catches future regressions if provenance: false is removed from the build steps without updating the check.
  • VALIDATOR_PHASES and EXPECTED_PLATFORMS hoisted to workflow-level env vars with cross-reference comments to the build-docker matrix.
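
The alias publishing described in the notes above could be sketched in shell roughly as follows. This is a minimal sketch under stated assumptions, not the workflow's actual step: the REGISTRY path, the ARCHES list, the DOCKER override, and the publish_alias function name are hypothetical, and the phase names are borrowed from the test images listed under Testing. Only the use of docker buildx imagetools create with --tag and a list of per-arch source images comes from the PR.

```shell
# Hypothetical values for illustration; the real workflows define their own.
REGISTRY="${REGISTRY:-ghcr.io/example/aicr-validators}"
VALIDATOR_PHASES="${VALIDATOR_PHASES:-deployment performance conformance}"
ARCHES="${ARCHES:-amd64 arm64}"
DOCKER="${DOCKER:-docker}"   # assumption: set DOCKER=echo to print commands only

# publish_alias ALIAS COMMIT
# Rebuilds REGISTRY/<phase>:ALIAS from the sha-<commit>-<arch> images.
# The body runs in a subshell so loop variables do not leak into the caller.
publish_alias() (
  alias_tag="$1"
  commit="$2"
  for phase in $VALIDATOR_PHASES; do
    set --   # reuse the positional parameters as the list of source images
    for arch in $ARCHES; do
      set -- "$@" "${REGISTRY}/${phase}:sha-${commit}-${arch}"
    done
    # The new manifest list is built from exactly these sources,
    # so nothing from the alias's previous state is carried forward.
    "$DOCKER" buildx imagetools create \
      --tag "${REGISTRY}/${phase}:${alias_tag}" "$@" || exit 1
  done
)
```

Calling publish_alias edge followed by a commit SHA would cover the on-push case, and publish_alias latest the on-tag case. Because both aliases are built from the same sha-tagged sources, they point at the same build artifacts.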

Testing

# Triggered throwaway workflow on fork against feature branch
gh workflow run test-manifest.yaml --repo njhensley/aicr --ref fix/safe-manifest-publishing
# Run: https://github.com/njhensley/aicr/actions/runs/24470785341 — passed

# Verified manifests locally
docker manifest inspect ghcr.io/njhensley/aicr-validators/deployment:test
docker manifest inspect ghcr.io/njhensley/aicr-validators/performance:test
docker manifest inspect ghcr.io/njhensley/aicr-validators/conformance:test
# All three: exactly 2 manifests (linux/amd64, linux/arm64), no unknown/unknown
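
The manual inspection above can be scripted along the lines of the post-publish verification step. The following is a sketch, not the code from the workflow: the function names verify_platforms and list_platforms are hypothetical, the default for EXPECTED_PLATFORMS is assumed from the two platforms named in this PR, and list_platforms assumes jq is installed alongside docker.

```shell
# Assumed default; the workflow defines EXPECTED_PLATFORMS itself.
EXPECTED_PLATFORMS="${EXPECTED_PLATFORMS:-linux/amd64 linux/arm64}"

# verify_platforms reads one "os/arch" entry per line on stdin and fails if
# an expected platform is missing or an unknown/unknown entry is present.
verify_platforms() (
  found="$(cat)"
  status=0
  for platform in $EXPECTED_PLATFORMS; do
    if ! printf '%s\n' "$found" | grep -qxF "$platform"; then
      echo "missing platform: $platform" >&2
      status=1
    fi
  done
  if printf '%s\n' "$found" | grep -qxF "unknown/unknown"; then
    echo "unexpected unknown/unknown descriptor" >&2
    status=1
  fi
  exit "$status"
)

# list_platforms IMAGE prints the platforms of a manifest list, one per line.
# Requires docker and jq.
list_platforms() {
  docker manifest inspect "$1" |
    jq -r '.manifests[] | "\(.platform.os)/\(.platform.architecture)"'
}
```

A check for one of the test images would then be list_platforms ghcr.io/njhensley/aicr-validators/deployment:test | verify_platforms. If the inspect call fails and produces no output, verify_platforms still exits non-zero because the expected platforms are missing. This sketch does not enforce an exact manifest count.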

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: No migration needed. Next push to main publishes :edge via the new path. Next tag release publishes :latest via the new path.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

njhensley and others added 3 commits April 15, 2026 11:12
Stop mutating latest/edge manifests in place. The --amend flag preserves
existing manifest-list state instead of replacing it cleanly, which can
leak stale or unknown/unknown descriptors into mutable aliases.

Switch both on-tag (latest) and on-push (edge) to docker buildx
imagetools create, which atomically replaces the manifest list from the
specified source images. Add a post-publish verification step that fails
unless both linux/amd64 and linux/arm64 are present and no
unknown/unknown descriptors exist.

Hoist VALIDATOR_PHASES and EXPECTED_PLATFORMS to workflow-level env vars
so loop parameters are obvious to update and cross-referenced with the
build-docker matrix.
@njhensley njhensley requested a review from a team as a code owner April 15, 2026 21:07
@mchmarny mchmarny added this to the v0.12 milestone Apr 15, 2026
@mchmarny mchmarny added the task label Apr 15, 2026
@mchmarny mchmarny enabled auto-merge (squash) April 15, 2026 21:13
@mchmarny mchmarny merged commit 352b006 into NVIDIA:main Apr 15, 2026
28 checks passed
@njhensley njhensley deleted the fix/safe-manifest-publishing branch April 15, 2026 21:24
yuanchen8911 pushed a commit to yuanchen8911/aicr that referenced this pull request Apr 16, 2026
Development

Successfully merging this pull request may close these issues.

Deployment validator image not published to ghcr.io