Deployment validator image not published to ghcr.io #525

Description

@sara4dev

Summary

The deployment validator image ghcr.io/nvidia/aicr-validators/deployment:latest has never been published to ghcr.io, so deployment-phase checks fail with ErrImagePull on every cluster.

Impact

All four deployment-phase checks fail:

  • operator-health
  • expected-resources
  • gpu-operator-version
  • check-nvidia-smi

The conformance validator image (ghcr.io/nvidia/aicr-validators/conformance:latest) is unaffected: it is published as a multi-arch manifest (amd64 + arm64) and pulls correctly.

Docker Manifest Inspection

deployment:latest — image does not exist:

$ docker manifest inspect ghcr.io/nvidia/aicr-validators/deployment:latest
no such manifest: ghcr.io/nvidia/aicr-validators/deployment:latest

conformance:latest — multi-arch manifest (amd64 + arm64), works correctly:

{
  "manifests": [
    { "platform": { "architecture": "amd64", "os": "linux" } },
    { "platform": { "architecture": "arm64", "os": "linux" } }
  ]
}
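
The two inspections above could be folded into one repeatable check, for example as a CI gate after the image publish step. The following is a minimal sketch, not part of AICR: the helper names (list_arches, has_arches, check_image) are hypothetical, and it assumes docker manifest inspect returns a manifest list with platform.architecture entries like the conformance output shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of AICR) for checking a published image.

# Print the unique architectures named in manifest JSON read from stdin.
list_arches() {
  grep -o '"architecture"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort -u
}

# Succeed only if every architecture given as an argument appears in the
# manifest JSON read from stdin.
has_arches() {
  found=$(list_arches)
  for arch in "$@"; do
    printf '%s\n' "$found" | grep -qx "$arch" || return 1
  done
}

# check_image IMAGE ARCH...
# Reports MISSING if the registry has no manifest for IMAGE, PARTIAL if a
# requested architecture is absent, and OK otherwise.
check_image() {
  image=$1
  shift
  manifest=$(docker manifest inspect "$image" 2>/dev/null) || {
    echo "MISSING  $image"
    return 1
  }
  if printf '%s' "$manifest" | has_arches "$@"; then
    echo "OK       $image"
  else
    echo "PARTIAL  $image"
    return 2
  fi
}
```

Usage would look like check_image ghcr.io/nvidia/aicr-validators/deployment:latest amd64 arm64. The JSON is parsed with grep and sed only to avoid a dependency on jq; a single-architecture image, whose plain manifest carries no platform entries, would be reported as PARTIAL.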

Reproduction

# On any cluster (arm64 or amd64):
aicr recipe --service eks --accelerator gb200 --os ubuntu --intent inference --platform dynamo -o recipe.yaml
aicr validate --recipe recipe.yaml

Pod event:

Failed to pull image "ghcr.io/nvidia/aicr-validators/deployment:latest":
  no such manifest: ghcr.io/nvidia/aicr-validators/deployment:latest

Expected Behavior

The deployment validator image should be built and published as a multi-arch manifest (amd64 + arm64), matching the conformance image pattern.

Environment

  • Cluster: EKS with GB200 (p6e-gb200.36xlarge), Kubernetes v1.34.4
  • Node OS: Ubuntu 24.04, kernel 6.14.0 (aarch64)
  • AICR version: built from source (main branch)
  • Image referenced in: recipes/validators/catalog.yaml
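
Since the missing image is referenced from recipes/validators/catalog.yaml, one way to catch this class of problem early is to resolve every image the catalog names against the registry. This is a rough sketch under stated assumptions: the layout of catalog.yaml is not shown in this issue, so the code simply scans the file for ghcr.io references with an optional tag, and the function names catalog_images and find_unpublished are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers (not part of AICR). The catalog layout is assumed
# unknown, so image references are found by pattern rather than by key.

# Print each unique ghcr.io image reference (with optional tag) found in
# the text read from stdin.
catalog_images() {
  grep -o 'ghcr\.io/[A-Za-z0-9._/-]*\(:[A-Za-z0-9._-]*\)\{0,1\}' | sort -u
}

# find_unpublished FILE
# Print every image referenced in FILE that the registry cannot resolve.
find_unpublished() {
  catalog_images < "$1" | while read -r image; do
    docker manifest inspect "$image" >/dev/null 2>&1 \
      || echo "unpublished: $image"
  done
}
```

Running find_unpublished recipes/validators/catalog.yaml from the repository root would be expected to list the deployment image until it is published. References pinned by digest (@sha256:...) are not handled by this pattern.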