
fix(recipes): pin nvidia-dra-driver-gpu to 0.4.1-rc.1 for strict-YAML fix #1341

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:fix/dra-pin-0.4.1-rc.1
Jun 12, 2026

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Jun 12, 2026

Contributor

Summary

Pin nvidia-dra-driver-gpu from 0.4.0 to 0.4.1-rc.1 (same oci://registry.k8s.io/dra-driver-nvidia/charts chart) to fix a duplicate pod-template label key that produces invalid YAML for strict consumers.

Motivation / Context

The 0.4.0 chart emits the nvidia-dra-driver-gpu-component label twice in the pod template (metadata.labels) for both the controller Deployment and the kubelet-plugin DaemonSet when rendered with AICR's values. Plain Helm tolerates the duplicate, but strict-YAML consumers that post-process the rendered manifests (Argo CD, Flux, kustomize, yq) fail the install with mapping key "nvidia-dra-driver-gpu-component" already defined. This breaks GitOps deployments — notably Flux reconciliation.
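The failure mode described above can be reproduced without any cluster. The following is a minimal sketch, using only the Python standard library, of the kind of duplicate-key check a strict YAML consumer performs. It handles block-style mappings only, and the sample manifest is illustrative: the label values, image name, and surrounding fields are assumptions, not the chart's actual rendered output.

```python
import re

# Matches a block-style "key:" or "key: value" entry.
# Flow-style mappings, quoted keys containing ":", and block scalars are out of scope.
KEY_RE = re.compile(r"([^\s:#][^:]*?)\s*:(?:\s|$)")


def find_duplicate_keys(text):
    """Return (line_number, key) for every key repeated inside one block mapping."""
    duplicates = []
    stack = []  # one (indent, seen_keys) entry per currently open mapping
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" ")
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.startswith("---"):
            stack.clear()  # a new YAML document starts with a clean slate
            continue
        indent = len(line) - len(stripped)
        if stripped.startswith("- "):
            # A new sequence item closes mappings nested deeper than the dash.
            while stack and stack[-1][0] > indent:
                stack.pop()
            stripped = stripped[2:].lstrip(" ")
            indent = len(line) - len(stripped)
        match = KEY_RE.match(stripped)
        if match is None:
            continue
        key = match.group(1)
        while stack and stack[-1][0] > indent:
            stack.pop()
        if not stack or stack[-1][0] < indent:
            stack.append((indent, set()))
        seen = stack[-1][1]
        if key in seen:
            duplicates.append((line_no, key))
        seen.add(key)
    return duplicates


# Illustrative manifest; values are hypothetical, only the label key is from the PR.
RENDERED = """\
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-dra-driver-gpu
        nvidia-dra-driver-gpu-component: controller
        nvidia-dra-driver-gpu-component: controller
    spec:
      containers:
        - name: controller
          image: example.invalid/controller:0.4.0
"""

print(find_duplicate_keys(RENDERED))
# → [(9, 'nvidia-dra-driver-gpu-component')]
```

A tolerant parser would typically keep one of the two entries and carry on, which is why the duplicate goes unnoticed until a stricter tool reads the same manifest.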

Upstream 0.4.1-rc.1 collapses the label to a single key. This RC was cut specifically so that AICR can pin to it; the pin is temporary and will be bumped to the 0.4.1 GA release once it lands (scheduled for the week of 2026-06-29).

Fixes: #1289
Related: #1285 (the migration to registry.k8s.io v0.4.0 that introduced the regression); upstream kubernetes-sigs/dra-driver-nvidia-gpu#1184

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • Bumps the version pin in two canonical sites: recipes/registry.yaml (defaultVersion) and recipes/overlays/base.yaml (version), with an inline TEMPORARY comment at each explaining the RC rationale and the GA follow-up.
  • Regenerated docs/user/container-images.md via make bom-docs (chart version + image tag updated to v0.4.1-rc.1).
  • No source or values changes — the fix is entirely upstream in the chart.

Testing

make bom-docs
yamllint recipes/registry.yaml recipes/overlays/base.yaml   # clean
go test ./pkg/recipe/...                                    # ok
make qualify                                                # run before merge

End-to-end Flux-path verification using the exact spec.values AICR's Flux deployer injects (controller + kubeletPlugin nodeSelector/tolerations), rendering the OCI chart and running kustomize build:

  • 0.4.0: kustomize build FAILS with mapping key "nvidia-dra-driver-gpu-component" already defined
  • 0.4.1-rc.1: kustomize build PASSES; label appears once per pod template
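The A/B check above could be scripted roughly as follows. This is a sketch under stated assumptions, not the commands actually run for this PR: the chart name appended to the registry path, the release name, and the values file name are all hypothetical, and the script expects `helm` and `kustomize` on the PATH with network access to the registry.

```python
import subprocess
from pathlib import Path

CHART_REPO = "oci://registry.k8s.io/dra-driver-nvidia/charts"  # from the PR description
CHART_NAME = "nvidia-dra-driver-gpu"  # assumption: chart is named after the component
RELEASE = "nvidia-dra-driver-gpu"     # hypothetical release name
VALUES_FILE = "flux-values.yaml"      # hypothetical file holding the Flux spec.values

KUSTOMIZATION = (
    "apiVersion: kustomize.config.k8s.io/v1beta1\n"
    "kind: Kustomization\n"
    "resources:\n"
    "  - rendered.yaml\n"
)


def helm_template_cmd(version, values_file=VALUES_FILE):
    """Build the argv that renders the chart at a given version."""
    return [
        "helm", "template", RELEASE, f"{CHART_REPO}/{CHART_NAME}",
        "--version", version,
        "--values", values_file,
    ]


def kustomize_build_cmd(workdir):
    """Build the argv that runs the rendered manifests through kustomize."""
    return ["kustomize", "build", str(workdir)]


def strict_build_passes(version, workdir):
    """Render the chart, then report whether kustomize accepts the output."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    rendered = subprocess.run(
        helm_template_cmd(version), check=True, capture_output=True, text=True
    ).stdout
    (workdir / "rendered.yaml").write_text(rendered)
    (workdir / "kustomization.yaml").write_text(KUSTOMIZATION)
    result = subprocess.run(
        kustomize_build_cmd(workdir), capture_output=True, text=True
    )
    return result.returncode == 0
```

Calling `strict_build_passes("0.4.0", "out/0.4.0")` and `strict_build_passes("0.4.1-rc.1", "out/0.4.1-rc.1")` would then be expected to disagree, matching the two results listed above.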

Risk Assessment

  • Low — Isolated version-pin change, easy to revert; fix verified against the strict-YAML failure mode.

Rollout notes: Temporary RC pin. Follow-up PR will bump to 0.4.1 GA when released (~week of 2026-06-29).
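For readers less familiar with pre-release ordering, here is a small sketch of SemVer 2.0.0 precedence, showing that 0.4.1-rc.1 sorts between 0.4.0 and the eventual 0.4.1 GA. The helper names are hypothetical and this is not the comparison code used by Helm or AICR.

```python
import re

SEMVER_RE = re.compile(
    r"^v?(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def semver_key(version):
    """Return a sort key that follows SemVer 2.0.0 precedence rules."""
    match = SEMVER_RE.match(version)
    if match is None:
        raise ValueError(f"not a semantic version: {version!r}")
    core = tuple(int(part) for part in match.group(1, 2, 3))
    prerelease = match.group(4)
    if prerelease is None:
        # A release without a pre-release tag outranks any pre-release of the same core.
        return (core, 1, ())
    identifiers = tuple(
        # Numeric identifiers compare as integers and rank below alphanumeric ones.
        (0, int(part), "") if part.isdigit() else (1, 0, part)
        for part in prerelease.split(".")
    )
    return (core, 0, identifiers)


def is_prerelease(version):
    """True when the version carries a pre-release tag such as -rc.1."""
    return semver_key(version)[1] == 0


versions = ["0.4.1", "0.4.0", "0.4.1-rc.1"]
print(sorted(versions, key=semver_key))
# → ['0.4.0', '0.4.1-rc.1', '0.4.1']
```

A check like `is_prerelease` could also serve as a reminder in CI that a temporary pin is still in place, though nothing in this PR adds such a check.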

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Recipe evidence check

Broad impact: recipes/registry.yaml or recipes/overlays/base.yaml changed;
every leaf recipe is potentially affected. The list below covers all of them — each
one would ideally have refreshed evidence before merge.

Affected leaf overlays: 63

Recipe Pointer Verify Digest match
a100-aks-training ⚠️ missing
a100-aks-ubuntu-training-kubeflow ⚠️ missing
a100-aks-ubuntu-training ⚠️ missing
a100-eks-training ⚠️ missing
a100-eks-ubuntu-training-kubeflow ⚠️ missing
a100-eks-ubuntu-training ⚠️ missing
a100-gke-cos-training-kubeflow ⚠️ missing
a100-gke-cos-training ⚠️ missing
a100-oke-training ⚠️ missing
a100-oke-ubuntu-training-kubeflow ⚠️ missing
a100-oke-ubuntu-training ⚠️ missing
b200-gke-cos-inference-dynamo ⚠️ missing
b200-gke-cos-inference ⚠️ missing
b200-gke-cos-training-kubeflow ⚠️ missing
b200-gke-cos-training ⚠️ missing
gb200-eks-inference ⚠️ missing
gb200-eks-training ⚠️ missing
gb200-eks-ubuntu-inference-dynamo ⚠️ missing
gb200-eks-ubuntu-inference ⚠️ missing
gb200-eks-ubuntu-training-kubeflow ⚠️ missing
gb200-eks-ubuntu-training ⚠️ missing
gb200-oke-inference ⚠️ missing
gb200-oke-training ⚠️ missing
gb200-oke-ubuntu-inference-dynamo ⚠️ missing
gb200-oke-ubuntu-inference ⚠️ missing
gb200-oke-ubuntu-training-kubeflow ⚠️ missing
gb200-oke-ubuntu-training ⚠️ missing
h100-aks-inference ⚠️ missing
h100-aks-training ⚠️ missing
h100-aks-ubuntu-inference-dynamo ⚠️ missing
h100-aks-ubuntu-inference ⚠️ missing
h100-aks-ubuntu-training-kubeflow ⚠️ missing
h100-aks-ubuntu-training ⚠️ missing
h100-bcm-training ⚠️ missing
h100-bcm-ubuntu-training ⚠️ missing
h100-eks-inference ⚠️ missing
h100-eks-training ⚠️ missing
h100-eks-ubuntu-inference-dynamo ⚠️ missing
h100-eks-ubuntu-inference-nim ⚠️ missing
h100-eks-ubuntu-inference ⚠️ missing
h100-eks-ubuntu-training-kubeflow ⚠️ missing
h100-eks-ubuntu-training-slurm ⚠️ missing
h100-eks-ubuntu-training ⚠️ missing
h100-gke-cos-inference-dynamo ⚠️ missing
h100-gke-cos-inference ⚠️ missing
h100-gke-cos-training-kubeflow ⚠️ missing
h100-gke-cos-training-slurm ⚠️ missing
h100-gke-cos-training ⚠️ missing
h100-kind-inference-dynamo ⚠️ missing
h100-kind-inference ⚠️ missing
h100-kind-training-kubeflow ⚠️ missing
h100-kind-training-slurm ⚠️ missing
h100-kind-training ⚠️ missing
h200-eks-inference ⚠️ missing
h200-eks-training ⚠️ missing
rtx-pro-6000-eks-inference ⚠️ missing
rtx-pro-6000-eks-ubuntu-inference-dynamo ⚠️ missing
rtx-pro-6000-eks-ubuntu-inference-nim ⚠️ missing
rtx-pro-6000-eks-ubuntu-inference ⚠️ missing
rtx-pro-6000-lke-inference ⚠️ missing
rtx-pro-6000-lke-training ⚠️ missing
rtx-pro-6000-lke-ubuntu-inference ⚠️ missing
rtx-pro-6000-lke-ubuntu-training ⚠️ missing

How to refresh evidence

Run on a cluster matching the recipe's criteria:

aicr snapshot -o snapshot.yaml
aicr validate \
  -r recipes/overlays/<slug>.yaml \
  -s snapshot.yaml \
  --emit-attestation ./out \
  --push ghcr.io/<your-fork>/aicr-evidence
cp ./out/pointer.yaml recipes/evidence/<slug>.yaml

This gate is warning-only and never blocks merge. See ADR-007 for the trust model.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026


Walkthrough

This PR updates the nvidia-dra-driver-gpu Helm component version from 0.4.0 to 0.4.1-rc.1 across overlays, registry defaults, and example recipes, adds temporary comments explaining the RC pin (workaround for a pod-template label issue), and synchronizes container image documentation and an inline comment in the component values file.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/aicr#1285: Updates nvidia-dra-driver-gpu Helm component wiring and chart source; related prior work on the component's chart version.

Suggested labels

size/M

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: pinning nvidia-dra-driver-gpu to 0.4.1-rc.1 for a strict-YAML fix. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly relates to the changeset, detailing the pinning of nvidia-dra-driver-gpu from 0.4.0 to 0.4.1-rc.1 with context about the duplicate label issue and its fix. |


@yuanchen8911 yuanchen8911 marked this pull request as draft June 12, 2026 21:54
@yuanchen8911 yuanchen8911 force-pushed the fix/dra-pin-0.4.1-rc.1 branch from c20b58c to 823bdd9 Compare June 12, 2026 21:55
@yuanchen8911 yuanchen8911 marked this pull request as ready for review June 12, 2026 21:57
@mchmarny

Member

This also closes #1289, right?

@yuanchen8911 yuanchen8911 enabled auto-merge (squash) June 12, 2026 22:13
… fix

The 0.4.0 chart (oci://registry.k8s.io/dra-driver-nvidia/charts) emits a
duplicate nvidia-dra-driver-gpu-component pod-template label key for both the
controller Deployment and the kubelet-plugin DaemonSet when rendered with
AICR's values. Plain Helm tolerates the duplicate, but strict-YAML consumers
that post-process the rendered manifests (Argo CD, Flux, kustomize, yq) fail
the install with 'mapping key already defined'.

Pin to the upstream 0.4.1-rc.1 release, which collapses the label to a single
key. Verified with the exact values AICR's Flux deployer injects: kustomize
build fails on 0.4.0 and passes on 0.4.1-rc.1.

This RC pin is temporary; bump to the 0.4.1 GA release once it lands
(scheduled week of 2026-06-29).

Refs: kubernetes-sigs/dra-driver-nvidia-gpu#1184
@yuanchen8911 yuanchen8911 force-pushed the fix/dra-pin-0.4.1-rc.1 branch from ae2eb7f to 520fa96 Compare June 12, 2026 22:14
@yuanchen8911 yuanchen8911 merged commit e32375c into NVIDIA:main Jun 12, 2026
179 checks passed