feat(recipes): migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0#1285

Merged
mchmarny merged 7 commits into main from feat/dra-driver-registry-migration
Jun 9, 2026

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 9, 2026

Member

Summary

Migrate nvidia-dra-driver-gpu from the legacy NGC chart (helm.ngc.nvidia.com/nvidia, v25.12.0) to the upstream oci://registry.k8s.io/dra-driver-nvidia/charts chart (v0.4.0). The new release carries Rekor-backed keyless cosign signatures on both the image and Helm chart artifact.

Motivation / Context

The DRA driver project moved from NVIDIA/k8s-dra-driver-gpu to kubernetes-sigs/dra-driver-nvidia-gpu and now publishes to registry.k8s.io via kpromo. The legacy NGC artifact only had a key-based signature (no Fulcio cert, no Rekor entry) and could not be verified keylessly. The new artifact closes that gap on the AICR side; SLSA provenance and SBOM attestations are still missing upstream and tracked there.

Related: kubernetes-sigs/dra-driver-nvidia-gpu#1105
Tracks: #745 (provenance audit per component)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

All existing values in recipes/components/nvidia-dra-driver-gpu/values.yaml (fullnameOverride, nvidiaDriverRoot, gpuResourcesEnabledOverride, resources.gpus.enabled, controller.priorityClassName, kubeletPlugin.priorityClassName) are confirmed compatible with the v0.4.0 chart structure — no values file changes needed.

Verification of upstream keyless signing (confirmed today against the live artifacts):

cosign verify registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0 \
  --certificate-identity [email protected] \
  --certificate-oidc-issuer https://accounts.google.com
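
The summary says both the image and the Helm chart artifact carry keyless signatures, while the command above covers only the image. Below is a minimal sketch that builds the same `cosign verify` invocation for both references. The chart reference is an assumption derived from the OCI chart repository, chart name, and version mentioned in this PR, and `SIGNER_IDENTITY` is a placeholder to be replaced with the certificate identity from the command above.

```python
from typing import List

# Issuer taken from the command above.
OIDC_ISSUER = "https://accounts.google.com"


def cosign_verify_argv(ref: str, identity: str, issuer: str = OIDC_ISSUER) -> List[str]:
    """Build the argv for a keyless `cosign verify` of one OCI reference."""
    return [
        "cosign", "verify", ref,
        "--certificate-identity", identity,
        "--certificate-oidc-issuer", issuer,
    ]


# Image reference as given in the PR description.
IMAGE_REF = "registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0"
# Assumption: chart reference composed from the OCI chart repo, chart name, and version.
CHART_REF = "registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu:0.4.0"
# Placeholder: substitute the signer identity used in the command above.
SIGNER_IDENTITY = "SIGNER_IDENTITY_PLACEHOLDER"

commands = [cosign_verify_argv(ref, SIGNER_IDENTITY) for ref in (IMAGE_REF, CHART_REF)]
```

With cosign installed, each entry in `commands` could be passed to `subprocess.run(argv, check=True)`; a non-zero exit would then indicate a failed verification.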

Testing

unset GITLAB_TOKEN && make qualify

All tests, lint, e2e (22 chainsaw tests), and vulnerability scan pass on the rebased branch.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Chart version jump (calendar-versioned 25.12.0 → semver 0.4.0) reflects the upstream renumbering on the move to kubernetes-sigs; same chart, same runtime behavior. Revert is a single-commit revert of recipes/registry.yaml.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (N/A — declarative chart pin)
  • I updated docs if user-facing behavior changed (component-catalog, aks-gpu-setup, container-images BOM)
  • Changes follow existing patterns in the codebase (matches kueue, kai-scheduler, grove OCI chart entries)
  • Commits are cryptographically signed (git commit -S)

Project moved from NVIDIA/k8s-dra-driver-gpu to
kubernetes-sigs/dra-driver-nvidia-gpu. The new release at v0.4.0 is
published to registry.k8s.io via kpromo and carries Rekor-backed
keyless cosign signatures on both the image and Helm chart artifact —
closing the keyless-signing gap that the legacy NGC artifact did not
satisfy.

Tracks kubernetes-sigs/dra-driver-nvidia-gpu#1105.

- recipes/registry.yaml: chart source NGC -> oci://registry.k8s.io,
  version 25.12.0 -> 0.4.0
- docs/user/component-catalog.md, docs/integrator/aks-gpu-setup.md:
  update upstream repo links to kubernetes-sigs
- docs/user/container-images.md: regenerated via make bom-docs;
  s3c example block updated for the new image
@mchmarny mchmarny requested review from a team as code owners June 9, 2026 22:01
@mchmarny mchmarny self-assigned this Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b6ff2d6b-5d96-438a-b7bf-0fb5cccc021b

📥 Commits

Reviewing files that changed from the base of the PR and between ff70989 and a9c2529.

📒 Files selected for processing (4)
  • docs/user/container-images.md
  • examples/recipes/aks-training.yaml
  • examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml
  • recipes/components/nvidia-dra-driver-gpu/values.yaml

📝 Walkthrough

This PR updates documentation, container image references, Helm registry coordinates, overlay/example componentRefs, and Helm values for the nvidia-dra-driver-gpu component to point to the kubernetes-sigs dra-driver-nvidia project and registry.k8s.io. The Helm chart source and version are changed to [email protected], images are updated to registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0, component docs and integrator references now point to the kubernetes-sigs GitHub repo, and Helm values add nameOverride: nvidia-dra-driver-gpu.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

dependencies, documentation, enhancement, size/M

Suggested reviewers

  • xdu31
🚥 Pre-merge checks | ✅ 4 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: migrating the nvidia-dra-driver-gpu component to a newer upstream version (v0.4.0) hosted at registry.k8s.io. |
| Description check | ✅ Passed | The description provides comprehensive context about the migration, including motivation, testing, risk assessment, and implementation notes that directly relate to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions github-actions Bot added the size/S label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

Coverage Report ✅

| Metric | Value |
| --- | --- |
| Coverage | 76.4% |
| Threshold | 75% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.4%25-green)

No Go source files changed in this PR.

dims
dims previously approved these changes Jun 9, 2026
base.yaml hard-pins each helm component's source and version, which
overrides registry.yaml defaults at resolution time. The earlier
registry.yaml change covered new recipes but base.yaml still pinned
the legacy NGC URL + 25.12.0, producing the impossible hybrid
"dra-driver-nvidia-gpu @ 25.12.0 from helm.ngc.nvidia.com" at install
time. KWOK matrix caught this even though make qualify did not.
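
The override behavior described in this commit message can be sketched as follows. This is a simplified model, not the recipe engine's actual code: the `ChartRef` and `OverlayPin` types, their field names, and the `resolve` function are hypothetical, and only the chart name, sources, and versions come from this PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ChartRef:
    """Fully resolved chart coordinates (hypothetical type)."""
    chart: str
    source: str
    version: str


@dataclass(frozen=True)
class OverlayPin:
    """Explicit pins from an overlay; None means inherit the registry default."""
    chart: Optional[str] = None
    source: Optional[str] = None
    version: Optional[str] = None


def resolve(default: ChartRef, pin: OverlayPin) -> ChartRef:
    """Apply every explicitly pinned field on top of the registry default."""
    overrides = {k: v for k, v in vars(pin).items() if v is not None}
    return replace(default, **overrides)


# registry.yaml default after the migration.
registry_default = ChartRef(
    chart="dra-driver-nvidia-gpu",
    source="oci://registry.k8s.io/dra-driver-nvidia/charts",
    version="0.4.0",
)

# base.yaml before the fix: source and version still pinned to the legacy values.
stale_pin = OverlayPin(source="helm.ngc.nvidia.com/nvidia", version="25.12.0")
hybrid = resolve(registry_default, stale_pin)

# base.yaml after the fix: pins agree with the registry default.
fixed_pin = OverlayPin(
    source="oci://registry.k8s.io/dra-driver-nvidia/charts",
    version="0.4.0",
)
fixed = resolve(registry_default, fixed_pin)
```

Under this model, `hybrid` combines the new chart name with the old source and version, which is the tuple the KWOK matrix rejected, while `fixed` equals the registry default.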
Comment thread recipes/registry.yaml
Comment thread docs/user/container-images.md

@yuanchen8911 yuanchen8911 left a comment

Contributor


left a few comments

The v0.4.0 chart's _helpers.tpl defines `dra-driver-nvidia-gpu.name`
as `nameOverride || .Chart.Name`, and the controller Deployment +
kubelet-plugin DaemonSet use that helper (not the fullname helper)
for their metadata.name. Without nameOverride, the rendered names
become `dra-driver-nvidia-gpu-controller` /
`dra-driver-nvidia-gpu-kubelet-plugin`, breaking the in-tree
references that hardcode `nvidia-dra-driver-gpu-*`:

- recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
- tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
- validators/conformance/dra_support_check.go

Pinning nameOverride: nvidia-dra-driver-gpu restores the expected
rendered names with no downstream changes.

Addresses yuanchen8911's review feedback on PR #1285.
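
As an illustration of the naming logic in this commit message, here is a small model of the name helper. It is a sketch in Python rather than the chart's actual template code; the function names are hypothetical, and the `-controller` / `-kubelet-plugin` suffixes for the overridden case are inferred from the rendered names quoted above.

```python
from typing import Optional, Tuple

# Chart name of the v0.4.0 chart, per the commit message above.
CHART_NAME = "dra-driver-nvidia-gpu"


def name_helper(chart_name: str, name_override: Optional[str] = None) -> str:
    """Model of the helper described above: nameOverride || .Chart.Name."""
    return name_override or chart_name


def workload_names(chart_name: str, name_override: Optional[str] = None) -> Tuple[str, str]:
    """Names of the controller Deployment and kubelet-plugin DaemonSet."""
    base = name_helper(chart_name, name_override)
    return (f"{base}-controller", f"{base}-kubelet-plugin")


# Without nameOverride the chart name leaks into the workload names.
default_names = workload_names(CHART_NAME)

# With nameOverride pinned, the names match the hardcoded in-tree references.
pinned_names = workload_names(CHART_NAME, name_override="nvidia-dra-driver-gpu")
```

Note that `fullnameOverride` does not appear in this model at all, which mirrors the point of the fix: the two workloads derive their names from the name helper, so only `nameOverride` affects them.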
@mchmarny

mchmarny commented Jun 9, 2026

Member Author

@yuanchen8911 thanks — both correct, both fixed:

1. base.yaml still pinned old NGC source/version — fixed in 4be19a9f. The base overlay's explicit pin overrode the registry.yaml default at resolution time, producing the impossible dra-driver-nvidia-gpu @ 25.12.0 from helm.ngc.nvidia.com tuple. The KWOK Tier 2 matrix in CI caught it; make qualify doesn't exercise the full recipe install path. Worth a follow-up to see if that gap is closable locally.

2. Workload names render as dra-driver-nvidia-gpu-* — fixed in ff70989d by pinning nameOverride: nvidia-dra-driver-gpu (in addition to the existing fullnameOverride) in both values.yaml and values-oke.yaml. Confirmed via helm template: the v0.4.0 chart's controller.yaml / kubeletplugin.yaml use the dra-driver-nvidia-gpu.name helper (= nameOverride || .Chart.Name), not the fullname helper, so fullnameOverride alone isn't enough. AICR's health check, conformance validator (validators/conformance/dra_support_check.go), and chainsaw assertion all hardcode nvidia-dra-driver-gpu-*, so pinning nameOverride is the least invasive fix.

make qualify green locally on both fixes; CI re-running.

yuanchen8911
yuanchen8911 previously approved these changes Jun 9, 2026

@yuanchen8911 yuanchen8911 left a comment

Contributor


No remaining blocking issues after ff70989d. Both blockers verified fixed:

  • base.yaml overlay pin corrected to OCI / 0.4.0 (4be19a9f).
  • DRA workload names: nameOverride: nvidia-dra-driver-gpu pinned in both values.yaml and values-oke.yaml, so the v0.4.0 chart renders the nvidia-dra-driver-gpu-* names the health check, conformance validator, and chainsaw assert expect. Confirmed via helm template.

Two non-blocking cleanups remain (fine as a follow-up):

  • Stale examples — examples/recipes/aks-training.yaml and eks-gb200-ubuntu-training-with-validation.yaml still pin the DRA driver to NGC 25.12.0.
  • Docs prose — docs/user/container-images.md line 240 still lists the DRA driver under nvcr.io; line 243 should list it under registry.k8s.io.

LGTM.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 43-51: Update the top summary comment to accurately describe the
issue (it's a chart-name mismatch, not an "aicr-stack-" prefix) so it matches
the detailed explanation: replace the current line with a brief statement like
"Pin the release name prefix to match downstream assertions." or "Override chart
name to ensure rendered resource names match expected nvidia-dra-driver-gpu-*
pattern."; reference the existing nameOverride: nvidia-dra-driver-gpu and the
chart helper include "dra-driver-nvidia-gpu.name" / rendered names pattern
nvidia-dra-driver-gpu-* when making the edit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 048e0cd3-06b4-430c-b372-469e827ef83f

📥 Commits

Reviewing files that changed from the base of the PR and between 4be19a9 and ff70989.

📒 Files selected for processing (2)
  • recipes/components/nvidia-dra-driver-gpu/values-oke.yaml
  • recipes/components/nvidia-dra-driver-gpu/values.yaml

Comment thread recipes/components/nvidia-dra-driver-gpu/values.yaml Outdated
…M prose

- examples/recipes/aks-training.yaml: chart/source/version → 0.4.0
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
- docs/user/container-images.md: move DRA driver attribution from
  nvcr.io to registry.k8s.io in the Registries spanned section
- recipes/components/nvidia-dra-driver-gpu/values.yaml: rewrite the
  nameOverride/fullnameOverride comment to explain both roles
  separately, per CodeRabbit review feedback

Addresses follow-up nits on PR #1285 from @yuanchen8911 and
@coderabbitai.
@mchmarny

mchmarny commented Jun 9, 2026

Member Author

All three follow-up nits fixed in a9c25295:

  • Stale examples: examples/recipes/aks-training.yaml and examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml now pin oci://registry.k8s.io/dra-driver-nvidia/charts @ 0.4.0 (chart name also updated to dra-driver-nvidia-gpu to match the new chart's actual name).
  • BOM prose: docs/user/container-images.md Registries spanned section: DRA driver moved from the nvcr.io line to the registry.k8s.io line.
  • CodeRabbit comment cleanup: recipes/components/nvidia-dra-driver-gpu/values.yaml comment rewritten to explain nameOverride vs fullnameOverride roles separately, instead of conflating with the legacy "aicr-stack-" rationale.

make qualify green locally. cc @yuanchen8911 @coderabbitai

@github-actions github-actions Bot added size/M and removed size/S labels Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026



@mchmarny Thanks for the thorough follow-through! All three fixes look well-targeted:

  • Example files — Aligning aks-training.yaml and eks-gb200-ubuntu-training-with-validation.yaml to the OCI source with the correct chart name closes the last stale reference gap.
  • BOM prose — Moving the DRA driver entry to the registry.k8s.io section in container-images.md keeps the registry groupings accurate.
  • Comment clarity — Splitting the nameOverride and fullnameOverride explanations makes the intent in values.yaml much easier to follow for future maintainers.

With make qualify green locally and CI re-running, this looks ready. Nice clean migration overall. 🐇

@mchmarny mchmarny enabled auto-merge (squash) June 9, 2026 23:15
@mchmarny mchmarny requested a review from dims June 9, 2026 23:15
@mchmarny mchmarny requested a review from yuanchen8911 June 9, 2026 23:15

@yuanchen8911 yuanchen8911 left a comment

Contributor


Blocking DRA migration issues are resolved. Re-reviewed latest head a9c2529 and verified the base overlay, chart defaults, DRA nameOverride rendering, example updates, BOM prose, focused Go tests, and bom-check. No remaining findings from me.

@mchmarny mchmarny merged commit 7bb7059 into main Jun 9, 2026
168 of 169 checks passed
@mchmarny mchmarny deleted the feat/dra-driver-registry-migration branch June 9, 2026 23:19