feat(recipes): migrate nvidia-dra-driver-gpu to registry.k8s.io v0.4.0 #1285
Project moved from NVIDIA/k8s-dra-driver-gpu to kubernetes-sigs/dra-driver-nvidia-gpu. The new release at v0.4.0 is published to registry.k8s.io via kpromo and carries Rekor-backed keyless cosign signatures on both the image and Helm chart artifact, closing the keyless-signing gap that the legacy NGC artifact did not satisfy.

Tracks kubernetes-sigs/dra-driver-nvidia-gpu#1105.

- recipes/registry.yaml: chart source NGC -> oci://registry.k8s.io, version 25.12.0 -> 0.4.0
- docs/user/component-catalog.md, docs/integrator/aks-gpu-setup.md: update upstream repo links to kubernetes-sigs
- docs/user/container-images.md: regenerated via make bom-docs; s3c example block updated for the new image
Walkthrough

This PR updates documentation, container image references, Helm registry coordinates, overlay/example componentRefs, and Helm values for the nvidia-dra-driver-gpu component to point to the kubernetes-sigs dra-driver-nvidia project and registry.k8s.io. The Helm chart source and version are changed to the registry.k8s.io OCI chart at 0.4.0, images are updated to registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0, component docs and integrator references now point to the kubernetes-sigs GitHub repo, and Helm values add nameOverride: nvidia-dra-driver-gpu.

Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 4 passed
Coverage Report ✅
No Go source files changed in this PR.
base.yaml hard-pins each helm component's source and version, which overrides registry.yaml defaults at resolution time. The earlier registry.yaml change covered new recipes but base.yaml still pinned the legacy NGC URL + 25.12.0, producing the impossible hybrid "dra-driver-nvidia-gpu @ 25.12.0 from helm.ngc.nvidia.com" at install time. KWOK matrix caught this even though make qualify did not.
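The override behavior described above can be sketched as a field-by-field merge. This is a minimal illustration, not the project's resolver: the function name `resolve_component` and the `chart`/`source`/`version` field names are assumptions made for the example, and the real resolution logic may differ.

```python
# Hypothetical model of component resolution: values pinned in an overlay
# (such as base.yaml) replace the registry.yaml defaults field by field.
def resolve_component(registry_defaults, overlay_pins):
    """Return the effective component spec; overlay pins win."""
    resolved = dict(registry_defaults)
    resolved.update(overlay_pins)
    return resolved


# Defaults after the registry.yaml change in this PR (field names assumed).
registry_defaults = {
    "chart": "dra-driver-nvidia-gpu",
    "source": "oci://registry.k8s.io/dra-driver-nvidia/charts",
    "version": "0.4.0",
}

# A stale overlay still pinning the legacy NGC coordinates.
stale_overlay = {
    "source": "https://helm.ngc.nvidia.com/nvidia",
    "version": "25.12.0",
}

# The corrected overlay pin.
fixed_overlay = {
    "source": "oci://registry.k8s.io/dra-driver-nvidia/charts",
    "version": "0.4.0",
}

hybrid = resolve_component(registry_defaults, stale_overlay)
fixed = resolve_component(registry_defaults, fixed_overlay)
print(hybrid)
print(fixed)
```

With the stale overlay, the merged result keeps the new chart name but carries the old source and version, which is the hybrid the commit message describes.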
The v0.4.0 chart's _helpers.tpl defines `dra-driver-nvidia-gpu.name` as `nameOverride || .Chart.Name`, and the controller Deployment + kubelet-plugin DaemonSet use that helper (not the fullname helper) for their metadata.name. Without nameOverride, the rendered names become `dra-driver-nvidia-gpu-controller` / `dra-driver-nvidia-gpu-kubelet-plugin`, breaking the in-tree references that hardcode `nvidia-dra-driver-gpu-*`:

- recipes/checks/nvidia-dra-driver-gpu/health-check.yaml
- tests/chainsaw/ai-conformance/common/assert-dra-driver.yaml
- validators/conformance/dra_support_check.go

Pinning nameOverride: nvidia-dra-driver-gpu restores the expected rendered names with no downstream changes.

Addresses yuanchen8911's review feedback on PR #1285.
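The naming rule in this commit message can be modeled in a few lines. This sketch only mirrors the `nameOverride || .Chart.Name` rule and the two suffixes quoted above; the Python function names are hypothetical, and any additional processing the real helper may do (such as truncation) is not modeled.

```python
from typing import List, Optional

# Chart name as quoted in the commit message.
CHART_NAME = "dra-driver-nvidia-gpu"


def chart_name(name_override: Optional[str] = None) -> str:
    """Model of the name helper: nameOverride if set, else the chart name."""
    return name_override or CHART_NAME


def workload_names(name_override: Optional[str] = None) -> List[str]:
    """Rendered metadata.name values for the controller and kubelet plugin."""
    base = chart_name(name_override)
    return [f"{base}-controller", f"{base}-kubelet-plugin"]


# Without an override, names follow the new chart name.
print(workload_names())
# With the pin from this PR, names match the in-tree references.
print(workload_names("nvidia-dra-driver-gpu"))
```

An empty override falls back to the chart name, so the pin has to be set explicitly in each values file that is used.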
@yuanchen8911 thanks — both correct, both fixed: 1. 2. Workload names render as
yuanchen8911 left a comment
No remaining blocking issues after ff70989d. Both blockers verified fixed:
- `base.yaml` overlay pin corrected to OCI / `0.4.0` (4be19a9f).
- DRA workload names: `nameOverride: nvidia-dra-driver-gpu` pinned in both `values.yaml` and `values-oke.yaml`, so the v0.4.0 chart renders the `nvidia-dra-driver-gpu-*` names the health check, conformance validator, and chainsaw assert expect. Confirmed via `helm template`.
Two non-blocking cleanups remain (fine as a follow-up):
- Stale examples: `examples/recipes/aks-training.yaml` and `eks-gb200-ubuntu-training-with-validation.yaml` still pin the DRA driver to NGC `25.12.0`.
- Docs prose: `docs/user/container-images.md` line 240 still lists the DRA driver under `nvcr.io`; line 243 should list it under `registry.k8s.io`.
LGTM.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/components/nvidia-dra-driver-gpu/values.yaml`:
- Around line 43-51: Update the top summary comment to accurately describe the
issue (it's a chart-name mismatch, not an "aicr-stack-" prefix) so it matches
the detailed explanation: replace the current line with a brief statement like
"Pin the release name prefix to match downstream assertions." or "Override chart
name to ensure rendered resource names match expected nvidia-dra-driver-gpu-*
pattern."; reference the existing nameOverride: nvidia-dra-driver-gpu and the
chart helper include "dra-driver-nvidia-gpu.name" / rendered names pattern
nvidia-dra-driver-gpu-* when making the edit.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 048e0cd3-06b4-430c-b372-469e827ef83f
📒 Files selected for processing (2)
- recipes/components/nvidia-dra-driver-gpu/values-oke.yaml
- recipes/components/nvidia-dra-driver-gpu/values.yaml
…M prose

- examples/recipes/aks-training.yaml: chart/source/version → 0.4.0
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
- docs/user/container-images.md: move DRA driver attribution from nvcr.io to registry.k8s.io in the Registries spanned section
- recipes/components/nvidia-dra-driver-gpu/values.yaml: rewrite the nameOverride/fullnameOverride comment to explain both roles separately, per CodeRabbit review feedback

Addresses follow-up nits on PR #1285 from @yuanchen8911 and @coderabbitai.
All three follow-up nits fixed in
yuanchen8911 left a comment
Blocking DRA migration issues are resolved. Re-reviewed latest head a9c2529 and verified the base overlay, chart defaults, DRA nameOverride rendering, example updates, BOM prose, focused Go tests, and bom-check. No remaining findings from me.
Summary
Migrate `nvidia-dra-driver-gpu` from the legacy NGC chart (`helm.ngc.nvidia.com/nvidia`, v25.12.0) to the upstream `oci://registry.k8s.io/dra-driver-nvidia/charts` chart (v0.4.0). The new release carries Rekor-backed keyless cosign signatures on both the image and Helm chart artifact.

Motivation / Context

The DRA driver project moved from `NVIDIA/k8s-dra-driver-gpu` to `kubernetes-sigs/dra-driver-nvidia-gpu` and now publishes to `registry.k8s.io` via `kpromo`. The legacy NGC artifact only had a key-based signature (no Fulcio cert, no Rekor entry) and could not be verified keylessly. The new artifact closes that gap on the AICR side; SLSA provenance and SBOM attestations are still missing upstream and tracked there.

Related: kubernetes-sigs/dra-driver-nvidia-gpu#1105
Tracks: #745 (provenance audit per component)
Type of Change
Component(s) Affected
- `pkg/recipe`
- `docs/`, `examples/`

Implementation Notes

All existing values in `recipes/components/nvidia-dra-driver-gpu/values.yaml` (`fullnameOverride`, `nvidiaDriverRoot`, `gpuResourcesEnabledOverride`, `resources.gpus.enabled`, `controller.priorityClassName`, `kubeletPlugin.priorityClassName`) are confirmed compatible with the v0.4.0 chart structure; no values file changes needed.

Verification of upstream keyless signing (confirmed today against the live artifacts):
Testing
All tests, lint, e2e (22 chainsaw tests), and vulnerability scan pass on the rebased branch.
Risk Assessment
Rollout notes: Chart version jump (calendar-versioned `25.12.0` → semver `0.4.0`) reflects the upstream renumbering on the move to kubernetes-sigs; same chart, same runtime behavior. Revert is a single-commit revert of `recipes/registry.yaml`.

Checklist

- `make test` with `-race`
- `make lint`
- `git commit -S`