wsl: report a single "all" device to kubelet #1671

Merged
elezar merged 2 commits into NVIDIA:main from elezar:wsl-single-all-device
Apr 15, 2026
Conversation

@elezar
Member

@elezar elezar commented Mar 21, 2026

On WSL, there is no isolation across different GPUs on a system, because they are all accessed through the same /dev/dxg device. This is reflected in the CDI spec generated by the NVIDIA Container Toolkit, which always contains a single all device.

This is incompatible with the device plugin when using a CDI-based device list strategy, since the device name reported by the plugin will include the device UUID or index.

The change in this PR ensures that the device plugin always reports a single device whose UUID and index are both all, making it compatible with the generated CDI spec.

elezar and others added 2 commits April 15, 2026 10:00
In order to prepare for the WSL changes, we remove the tegra resource
manager and pull the basic function implementations into the base type.
This means that the base type is essentially a resource manager that
does not support health checking and always uses distributed allocation.

Signed-off-by: Evan Lezar <[email protected]>
On WSL, all GPUs are accessed through /dev/dxg. Replace the per-GPU
wslDevice (which reported one device per physical GPU with individual
UUIDs) with a stateless wslAllGPUsDevice that always returns UUID "all"
and path "/dev/dxg". This causes the device map to collapse to a single
entry per resource, so kubelet sees exactly one GPU device on WSL.

When allocated, this flows naturally through all strategy paths
(envvar, CDI, volume mounts) to set NVIDIA_VISIBLE_DEVICES=all, which
is what nvidia-container-runtime on WSL expects.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the wsl-single-all-device branch from 43b3086 to 1bb3658 Compare April 15, 2026 08:07
@elezar elezar requested a review from rahulait April 15, 2026 08:25
elezar added a commit to NVIDIA/OpenShell that referenced this pull request Apr 15, 2026
Use ghcr.io/nvidia/k8s-device-plugin:1bb36583 which includes upstream
fixes for WSL2 CDI spec compatibility (cdiVersion and device naming),
removing the need for any local spec transformation.

See NVIDIA/k8s-device-plugin#1671.

TODO: revert to chart-default image once a released version includes
these fixes.

Signed-off-by: Evan Lezar <[email protected]>
Contributor

@rahulait rahulait left a comment


Overall, LGTM. I don't have much experience with this, so would like someone else to approve as well.

@cdesiniotis
Contributor

Does this mean that a multi-gpu WSL2 node will only report having one nvidia.com/gpu allocatable resource?

@elezar
Member Author

elezar commented Apr 15, 2026

Does this mean that a multi-gpu WSL2 node will only report having one nvidia.com/gpu allocatable resource?

Yes, that's what this means. Note that the driver does not (or at least did not) support device-level isolation on WSL, meaning that even if there were multiple devices, any container would have access to all of them.

@elezar
Member Author

elezar commented Apr 15, 2026

I was able to confirm that this works.

The node includes the following CDI spec:

```console
$ nvidia-ctk cdi list
INFO[0000] Found 1 CDI devices
k8s.device-plugin.nvidia.com/gpu=all
```

And I am able to run a GPU pod and run nvidia-smi in it.

@elezar elezar merged commit eb98db4 into NVIDIA:main Apr 15, 2026
11 checks passed
@elezar elezar deleted the wsl-single-all-device branch April 15, 2026 16:23
elezar added a commit to NVIDIA/OpenShell that referenced this pull request Apr 17, 2026
Use ghcr.io/nvidia/k8s-device-plugin:93042e1f which includes upstream
fixes for WSL2 CDI spec compatibility (cdiVersion and device naming),
removing the need for any local spec transformation.

See NVIDIA/k8s-device-plugin#1671.

TODO: revert to chart-default image once a released version includes
these fixes.

Signed-off-by: Evan Lezar <[email protected]>
@elezar
Member Author

elezar commented Apr 17, 2026

/cherry-pick release-0.19

@github-actions

🤖 Backport PR created for release-0.19: #1699

@elezar elezar mentioned this pull request Apr 20, 2026
elezar added a commit to NVIDIA/OpenShell that referenced this pull request Apr 24, 2026
v0.19.1 includes WSL2 CDI spec compatibility fixes.
See NVIDIA/k8s-device-plugin#1671.

Signed-off-by: Evan Lezar <[email protected]>
3 participants