wsl: report a single "all" device to kubelet #1671
Force-pushed from b55dfe1 to 43b3086
In order to prepare for the WSL changes, we remove the tegra resource manager and pull the basic function implementations into the base type. This means that the base type is essentially a resource manager that does not support health checking and always uses distributed allocation.

Signed-off-by: Evan Lezar <[email protected]>
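The refactor described in this commit can be sketched in Go using struct embedding: the base type carries the defaults (no health checking, distributed allocation), and specialised managers embed it and override only what they need. Type and method names below are illustrative assumptions, not the plugin's actual API.

```go
package main

import "fmt"

// baseResourceManager is a hypothetical sketch of the base type described
// above: a resource manager that does not support health checking and
// always uses distributed allocation.
type baseResourceManager struct {
	resource string
}

// CheckHealth is a no-op: the base type does not support health checking.
func (m *baseResourceManager) CheckHealth() error { return nil }

// AllocationPolicy always reports distributed allocation in the base type.
func (m *baseResourceManager) AllocationPolicy() string { return "distributed" }

// nvmlResourceManager embeds the base type, inheriting its defaults; a
// concrete manager would override methods where its behaviour differs.
type nvmlResourceManager struct {
	baseResourceManager
}

func main() {
	m := nvmlResourceManager{baseResourceManager{resource: "nvidia.com/gpu"}}
	fmt.Println(m.AllocationPolicy())
}
```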
On WSL, all GPUs are accessed through /dev/dxg. Replace the per-GPU wslDevice (which reported one device per physical GPU with individual UUIDs) with a stateless wslAllGPUsDevice that always returns UUID "all" and path "/dev/dxg". This causes the device map to collapse to a single entry per resource, so kubelet sees exactly one GPU device on WSL. When allocated, this flows naturally through all strategy paths (envvar, CDI, volume mounts) to set NVIDIA_VISIBLE_DEVICES=all, which is what nvidia-container-runtime on WSL expects.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
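The stateless device described above can be sketched as follows. The types and method names (`wslAllGPUsDevice`, `GetUUID`, `GetPaths`) are assumptions for illustration, not the plugin's exact code; the point is that every accessor returns the same constants, so the device map collapses to one entry.

```go
package main

import "fmt"

// Device is a simplified stand-in for a plugin device entry.
type Device struct {
	UUID  string
	Paths []string
}

// wslAllGPUsDevice is a hypothetical sketch of the stateless device: on
// WSL every GPU is reached through the single /dev/dxg node, so per-GPU
// identity is meaningless and the UUID is always the literal "all".
type wslAllGPUsDevice struct{}

func (d wslAllGPUsDevice) GetUUID() string    { return "all" }
func (d wslAllGPUsDevice) GetPaths() []string { return []string{"/dev/dxg"} }

func main() {
	d := wslAllGPUsDevice{}
	// Building the device map from this device yields exactly one entry,
	// so kubelet sees a single GPU device on WSL.
	devices := map[string]Device{
		d.GetUUID(): {UUID: d.GetUUID(), Paths: d.GetPaths()},
	}
	fmt.Println(len(devices), devices["all"].Paths[0])
}
```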
Force-pushed from 43b3086 to 1bb3658
Use ghcr.io/nvidia/k8s-device-plugin:1bb36583 which includes upstream fixes for WSL2 CDI spec compatibility (cdiVersion and device naming), removing the need for any local spec transformation. See NVIDIA/k8s-device-plugin#1671. TODO: revert to chart-default image once a released version includes these fixes.

Signed-off-by: Evan Lezar <[email protected]>
rahulait left a comment:
Overall, LGTM. I don't have much experience with this, so would like someone else to approve as well.
Does this mean that a multi-GPU WSL2 node will only report having one GPU device?
Yes, that's what this means. Note that the driver does not (or at least did not) support device-level isolation on WSL, meaning that even if there were multiple devices, any container would have access to all of them.
I was able to confirm that this works: the node includes the expected CDI spec, and I am able to run a GPU pod.
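The spec itself is not reproduced in the comment above. For illustration, a CDI spec generated by the NVIDIA Container Toolkit on WSL looks roughly like the following; the exact cdiVersion and container edits (library mounts for dxcore, etc.) vary by toolkit version, so treat the values here as assumptions:

```yaml
cdiVersion: "0.5.0"
kind: nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/dxg
```

The key property is the single device named `all`, which is what the plugin's reported device ID must match.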
Use ghcr.io/nvidia/k8s-device-plugin:93042e1f which includes upstream fixes for WSL2 CDI spec compatibility (cdiVersion and device naming), removing the need for any local spec transformation. See NVIDIA/k8s-device-plugin#1671. TODO: revert to chart-default image once a released version includes these fixes. Signed-off-by: Evan Lezar <[email protected]>
/cherry-pick release-0.19
🤖 Backport PR created for release-0.19
v0.19.1 includes WSL2 CDI spec compatibility fixes. See NVIDIA/k8s-device-plugin#1671.

Signed-off-by: Evan Lezar <[email protected]>
On WSL, there is no isolation across different GPUs on a system. This is because they are all accessed through the same `/dev/dxg` device. This is reflected in the CDI spec generated by the NVIDIA Container Toolkit, which always generates a single `all` device. This is incompatible with the device plugin when using a CDI-based device list strategy, since the device name reported by the plugin will include the device UUID or index.
The change in this PR ensures that the device plugin always reports a single device whose UUID and index are `all`, so that it is compatible with the generated CDI spec.
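Once the plugin reports the single `all` device, the envvar strategy reduces to joining the allocated device IDs, which yields the `NVIDIA_VISIBLE_DEVICES=all` value that nvidia-container-runtime on WSL expects. A minimal sketch, not the plugin's actual allocation code:

```go
package main

import (
	"fmt"
	"strings"
)

// visibleDevicesEnv joins the allocated device IDs into the environment
// variable consumed by nvidia-container-runtime. With the single "all"
// device on WSL, this always produces NVIDIA_VISIBLE_DEVICES=all.
func visibleDevicesEnv(ids []string) string {
	return "NVIDIA_VISIBLE_DEVICES=" + strings.Join(ids, ",")
}

func main() {
	// On WSL the only allocatable device ID is "all".
	fmt.Println(visibleDevicesEnv([]string{"all"}))
}
```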