Description
I originally opened this issue in the NVIDIA/nvidia-container-toolkit repo, but we concluded it may be better placed here.
Original issue: NVIDIA/nvidia-container-toolkit#434 @elezar
The Issue
When testing rootless Docker 26.0.0 with the NVIDIA Container Toolkit and NVIDIA CDI support, CDI injection fails, presumably because Docker cannot find the nvidia.yaml spec.
docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.
By default, the daemon looks in
CDI spec directories:
/etc/cdi
/var/run/cdi
but unlike the rootful daemon, the rootless one is unable to access the specs there.
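A quick way to narrow this down is to check whether the default spec directories exist and are readable for the unprivileged user. The following is a minimal sketch, not a definitive diagnosis: the function name check_cdi_dir is hypothetical, and the check runs as the current user on the host, so it may not reflect what the rootless daemon sees inside its own namespaces.

```shell
#!/bin/sh
# Report whether a CDI spec directory exists, is accessible to the
# current user, and how many spec files (*.yaml, *.json) it contains.
# check_cdi_dir is a hypothetical helper name used for illustration.
check_cdi_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "$dir: missing"
    elif [ ! -r "$dir" ] || [ ! -x "$dir" ]; then
        echo "$dir: not accessible"
    else
        count=0
        for f in "$dir"/*.yaml "$dir"/*.json; do
            # Unmatched globs stay literal and fail the -f test.
            if [ -f "$f" ]; then
                count=$((count + 1))
            fi
        done
        echo "$dir: accessible, $count spec file(s)"
    fi
}

# Default directories as reported by docker info.
for dir in /etc/cdi /var/run/cdi; do
    check_cdi_dir "$dir"
done
```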
We tested this by moving the specs to another directory and specifying the new location in the Docker daemon.json:
{
  "features": {
    "cdi": true
  },
  "cdi-spec-dirs": ["/home/username/.docker/cdi/", "/home/username/.docker/run/cdi/"],
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
CDI spec directories:
/home/username/.docker/cdi/
/home/username/.docker/run/cdi/
This seems to have solved the issue.
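The workaround can be sketched as a small helper that copies the existing specs into a user-owned directory. This is an illustration under assumptions: the function name copy_cdi_specs is hypothetical, and it assumes the specs are plain *.yaml or *.json files that the user can read in the source directory.

```shell
#!/bin/sh
# Copy CDI spec files from a source directory into a user-owned directory.
# Usage: copy_cdi_specs SRC_DIR DST_DIR
# copy_cdi_specs is a hypothetical helper name used for illustration.
copy_cdi_specs() {
    src="$1"
    dst="$2"
    mkdir -p "$dst" || return 1
    copied=0
    for spec in "$src"/*.yaml "$src"/*.json; do
        # Skip unmatched globs and files the current user cannot read.
        if [ -f "$spec" ] && [ -r "$spec" ]; then
            cp "$spec" "$dst/" || return 1
            copied=$((copied + 1))
        fi
    done
    echo "copied $copied spec file(s) to $dst"
    # Return non-zero when nothing was copied.
    [ "$copied" -gt 0 ]
}
```

It could be called as copy_cdi_specs /etc/cdi "$HOME/.docker/cdi". The target directory still has to be listed under cdi-spec-dirs in daemon.json, and the rootless daemon has to be restarted afterwards (on systemd setups, typically systemctl --user restart docker).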
Reproduce
- Install rootless Docker 26.0.0 via the install script
- Install the NVIDIA Container Toolkit according to the documentation
- Run nvidia-ctk runtime configure --runtime=docker --cdi.enabled --config=$HOME/.config/docker/daemon.json to enable CDI mode on rootless
- Check the CDI spec directories location via docker info
- Run a container with native CDI injection: docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
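For the docker info step, the relevant lines can be pulled out of the output with a short filter. This is a minimal sketch that assumes the text layout shown in this report (a "CDI spec directories:" line followed by one path per line); the function name cdi_dirs_from_info is hypothetical.

```shell
#!/bin/sh
# Print the CDI spec directories listed in docker info output read from stdin.
# cdi_dirs_from_info is a hypothetical helper name used for illustration.
cdi_dirs_from_info() {
    awk '
        # Start collecting after the section header.
        /CDI spec directories:/ { in_block = 1; next }
        # Inside the section, print each path with leading whitespace removed.
        in_block && /^[[:space:]]*\// { sub(/^[[:space:]]+/, ""); print; next }
        # Any other line ends the section.
        { in_block = 0 }
    '
}

# Only query the daemon when the docker CLI is available.
if command -v docker >/dev/null 2>&1; then
    docker info 2>/dev/null | cdi_dirs_from_info
fi
```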
Expected behavior
We expected rootless Docker to be able to perform native CDI injection by reading nvidia.yaml from its default location, or to give an indication that the default location is inaccessible in rootless mode:
/.config/docker$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)
docker version
Client:
Version: 26.0.0
API version: 1.45
Go version: go1.21.8
Git commit: 2ae903e
Built: Wed Mar 20 15:16:45 2024
OS/Arch: linux/amd64
Context: rootless
Server: Docker Engine - Community
Engine:
Version: 26.0.0
API version: 1.45 (minimum version 1.24)
Go version: go1.21.8
Git commit: 8b79278
Built: Wed Mar 20 15:18:14 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.7.13
GitCommit: 7c3aca7a610df76212171d200ca3811ff6096eb8
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
rootlesskit:
Version: 2.0.2
ApiVersion: 1.1.1
NetworkDriver: vpnkit
PortDriver: builtin
StateDir: /run/user/1010/dockerd-rootless
vpnkit:
Version: 7f0eff0dd99b576c5474de53b4454a157c642834
docker info
Client:
Version: 26.0.0
Context: rootless
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.13.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.5.0
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 3
Server Version: 26.0.0
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: false
userxattr: true
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
CDI spec directories:
/home/ver23371/.docker/cdi/
/home/ver23371/.docker/run/cdi/
Swarm: inactive
Runtimes: nvidia runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8
runc version: v1.1.12-0-g51d5e94
init version: de40ad0
Security Options:
seccomp
Profile: builtin
rootless
cgroupns
Kernel Version: 5.15.0-1047-nvidia
Operating System: Ubuntu 22.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 128
Total Memory: 503.5GiB
Name: DGX-Station-A100-920-23487-2530-0R0
ID: 48ae789a-3d2d-43d8-841a-9a34c9bdc46e
Docker Root Dir: /home/ver23371/.local/share/docker
Debug Mode: false
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support
Additional Info
No response