Default Nvidia CDI spec location on rootless kit seems to be unaccessible #47676

@LukasIAO

Description

I originally opened an issue on the Nvidia-container-toolkit repo, but we figured the issue may actually be better placed here.

Original issue: NVIDIA/nvidia-container-toolkit#434 @elezar

The Issue
When testing rootless Docker 26.0.0 with the NVIDIA Container Toolkit and CDI support enabled, CDI injection fails, presumably because Docker cannot find the nvidia.yaml spec.

docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
docker: Error response from daemon: CDI device injection failed: unresolvable CDI devices nvidia.com/gpu=all.

The daemon looks in

 CDI spec directories:
  /etc/cdi
  /var/run/cdi

by default, but unlike the rootful version, rootless Docker is unable to access the specs there.
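
A quick way to narrow this down is to check whether the default directories exist and are readable for the unprivileged user. The following is a minimal sketch; `check_cdi_dirs` is a hypothetical helper name, and it runs in the user's login shell rather than inside the rootless daemon's namespaces, so the daemon's view of these paths may still differ.

```shell
#!/bin/sh
# Report whether each given CDI spec directory is readable by the current user.
# Usage: check_cdi_dirs DIR...
check_cdi_dirs() {
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "$dir: missing"
        elif [ -r "$dir" ] && [ -x "$dir" ]; then
            echo "$dir: readable"
        else
            echo "$dir: not readable by $(id -un)"
        fi
    done
}

# Default locations reported by `docker info` in this issue.
check_cdi_dirs /etc/cdi /var/run/cdi
```

If both directories show up as readable here while injection still fails, the problem is more likely in how the rootless daemon sees those paths than in plain file permissions.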

We tested this by moving the specs to another directory and specifying the new location in the Docker daemon.json:

{
    "features": {
        "cdi": true
    },
    "cdi-spec-dirs": ["/home/username/.docker/cdi/", "/home/username/.docker/run/cdi/"],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

docker info then reports:

CDI spec directories:
  /home/username/.docker/cdi/
  /home/username/.docker/run/cdi/

This seems to have solved the issue.
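
For anyone wanting to try the same workaround, the relocation step might look roughly like the sketch below. `copy_cdi_specs` is a hypothetical helper, not part of Docker or the NVIDIA Container Toolkit, and it copies rather than moves the specs so the rootful setup stays untouched. It assumes the spec files use the `.yaml` or `.json` extension.

```shell
#!/bin/sh
# Copy CDI spec files (*.yaml, *.json) from SRC to DST, creating DST if needed.
# Usage: copy_cdi_specs SRC DST
copy_cdi_specs() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for spec in "$src"/*.yaml "$src"/*.json; do
        # Skip unmatched glob patterns and unreadable files.
        [ -r "$spec" ] || continue
        cp "$spec" "$dst"/ && echo "copied $spec"
    done
}
```

With the paths from the daemon.json above, this would be called as `copy_cdi_specs /etc/cdi "$HOME/.docker/cdi"`, followed by a restart of the rootless daemon (for a systemd user service, `systemctl --user restart docker`).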

Reproduce

  1. Install rootless Docker 26.0.0 via the install script.
  2. Install the NVIDIA Container Toolkit according to the documentation.
  3. Run nvidia-ctk runtime configure --runtime=docker --cdi.enabled --config=$HOME/.config/docker/daemon.json to enable CDI mode on rootless.
  4. Check the CDI spec directories location via docker info.
  5. Run a container with native CDI injection: docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
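
For step 4, the relevant lines can be pulled out of the `docker info` text output instead of reading the whole dump. This is only a sketch: `cdi_dirs_from_info` is a hypothetical helper, and it assumes the indented layout shown in the `docker info` output further down in this issue.

```shell
#!/bin/sh
# Print the directories listed under "CDI spec directories:" in
# `docker info` text output read from stdin, one per line.
cdi_dirs_from_info() {
    awk '
        /^ *CDI spec directories:/ {
            in_block = 1
            indent = match($0, /[^ ]/)   # column of the section header
            next
        }
        in_block {
            # Entries are indented deeper than the header; anything else ends the block.
            if (match($0, /[^ ]/) > indent) { sub(/^ +/, ""); print }
            else { in_block = 0 }
        }
    '
}
```

Usage: `docker info 2>/dev/null | cdi_dirs_from_info`.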

Expected behavior

We expected rootless Docker to be able to run native CDI injections by accessing nvidia.yaml in its default location, or to give an indication that the default location is inaccessible in rootless mode:

/.config/docker$ docker run --rm -ti --device=nvidia.com/gpu=all ubuntu nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-b6022b4d-71db-8f15-15de-26a719f6b3e1)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-22420f7d-6edb-e44a-c322-4ce539cade19)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-5e3444e2-8577-0e99-c6ee-72f6eb2bd28c)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-dd1f811d-a280-7e2e-bf7e-b84f7a977cc1)

docker version

Client:
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:16:45 2024
 OS/Arch:           linux/amd64
 Context:           rootless

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:18:14 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.7.13
  GitCommit:        7c3aca7a610df76212171d200ca3811ff6096eb8
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
 rootlesskit:
  Version:          2.0.2
  ApiVersion:       1.1.1
  NetworkDriver:    vpnkit
  PortDriver:       builtin
  StateDir:         /run/user/1010/dockerd-rootless
 vpnkit:
  Version:          7f0eff0dd99b576c5474de53b4454a157c642834

docker info

Client:
 Version:    26.0.0
 Context:    rootless
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.13.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.5.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 5
  Running: 0
  Paused: 0
  Stopped: 5
 Images: 3
 Server Version: 26.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: false
  userxattr: true
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 CDI spec directories:
  /home/ver23371/.docker/cdi/
  /home/ver23371/.docker/run/cdi/
 Swarm: inactive
 Runtimes: nvidia runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7c3aca7a610df76212171d200ca3811ff6096eb8
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  rootless
  cgroupns
 Kernel Version: 5.15.0-1047-nvidia
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 128
 Total Memory: 503.5GiB
 Name: DGX-Station-A100-920-23487-2530-0R0
 ID: 48ae789a-3d2d-43d8-841a-9a34c9bdc46e
 Docker Root Dir: /home/ver23371/.local/share/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
 Product License: Community Engine

WARNING: No cpu cfs quota support
WARNING: No cpu cfs period support
WARNING: No cpu shares support
WARNING: No cpuset support
WARNING: No io.weight support
WARNING: No io.weight (per device) support
WARNING: No io.max (rbps) support
WARNING: No io.max (wbps) support
WARNING: No io.max (riops) support
WARNING: No io.max (wiops) support

Additional Info

No response

Metadata

Labels: area/rootless, kind/bug, status/0-triage
