Support nerdctl run --gpus #251

Merged
AkihiroSuda merged 2 commits into containerd:master from ktock:gpus
Jun 15, 2021

ktock commented Jun 14, 2021

Fixes: #248

This PR adds the --gpus option to nerdctl run, based on containerd's GPU support (github.com/containerd/containerd/contrib/nvidia, added in containerd/containerd#2330).

# nerdctl run --gpus all --rm -it nvidia/cuda:9.0-base nvidia-smi

For compose (https://github.com/compose-spec/compose-spec/blob/master/deploy.md#devices):

version: "3.8"
services:
  demo:
    image: nvidia/cuda:9.0-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: ["utility"]
            driver: nvidia
            count: all

nvidia-container-cli must be installed on the host.

@AkihiroSuda
Member

Comment thread README.md Outdated
- :whale: `--shm-size`: Size of `/dev/shm`

GPU flags:
- :whale: `--gpus`: GPU devices to add to the container ('all' to pass all GPUs). `nvidia-container-cli` is needed.
Member

Can we have ./docs/gpu.md to explain all the options?

We should also clarify how to set up GPU for rootless. (Can be another PR)

ktock marked this pull request as draft June 14, 2021 08:55
ktock force-pushed the gpus branch 5 times, most recently from 631261e to 5e3075c June 14, 2021 11:49
ktock marked this pull request as ready for review June 14, 2021 11:49

ktock commented Jun 14, 2021

Added compose support and docs.

Comment thread docs/gpu.md Outdated
The following example exposes all available GPUs.

```
nerdctl run -it --rm --gpus all ubuntu:20.04 nvidia-smi
```
Member

ubuntu:20.04 -> nvidia/cuda:9.0-base might be more useful?

Member Author

Fixed this.

Comment thread docs/gpu.md
- NVIDIA Drivers
  - Same requirement as when you use GPUs on Docker. For details, please refer to [the doc by NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#pre-requisites).
- `nvidia-container-cli`
  - containerd relies on this CLI for setting up GPUs inside the container. You can install it via the [`libnvidia-container` package](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/arch-overview.html#libnvidia-container).
Member

Did you try rootless (on cgroup v1)?

I guess it needs no-cgroups = true to be set (see moby/moby#38729 (comment)).

Member Author

It doesn't work as of now.

We might need to patch github.com/containerd/containerd/contrib/nvidia to allow passing the --no-cgroups option to nvidia-container-cli.

containerd doesn't use nvidia-container-runtime (it executes nvidia-container-cli directly instead), so we cannot use /etc/nvidia-container-runtime/config.toml for nerdctl.

Member Author

A very hacky workaround for this is to wrap nvidia-container-cli to forcefully specify --no-cgroups.

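mkdir -p /opt/nvidia/bin
mv /usr/bin/nvidia-container-cli /opt/nvidia/bin/
cat <<'EOF' > /usr/bin/nvidia-container-cli
#!/bin/bash
# Forward all arguments unchanged, inserting --no-cgroups just before the last one.
/opt/nvidia/bin/nvidia-container-cli "${@:1:$#-1}" --no-cgroups "${@:$#}"
EOF
# The wrapper must be executable so containerd can invoke it.
chmod +x /usr/bin/nvidia-container-cli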

Member

Opened containerd/containerd#5603 for discussion

Member Author

@ktock ktock Jun 15, 2021

@AkihiroSuda containerd/containerd#5604 is merged.
Updated this PR to use --no-cgroups, and it now works in rootless environments as well (without any additional configuration in /etc/nvidia-container-runtime/config.toml, etc.).

A replace directive is needed in go.mod to force it to point to the latest commit of containerd.

@AkihiroSuda
Member

I'll release nerdctl v0.9.0 after merging this.

Signed-off-by: Kohei Tokunaga <[email protected]>
if dev.Count != 0 {
	e = append(e, fmt.Sprintf("count=%d", dev.Count))
}

Member

@fahedouch fahedouch Jun 14, 2021

count and device_ids are mutually exclusive, so only one of them should be set at a time. Is that enforced somewhere?

fahedouch commented Jun 14, 2021

> I'll release nerdctl v0.9.0 after merging this.

@AkihiroSuda I will clear ticket #239 tonight. It would be good to have it in 0.9 :)

ktock marked this pull request as draft June 15, 2021 00:37
ktock marked this pull request as ready for review June 15, 2021 01:07
Comment thread go.mod
gotest.tools/v3 v3.0.3
)

replace github.com/containerd/containerd => github.com/containerd/containerd v1.5.1-0.20210614183500-0a3a77bc4453
Member

Why replace?

Member Author

Without replace, go mod tidy wants to point to v1.5.2.

Member

😞

Member

@AkihiroSuda AkihiroSuda left a comment

Thanks

AkihiroSuda merged commit 40cce9e into containerd:master Jun 15, 2021
ktock deleted the gpus branch June 15, 2021 06:51
Development

Successfully merging this pull request may close these issues.

Support NVIDIA GPUs (nerdctl run --gpus)

3 participants