Support nerdctl run --gpus
#251
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # Using GPUs inside containers | ||
|
|
||
| nerdctl provides docker-compatible NVIDIA GPU support. | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - NVIDIA Drivers | ||
| - Same requirement as when you use GPUs on Docker. For details, please refer to [the doc by NVIDIA](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#pre-requisites). | ||
| - `nvidia-container-cli` | ||
| - containerd relies on this CLI to set up GPUs inside the container. You can install it via the [`libnvidia-container` package](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/arch-overview.html#libnvidia-container). | ||
|
|
||
| ## Options for `nerdctl run --gpus` | ||
|
|
||
| `nerdctl run --gpus` is compatible with [`docker run --gpus`](https://docs.docker.com/engine/reference/commandline/run/#access-an-nvidia-gpu). | ||
|
|
||
| You can specify the number of GPUs to use via the `--gpus` option. | ||
| The following example exposes all available GPUs. | ||
|
|
||
| ``` | ||
| nerdctl run -it --rm --gpus all nvidia/cuda:9.0-base nvidia-smi | ||
| ``` | ||
|
|
||
| You can also pass a detailed configuration to the `--gpus` option as a comma-separated list of key-value pairs. The following keys are supported. | ||
|
|
||
| - `count`: number of GPUs to use. `all` exposes all available GPUs. | ||
| - `device`: IDs of the GPUs to use. GPU UUIDs or index numbers can be specified. | ||
| - `capabilities`: [Driver capabilities](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#driver-capabilities). If unset, `utility` is used. | ||
|
|
||
| The following example exposes a specific GPU to the container. | ||
|
|
||
| ``` | ||
| nerdctl run -it --rm --gpus capabilities=utility,device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a nvidia/cuda:9.0-base nvidia-smi | ||
| ``` | ||
|
|
||
| ## Fields for `nerdctl compose` | ||
|
|
||
| `nerdctl compose` also supports GPUs following [compose-spec](https://github.com/compose-spec/compose-spec/blob/master/deploy.md#devices). | ||
|
|
||
| You can use GPUs with compose by specifying at least one of the following `capabilities` in `services.demo.deploy.resources.reservations.devices`. | ||
|
|
||
| - `gpu` | ||
| - `nvidia` | ||
| - all allowed capabilities for `nerdctl run --gpus` | ||
|
|
||
| Available fields are the same as for `nerdctl run --gpus`. | ||
|
|
||
| The following exposes all available GPUs to the container. | ||
|
|
||
| ``` | ||
| version: "3.8" | ||
| services: | ||
|   demo: | ||
|     image: nvidia/cuda:9.0-base | ||
|     command: nvidia-smi | ||
|     deploy: | ||
|       resources: | ||
|         reservations: | ||
|           devices: | ||
|           - capabilities: ["utility"] | ||
|             count: all | ||
| ``` | ||
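To make the documented `--gpus` value format concrete, here is a minimal Go sketch of how such a value could be split into the three documented keys. It is not nerdctl's actual parser: `gpuRequest` and `parseGPUOpt` are hypothetical names, treating a bare value such as `all` as a count is an assumption based on the examples in the doc, and `-1` as the encoding of `all` is chosen only for this illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// gpuRequest is a hypothetical holder for the three documented keys.
type gpuRequest struct {
	Count        int      // -1 stands for "all" in this sketch
	Devices      []string // GPU UUIDs or index numbers
	Capabilities []string // falls back to "utility" when unset
}

// parseGPUOpt is an illustrative parser, not nerdctl's implementation.
func parseGPUOpt(value string) (*gpuRequest, error) {
	// The value is one CSV record, so a field that itself contains commas
	// (for example several capabilities) has to be quoted by the caller.
	fields, err := csv.NewReader(strings.NewReader(value)).Read()
	if err != nil {
		return nil, fmt.Errorf("invalid --gpus value %q: %w", value, err)
	}
	req := &gpuRequest{}
	for _, field := range fields {
		kv := strings.SplitN(field, "=", 2)
		// Assumption: a bare value such as "all" or "2" is a count.
		key, val := "count", kv[0]
		if len(kv) == 2 {
			key, val = kv[0], kv[1]
		}
		switch key {
		case "count":
			if val == "all" {
				req.Count = -1
				continue
			}
			n, err := strconv.Atoi(val)
			if err != nil {
				return nil, fmt.Errorf("invalid count %q: %w", val, err)
			}
			req.Count = n
		case "device":
			req.Devices = strings.Split(val, ",")
		case "capabilities":
			req.Capabilities = strings.Split(val, ",")
		default:
			return nil, fmt.Errorf("unknown key %q", key)
		}
	}
	if len(req.Capabilities) == 0 {
		// The doc states that "utility" is used when capabilities is unset.
		req.Capabilities = []string{"utility"}
	}
	return req, nil
}

func main() {
	for _, v := range []string{
		"all",
		"capabilities=utility,device=GPU-3a23c669-1f69-c64e-cf85-44e9b07e7a2a",
		`"capabilities=compute,utility",count=2`,
	} {
		req, err := parseGPUOpt(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q -> %+v\n", v, *req)
	}
}
```

The third input shows why CSV is a reasonable fit here: quoting lets one key carry several comma-separated values without being confused with the separator between keys.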
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,6 +18,7 @@ require ( | |
| github.com/containernetworking/plugins v0.9.1 | ||
| github.com/cyphar/filepath-securejoin v0.2.2 | ||
| github.com/docker/cli v20.10.7+incompatible | ||
| github.com/docker/distribution v2.7.1+incompatible // indirect | ||
| github.com/docker/docker v20.10.7+incompatible | ||
| github.com/docker/go-connections v0.4.0 | ||
| github.com/docker/go-units v0.4.0 | ||
|
|
@@ -27,13 +28,15 @@ require ( | |
| github.com/morikuni/aec v1.0.0 // indirect | ||
| github.com/opencontainers/go-digest v1.0.0 | ||
| github.com/opencontainers/image-spec v1.0.1 | ||
| github.com/opencontainers/runtime-spec v1.0.3-0.20210303205135-43e4633e40c1 | ||
| github.com/opencontainers/runtime-spec v1.0.3-0.20210326190908-1c3f411f0417 | ||
| github.com/pkg/errors v0.9.1 | ||
| github.com/rootless-containers/rootlesskit v0.14.2 | ||
| github.com/sirupsen/logrus v1.8.1 | ||
| github.com/urfave/cli/v2 v2.3.0 | ||
| golang.org/x/sync v0.0.0-20210220032951-036812b2e83c | ||
| golang.org/x/sys v0.0.0-20210420072515-93ed5bcd2bfe | ||
| golang.org/x/sys v0.0.0-20210426230700-d19ff857e887 | ||
| golang.org/x/term v0.0.0-20210406210042-72f3dc4e9b72 | ||
| gotest.tools/v3 v3.0.3 | ||
| ) | ||
|
|
||
| replace github.com/containerd/containerd => github.com/containerd/containerd v1.5.1-0.20210614183500-0a3a77bc4453 | ||
|
Member
Why replace?
Member
Author
Without replace,
Member
😞 |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,12 +17,15 @@ | |
| package serviceparser | ||
|
|
||
| import ( | ||
| "bytes" | ||
| "encoding/csv" | ||
| "fmt" | ||
| "path/filepath" | ||
| "strings" | ||
|
|
||
| "github.com/compose-spec/compose-go/types" | ||
| compose "github.com/compose-spec/compose-go/types" | ||
| "github.com/containerd/containerd/contrib/nvidia" | ||
| "github.com/containerd/containerd/identifiers" | ||
| "github.com/containerd/nerdctl/pkg/reflectutil" | ||
| "github.com/pkg/errors" | ||
|
|
@@ -104,6 +107,7 @@ func warnUnknownFields(svc compose.ServiceConfig) { | |
| } | ||
| if unknown := reflectutil.UnknownNonEmptyFields(svc.Deploy.Resources, | ||
| "Limits", | ||
| "Reservations", | ||
| ); len(unknown) > 0 { | ||
| logrus.Warnf("Ignoring: service %s: deploy.resources: %+v", svc.Name, unknown) | ||
| } | ||
|
|
@@ -115,6 +119,24 @@ func warnUnknownFields(svc compose.ServiceConfig) { | |
| logrus.Warnf("Ignoring: service %s: deploy.resources.resources: %+v", svc.Name, unknown) | ||
| } | ||
| } | ||
| if svc.Deploy.Resources.Reservations != nil { | ||
| if unknown := reflectutil.UnknownNonEmptyFields(svc.Deploy.Resources.Reservations, | ||
| "Devices", | ||
| ); len(unknown) > 0 { | ||
| logrus.Warnf("Ignoring: service %s: deploy.resources.reservations: %+v", svc.Name, unknown) | ||
| } | ||
| for i, dev := range svc.Deploy.Resources.Reservations.Devices { | ||
| if unknown := reflectutil.UnknownNonEmptyFields(dev, | ||
| "Capabilities", | ||
| "Driver", | ||
| "Count", | ||
| "IDs", | ||
| ); len(unknown) > 0 { | ||
| logrus.Warnf("Ignoring: service %s: deploy.resources.reservations.devices[%d]: %+v", | ||
| svc.Name, i, unknown) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // unknown fields of Build is checked in parseBuild(). | ||
|
|
@@ -193,6 +215,60 @@ func getMemLimit(svc compose.ServiceConfig) (types.UnitBytes, error) { | |
| return limit, nil | ||
| } | ||
|
|
||
| func getGPUs(svc compose.ServiceConfig) (reqs []string, _ error) { | ||
| // "gpu" and "nvidia" are also allowed capabilities (but not used as nvidia driver capabilities) | ||
| // https://github.com/moby/moby/blob/v20.10.7/daemon/nvidia_linux.go#L37 | ||
| capset := map[string]struct{}{"gpu": {}, "nvidia": {}} | ||
| for _, c := range nvidia.AllCaps() { | ||
| capset[string(c)] = struct{}{} | ||
| } | ||
| if svc.Deploy != nil && svc.Deploy.Resources.Reservations != nil { | ||
| for _, dev := range svc.Deploy.Resources.Reservations.Devices { | ||
| if len(dev.Capabilities) == 0 { | ||
| // "capabilities" is required. | ||
| // https://github.com/compose-spec/compose-spec/blob/74b933db994109616580eab8f47bf2ba226e0faa/deploy.md#devices | ||
| return nil, fmt.Errorf("service %s: specifying \"capabilities\" is required for resource reservations", svc.Name) | ||
| } | ||
|
|
||
| var requiresGPU bool | ||
| for _, c := range dev.Capabilities { | ||
| if _, ok := capset[c]; ok { | ||
| requiresGPU = true | ||
| } | ||
| } | ||
| if !requiresGPU { | ||
| continue | ||
| } | ||
|
|
||
| var e []string | ||
| if len(dev.Capabilities) > 0 { | ||
| e = append(e, fmt.Sprintf("capabilities=%s", strings.Join(dev.Capabilities, ","))) | ||
| } | ||
| if dev.Driver != "" { | ||
| e = append(e, fmt.Sprintf("driver=%s", dev.Driver)) | ||
| } | ||
| if len(dev.IDs) > 0 { | ||
| e = append(e, fmt.Sprintf("device=%s", strings.Join(dev.IDs, ","))) | ||
| } | ||
| if dev.Count != 0 { | ||
| e = append(e, fmt.Sprintf("count=%d", dev.Count)) | ||
| } | ||
|
|
||
|
|
||
| buf := new(bytes.Buffer) | ||
| w := csv.NewWriter(buf) | ||
| if err := w.Write(e); err != nil { | ||
| return nil, err | ||
| } | ||
| w.Flush() | ||
| o := buf.Bytes() | ||
| if len(o) > 0 { | ||
| reqs = append(reqs, string(o[:len(o)-1])) // remove the trailing newline written by csv.Writer | ||
| } | ||
| } | ||
| } | ||
| return reqs, nil | ||
| } | ||
|
|
||
| // getRestart returns `nerdctl run --restart` flag string ("no" or "always") | ||
| // | ||
| // restart: {"no" (default), "always", "on-failure", "unless-stopped"} (https://github.com/compose-spec/compose-spec/blob/167f207d0a8967df87c5ed757dbb1a2bb6025a1e/spec.md#restart) | ||
|
|
@@ -400,6 +476,14 @@ func newContainer(project *compose.Project, parsed *Service, i int) (*Container, | |
| c.RunArgs = append(c.RunArgs, fmt.Sprintf("-m=%d", memLimit)) | ||
| } | ||
|
|
||
| if gpuReqs, err := getGPUs(svc); err != nil { | ||
| return nil, err | ||
| } else if len(gpuReqs) > 0 { | ||
| for _, gpus := range gpuReqs { | ||
| c.RunArgs = append(c.RunArgs, fmt.Sprintf("--gpus=%s", gpus)) | ||
| } | ||
| } | ||
|
|
||
| for k, v := range svc.Labels { | ||
| if v == "" { | ||
| c.RunArgs = append(c.RunArgs, fmt.Sprintf("-l=%s", k)) | ||
|
|
||
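The reason `getGPUs` goes through `encoding/csv` rather than a plain `strings.Join` may not be obvious from the diff alone. The standalone sketch below reproduces only that serialization step; `encodeGPUFields` is a made-up name for this illustration and the input values are examples, not taken from the PR's tests.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"strings"
)

// encodeGPUFields joins key=value fields into a single CSV record, mirroring
// the csv.Writer usage in getGPUs. The function name is hypothetical.
func encodeGPUFields(fields []string) (string, error) {
	buf := new(bytes.Buffer)
	w := csv.NewWriter(buf)
	if err := w.Write(fields); err != nil {
		return "", err
	}
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	// csv.Writer ends each record with "\n" by default; drop it.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	caps := []string{"utility", "compute"}
	fields := []string{
		// Joining several capabilities puts a comma inside one field.
		fmt.Sprintf("capabilities=%s", strings.Join(caps, ",")),
		fmt.Sprintf("count=%d", 1),
	}
	s, err := encodeGPUFields(fields)
	if err != nil {
		panic(err)
	}
	// prints: --gpus="capabilities=utility,compute",count=1
	fmt.Println("--gpus=" + s)
}
```

Because the writer quotes any field that contains a comma, the multi-valued `capabilities` entry stays distinguishable from the separator between keys when the resulting `--gpus` argument is parsed again.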
Did you try rootless (on cgroup v1)?
I guess it needs setting `no-cgroups = true`: moby/moby#38729 (comment)

It doesn't work as of now.
We might need to patch `github.com/containerd/containerd/contrib/nvidia` to allow passing the `--no-cgroup` option to `nvidia-container-cli`.
containerd doesn't use `nvidia-container-runtime` (instead, it executes `nvidia-container-cli` directly), so we cannot use `/etc/nvidia-container-runtime/config.toml` for nerdctl.

A very hacky workaround for this is to wrap `nvidia-container-cli` to forcefully specify `--no-cgroups`.

Opened containerd/containerd#5603 for discussion.

@AkihiroSuda containerd/containerd#5604 is merged.
Updated this PR to use `--no-cgroup`, and now it works in rootless environments as well (without any additional configuration in `/etc/nvidia-container-runtime/config.toml`, etc.).
A `replace` directive is needed in `go.mod` to forcefully point to the latest commit of containerd.