Add nvidia gpu support #2330

Merged
estesp merged 2 commits into containerd:master from crosbymichael:hpc on Jun 7, 2018

Conversation

@crosbymichael
Member

@crosbymichael crosbymichael commented May 7, 2018

Update:

This PR now uses the nvidia-container-cli tool and a generic hook added to containerd.

Usage:

ctr run --rm --gpus 0 docker.io/nvidia/cuda:latest gpus nvidia-smi

Signed-off-by: Michael Crosby [email protected]

@crosbymichael crosbymichael added this to the 1.2 milestone May 7, 2018
@crosbymichael crosbymichael force-pushed the hpc branch 2 times, most recently from 78ed5f3 to aa61428 Compare May 7, 2018 18:03
Comment thread contrib/hpc/infiniband/infiniband.go Outdated
Member


Check wording

Member

@dmcgowan dmcgowan left a comment


LGTM

@codecov-io

codecov-io commented May 7, 2018

Codecov Report

Merging #2330 into master will decrease coverage by 0.13%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2330      +/-   ##
==========================================
- Coverage   44.98%   44.85%   -0.14%     
==========================================
  Files          92       93       +1     
  Lines        9340     9368      +28     
==========================================
  Hits         4202     4202              
- Misses       4459     4487      +28     
  Partials      679      679
Flag        Coverage Δ
#linux      49.1% <0%> (-0.19%) ⬇️
#windows    41.24% <ø> (ø) ⬆️

Impacted Files      Coverage Δ
oci/spec_unix.go    98.37% <ø> (ø) ⬆️
oci/spec_linux.go   0% <0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a15e7a0...b949697. Read the comment docs.

@vishh

vishh commented May 7, 2018

Hey @crosbymichael, can you explain why support for GPUs is being added to containerd directly?
To give some context, I went down the road of adding GPUs to k8s via the kubelet initially, but then after realizing the complexity, switched to a plugin model in k8s.
How do you plan on generalizing to other devices?

@crosbymichael
Member Author

@vishh If you look at the implementation, it's not really a deep integration where you can say that it was added to containerd directly. It's all about how the spec is generated. I think the existing implementations kinda skew your thinking around adding GPUs to a container and make it a little more complex than it has to be. This is not their fault, they had to code around not having options in the Docker API, but at the lowest level, it's just creating the right dev nodes and mounting the correct drivers into the container.

I still have some work to do on app profiles because that involves creating a custom profile for the container and mounting it in, but the implementation as a whole is very simple. You can see that I also added InfiniBand support using this same logic. All of this is done outside of the daemon, client side for spec generation, and we can use a standard runc or any runtime that supports device nodes and mounts.
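[Editor's note: to make "client side for spec generation" concrete, here is a minimal sketch, assuming containerd's oci.SpecOpts model; the device path, major/minor numbers, and library path are illustrative assumptions, not what this PR hardcodes.]

package gpu

import (
	"context"

	"github.com/containerd/containerd/containers"
	"github.com/containerd/containerd/oci"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// withGPUDevice is a hypothetical oci.SpecOpts. It assumes the default
// containerd-generated spec, where s.Linux and s.Linux.Resources are non-nil.
func withGPUDevice(_ context.Context, _ oci.Client, _ *containers.Container, s *specs.Spec) error {
	major, minor := int64(195), int64(0)
	// Have the runtime create the character device node in the container.
	s.Linux.Devices = append(s.Linux.Devices, specs.LinuxDevice{
		Path: "/dev/nvidia0", Type: "c", Major: major, Minor: minor,
	})
	// Whitelist the node in the devices cgroup so open() is not denied.
	s.Linux.Resources.Devices = append(s.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
		Allow: true, Type: "c", Major: &major, Minor: &minor, Access: "rwm",
	})
	// Bind-mount the matching user-level driver library from the host.
	s.Mounts = append(s.Mounts, specs.Mount{
		Destination: "/usr/lib/x86_64-linux-gnu/libcuda.so.390.48",
		Source:      "/usr/lib/x86_64-linux-gnu/libcuda.so.390.48",
		Type:        "bind",
		Options:     []string{"rbind", "ro"},
	})
	return nil
}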

BTW, what complexities did you run into when you tried putting it in the kubelet?

@flx42
Contributor

flx42 commented May 7, 2018

I think the existing implementations kinda skew your thinking around adding GPUs to a container and make it a little more complex than it has to be. This is not their fault, they had to code around not having options in the Docker API, but at the lowest level, it's just creating the right dev nodes and mounting the correct drivers into the container.

Ah, that's not very nice :)
If it was this simple, we wouldn't need our own custom prestart hook for the general use case.
Some points:

  • You can't assume a single installation point for our user-level driver libraries; what you have will probably only work for a subset of Ubuntu 16.04 users. They can be scattered across the system, and you can have leftovers from previous installs.
  • NVIDIA kernel modules might not be loaded, or partially loaded.
  • Device files might not exist on the host when runc is called.
  • /dev/nvidia-uvm-tools does not exist for older driver versions. More generally, the list of user-space components to mount will vary between versions.
  • The device ordering is different between the PCI order (reported by nvidia-smi) and the device minor.
    It's often the opposite: device 0 might be /dev/nvidia7. If you have heterogeneous GPUs, it does matter.
    You have to use CUDA+NVML to figure out the proper device ordering.
  • We need to fix up our driver stack since it's not aware of namespaces; you might have seen the ModifyDeviceFiles part.
  • We hide some procfs entries when doing isolation.
  • We need to check compatibility between the image and the runtime. For instance, an application compiled against CUDA 9.1 will require driver 387; on an older driver the CUDA initialization will fail. Since this can happen after the (lengthy) CPU-side initialization, we want to prevent this as early as possible.
  • Another example is device isolation for OpenGL EGL (I think that's what you are referring to for "app profiles"). It requires a driver >= 390.
  • We need to refresh the container's ldcache inside the container, to avoid relying on LD_LIBRARY_PATH (which is often cleared by many programs as we know).

And as I mentioned on the other thread, there will be more and more driver features to support:

  • Support for Vulkan. It will probably require writing a custom ICD to the filesystem, and writing an application profile (like EGL).
  • Support for GPU sharing with MPS
  • Support for nvlink
  • Support for vGPU
  • Support for fully-containerized drivers; this requires additional logic in the runtime.

I think that InfiniBand is even more complicated. But @3XX0 is the expert here.

In the case of @vishh (GKE), they control the driver install too, so they can have a simplified approach that's tied to the way they install the driver. But for enabling all users, we can't make this assumption.

@crosbymichael
Member Author

@flx42 I meant the hook approach for everything, not the underlying implementation. Sorry if I made it sound like your implementation was overly complex. When we start getting code into containerd or Docker, all the configuration that happens inside the hooks shouldn't be needed anymore. I think splitting configuration vs runtime actions is a win for both projects. Again, really sorry that I didn't explain better.

Most of the things you listed here I'm still working on, but they are almost all configuration-level issues that can be handled before the container is started. The only thing that would HAVE to run inside the container's rootfs is an ldcache refresh. runc already has options for masking proc entries, etc.
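[Editor's note: a one-line sketch of that runc-level masking via the OCI spec; the procfs path is an illustrative assumption.]

// Mask a procfs entry via the spec instead of inside a hook.
s.Linux.MaskedPaths = append(s.Linux.MaskedPaths, "/proc/driver/nvidia")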

@flx42
Contributor

flx42 commented May 7, 2018

When we start getting code into containerd or docker, all the configuration that happens inside the hooks shouldn't be needed anymore.

Except when it can't, like for the libnetwork hook, right?

Most of the things you listed here I'm still working on but they are almost all configuration level issues that can be handled before the container is started.

Did you consider using our CLI too? We have nvidia-container-cli list:

$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia1
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.48
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.390.48
/usr/lib/x86_64-linux-gnu/libcuda.so.390.48

You can also use the command-line options to only get a subset:

$ nvidia-container-cli list --device=0
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia1

$ nvidia-container-cli list --binaries
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
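[Editor's note: a rough sketch of how that list output could drive client-side spec generation; this integration is an assumption, not code from this PR. /dev/* lines become device entries, everything else becomes a read-only bind mount.]

package gpu

import (
	"os/exec"
	"strings"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// applyNvidiaList folds `nvidia-container-cli list` output into a spec.
// Device major/minor numbers would still need a stat() on the host; they
// are omitted to keep the sketch short. Assumes s.Linux is non-nil.
func applyNvidiaList(s *specs.Spec, device string) error {
	out, err := exec.Command("nvidia-container-cli", "list", "--device="+device).Output()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.HasPrefix(line, "/dev/") {
			s.Linux.Devices = append(s.Linux.Devices, specs.LinuxDevice{Path: line, Type: "c"})
			continue
		}
		s.Mounts = append(s.Mounts, specs.Mount{
			Destination: line, Source: line, Type: "bind",
			Options: []string{"rbind", "ro"},
		})
	}
	return nil
}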

@vishh

vishh commented May 8, 2018

@crosbymichael
@flx42 gave a detailed overview of what I meant by "complexity". I did not grok that you are just making configs simpler in this PR, so you can ignore my previous comment.
I agree that designing container support for GPUs to work on traditional systems (VMs or desktop environments) is more challenging. In the k8s world, I have been suggesting making simplifying assumptions on how nodes are expected to be set up for consuming GPUs, to avoid the plethora of possible configurations we would otherwise have to deal with. That may not be applicable to containerd since it is expected to be widely adopted on traditional systems too, IIUC.
One more thing that is relevant to this thread is the lack of ability to run exec operations between the Create and Start phases of a container to update a container sandbox. My hunch is that exec could be used to handle all the runtime aspects of consuming GPUs and other devices, instead of binary OCI hooks.
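[Editor's note: a sketch of that proposed create → exec → start flow with the containerd Go client. Whether exec is permitted on a merely-created task is exactly the open question in the replies below.]

package gpu

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// createExecStart sketches the proposed flow: create the task, exec a
// setup step inside the created (but not yet started) sandbox, then start
// the container's init process.
func createExecStart(ctx context.Context, container containerd.Container) error {
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		return err
	}
	// The task is in the Created state here: the sandbox exists, but the
	// init process has not been started.
	setup, err := task.Exec(ctx, "gpu-setup",
		&specs.Process{Args: []string{"ldconfig"}, Cwd: "/"},
		cio.NewCreator(cio.WithStdio))
	if err != nil {
		return err
	}
	exitCh, err := setup.Wait(ctx)
	if err != nil {
		return err
	}
	if err := setup.Start(ctx); err != nil {
		return err
	}
	<-exitCh // setup finished; now start init
	return task.Start(ctx)
}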

@crosbymichael
Member Author

@vishh ya, perfect idea. If that does not work, it's probably a containerd bug and not a runc issue, because I remember us specifically allowing create + exec + delete as a workflow. I'll make an issue to follow up on that and get it working.

@vishh

vishh commented May 8, 2018

@mrunalp mentioned that we pivot root as part of create. I was originally surprised to hear that and then later learnt that the Spec does not define any specific behavior around that.
Glad to hear that you can help sort that out.

@crosbymichael
Member Author

Oh, that is the case of getting extra binaries in on exec. We could look at execveat or think of another solution.

@vishh

vishh commented May 8, 2018

@crosbymichael sounds good, keep me in the loop. @flx42 what's your take on using exec vs binary hooks?

@crosbymichael
Member Author

@vishh After chatting with @flx42 some tonight, I'm trying to identify what actually needs to be run inside the container vs what can happen on the host. I'll still have to implement some of this, but the only thing at the moment that has to be run inside the container is reloading the ldcache (which we could probably do in a chroot).

The ModifyDeviceFiles can be written in the container's bundle, or in one file under the runtime and bind-mounted to /proc via the spec. No need to make a tmpfs mount for each container and write a file. The kernel modules can be loaded on the host before the container is created; same with verifying driver versions, which can be done at container create time.
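[Editor's note: a sketch of that bind-mount-over-procfs idea via the spec; both paths are illustrative assumptions.]

// Bind-mount a pre-rendered params file over the procfs entry.
s.Mounts = append(s.Mounts, specs.Mount{
	Destination: "/proc/driver/nvidia/params",
	Source:      "/run/containerd/nvidia/params",
	Type:        "bind",
	Options:     []string{"rbind", "ro"},
})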

@flx42
Contributor

flx42 commented May 8, 2018

We should not focus too much on the current state of our driver; it might change. It's not impossible that in the future device isolation will not be through cgroups; it's currently an implementation detail. It could be done by echoing your own PID to a special file in our procfs folder, doing a special ioctl call, etc. I also can't guarantee that a single params file in our procfs will be fine for all containers.

I understand your approach, but that would mean it needs to be duplicated by each program building an OCI runtime spec, whereas a prestart hook is already a portable feature that must be supported by all OCI-compliant runtimes. When support for a feature is such a moving target, I think it does make sense to use prestart hooks; a hook is easier to upgrade than the full runtime or the container supervisor. Supporting all of these features will add many dependencies to the runtime/supervisor; that's why we believe a hook is a good fit.
Or maybe we should do a spec hook? :)

I think you will be able to do fewer things after the create. If you did the pivot_root, you can't do operations that require access to both the runtime and the container namespaces simultaneously. Your privileges will also be dropped, right?
In advanced cases, I think you will need to cross the runtime spec <-> runtime boundary, for instance:

  • You need to interact with the host network namespace AND the container network namespace.
  • The destination of a mount depends on the content of the filesystem (environment variables, annotations, but also the layout of the rootfs). Or going even further, the content of the filesystem might require you to not do a mount.
  • Inside your exec you don't have sufficient privileges (LSMs, userns) to perform the operations that you need, and for reasons not fully under your control you might not always be able to do that on the host.
  • If you're spawning VMs with containers inside them, prestart hooks are supposed to be run inside the VM; there is nothing meaningful you can do on the host side.

@vishh

vishh commented May 8, 2018 via email

@vikaschoudhary16

It seems possible to not use pivot_root for create:

// WithNoPivotRoot instructs the runtime not to use pivot_root
func WithNoPivotRoot(_ context.Context, _ *Client, info *TaskInfo) error {

https://github.com/opencontainers/runc/blob/63e6708c74c1cc46091ec92ea9df5aca4d82e803/libcontainer/rootfs_linux.go#L102-L107

Though I still need to understand how exec will work as an alternative to hooks, I guess a solution that is independent of the runtime is more portable and less complex for end users. The hooks approach looks more consistent across different runtimes.
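[Editor's note: a usage sketch of that option with the containerd client; `container` and `ctx` are assumed from context.]

// Ask runc to use MS_MOVE instead of pivot_root for this task.
task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio), containerd.WithNoPivotRoot)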

@crosbymichael crosbymichael changed the title Add nvidia gpu and infiniband support Add nvidia gpu support May 8, 2018
@crosbymichael
Member Author

crosbymichael commented May 8, 2018

@vishh actually it shouldn't be an issue. We can mount in ldconfig at create time, and exec processes can run as different users and with different capability sets. We can run ldconfig on exec and either leave it as a mount or umount it after we are done.

To show what I mean, *specs.Process is what is used to start the exec. Capabilities and LSMs are applied on the process, not the container. https://github.com/opencontainers/runc/pull/1788/files#diff-604f4192deea5adf9c159e5f88f64630R40
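[Editor's note: a hypothetical exec spec along those lines; the user and capability values are illustrative assumptions, not from this PR.]

// The exec's own user and capability set, independent of the init process.
p := &specs.Process{
	Args: []string{"ldconfig"},
	Cwd:  "/",
	User: specs.User{UID: 0, GID: 0},
	Capabilities: &specs.LinuxCapabilities{
		Bounding:  []string{"CAP_SYS_ADMIN"},
		Effective: []string{"CAP_SYS_ADMIN"},
		Permitted: []string{"CAP_SYS_ADMIN"},
	},
}
setup, err := task.Exec(ctx, "ldcache", p, cio.NewCreator(cio.WithStdio))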

@flx42
Contributor

flx42 commented May 8, 2018

Is it fair to say that gpus now require customization to not just the container sandbox, but also the container runtime itself which the existing hooks happen to not differentiate between?

Sorry, I don't get what you are saying here. Could you explain?

This allows many different commands to be used as OCI hooks.  It allows
these commands to template out different args and env vars so that
normal commands can accept the OCI spec State payload over stdin.

Signed-off-by: Michael Crosby <[email protected]>
@vishh

vishh commented May 8, 2018

@flx42

Sorry, I don't get what you are saying here. Could you explain?

I updated my original comment to be a bit more specific.

@crosbymichael That sounds promising. I'd love to hear what @flx42 thinks about this approach since I'd like to use such a model in k8s land as well if possible.

@flx42
Contributor

flx42 commented May 8, 2018

Thanks @vishh

Ideally, host customization (loading drivers) will be handled out of band from container runtimes

Of course that's possible. We already have those pieces in the hook and in our library/CLI, and they can be disabled. And if everything is already set up correctly, our code path in the library will be a no-op.
But again, it's not part of the OCI runtime spec. If all you have is the OCI runtime spec and you don't have the luxury of making simplifying assumptions about the state of the host, you have to put all the logic in the prestart hook.
Today you can set load-kmods = false in /etc/nvidia-container-runtime/config.toml to make sure we won't try to load kernel modules.

runtime isolation customization (namespace, cgroups changes if any) will be handled via runtime plugin

Runtime plugin? You mean a OCI hook? :)

container sandbox customization (ldconfig, CUDA version check, etc.) will be handled via exec plugins.

Ok for ldconfig, too late for the CUDA version check.

@crosbymichael
Member Author

@vishh the other option we have been discussing is using the low-level nvidia-container-cli. I have a branch here if you want to check it out:

https://github.com/containerd/containerd/compare/master...crosbymichael:nvidia-no-hook?expand=1

The general idea is that it uses the lowest-level libnvidia-container library and CLI, still as a hook, but replaces the current container hook with a generic one.

I agree with everything you said in your updated comment. I wish we could make it work that way, and that was my original goal, but I'm not sure anymore. We need more support.

@crosbymichael
Member Author

I updated this PR to point to my branch with the nvidia-container-cli tool and a generic hook within containerd to support having generic CLI binaries as OCI hooks. It templates out information for args and env so that we can keep the nvidia support client-side configuration only and not require any daemon-level code.
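[Editor's note: to give a sense of what that templating produces, a sketch of the prestart hook entry the client could generate. The containerd binary path and the exact nvidia-container-cli arguments here are assumptions, not necessarily what this PR emits.]

s.Hooks = &specs.Hooks{
	Prestart: []specs.Hook{{
		Path: "/usr/local/bin/containerd",
		Args: []string{
			// {{pid}} and {{rootfs}} are expanded by the oci-hook command
			// from the OCI State payload it reads on stdin.
			"containerd", "oci-hook", "--",
			"nvidia-container-cli", "configure",
			"--device=0", "--pid={{pid}}", "{{rootfs}}",
		},
	}},
}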

Comment thread cmd/containerd/command/oci-hook.go Outdated

var ociHook = cli.Command{
Name: "oci-hook",
Usage: "provides a base for OCI runtime hooks that allow arguements to be templated.",
Contributor


typo :)

Comment thread container_test.go
Args: []string{
"containerd",
"oci-hook", "--",
psPath, "--pid", "{{pid}}",
Contributor


Pretty neat, I like the idea :)

"bundle": t.bundle,
"rootfs": t.rootfs,
"pid": t.pid,
"annotation": t.annotation,
Contributor


Shouldn't we pass the status too?
Semi-related: NVIDIA/nvidia-container-runtime#14
Useful to make sure the hook is called on the right container state.

Comment thread contrib/nvidia/nvidia.go Outdated
Contributor


nit: our official vocabulary is "UUID", even if it's actually GPU- prefix + a UUID.

This adds nvidia gpu support via the libnvidia-container project and
`nvidia-container-cli`.

Signed-off-by: Michael Crosby <[email protected]>
var (
ctx = newTemplateContext(state)
args = []string(context.Args())
env = os.Environ()
Contributor

@flx42 flx42 May 9, 2018


Thought: should we also allow the environment to be modified?
For instance: containerd oci-hook --env ROOTFS={{rootfs}} -- myhook

edit: Ah wait, I didn't see what's below. But I don't quite get how it works for the env, maybe it's already possible.

Member Author


It does allow it. If you look a couple of lines below, we apply the template to the environment the same way as args.

Contributor


Ok, I get it, you have to pass the environment to the whole hook wrapper and the templating will be applied, sorry for the noise :)

@crosbymichael
Member Author

@containerd/containerd-maintainers please take another look at this. The implementation changed from the first time this PR was submitted.

Comment thread contrib/nvidia/nvidia.go
)

// WithGPUs adds NVIDIA gpu support to a container
func WithGPUs(opts ...Opts) oci.SpecOpts {
Member


Can we have Opts for specifying the path of the containerd binary?

Member Author


What would be the reason?

Member


The binary can be distributed with a custom name, e.g. docker-containerd, other-3rdparty-prefixed-containerd, or containerd-v1.2.

Member Author


ok, will update
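[Editor's note: for context, a client-side usage sketch of WithGPUs. The nvidia.WithDevices and nvidia.WithAllCapabilities option names follow contrib/nvidia as added by this PR, but treat them as illustrative; client, ctx, and image are assumed from context.]

// Generate the container spec with NVIDIA GPU 0 exposed.
container, err := client.NewContainer(ctx, "cuda",
	containerd.WithNewSpec(
		oci.WithImageConfig(image),
		nvidia.WithGPUs(nvidia.WithDevices(0), nvidia.WithAllCapabilities),
	),
)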

},
cli.IntFlag{
Name: "gpus",
Usage: "add gpus to the container",
Member


How about "add gpus to container, currently only nvidia gpus are supported" for clarity?

Member Author


No. Let's keep it generic so we don't have to change it back later.

Comment thread container_test.go
Args: []string{
"containerd",
"oci-hook", "--",
psPath, "--pid", "{{pid}}",


LGTM. Could we get OCI hooks supported in containerd ASAP?

Member Author


There is nothing stopping you today from adding hooks to the OCI runtime spec in containerd. This added command is just a helper that makes it a little bit easier on hook authors.

Member

@mikebrow mikebrow left a comment


LGTM
Nice!

Member

@estesp estesp left a comment


LGTM
