Add nvidia gpu support #2330

Merged
estesp merged 2 commits into containerd:master from crosbymichael:hpc on Jun 7, 2018

Conversation

@crosbymichael
Member

@crosbymichael crosbymichael commented May 7, 2018

Update:

This PR now uses the nvidia-container-cli tool and a generic hook added to containerd.

Usage:

ctr run --rm --gpus 0 docker.io/nvidia/cuda:latest gpus nvidia-smi

Signed-off-by: Michael Crosby [email protected]

@crosbymichael crosbymichael added this to the 1.2 milestone May 7, 2018
@crosbymichael crosbymichael force-pushed the hpc branch 2 times, most recently from 78ed5f3 to aa61428 Compare May 7, 2018 18:03
Comment thread contrib/hpc/infiniband/infiniband.go Outdated
Member


Check wording

Member

@dmcgowan dmcgowan left a comment


LGTM

@codecov-io

codecov-io commented May 7, 2018

Codecov Report

Merging #2330 into master will decrease coverage by 0.13%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2330      +/-   ##
==========================================
- Coverage   44.98%   44.85%   -0.14%     
==========================================
  Files          92       93       +1     
  Lines        9340     9368      +28     
==========================================
  Hits         4202     4202              
- Misses       4459     4487      +28     
  Partials      679      679
Flag        Coverage Δ
#linux      49.1% <0%> (-0.19%) ⬇️
#windows    41.24% <ø> (ø) ⬆️

Impacted Files      Coverage Δ
oci/spec_unix.go    98.37% <ø> (ø) ⬆️
oci/spec_linux.go   0% <0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a15e7a0...b949697. Read the comment docs.

@vishh

vishh commented May 7, 2018

Hey @crosbymichael, can you explain why support for GPUs is being added to containerd directly?
To give some context, I went down the road of adding GPUs to k8s via the kubelet initially, but then after realizing the complexity, switched to a plugin model in k8s.
How do you plan on generalizing to other devices?

@crosbymichael
Member Author

@vishh If you look at the implementation, it's not really a deep integration where you can say that it was added to containerd directly. It's all about how the spec is generated. I think the existing implementations kinda skew your thinking around adding GPUs to a container and make it a little more complex than it has to be. This is not their fault, they had to code around not having options in the Docker API, but at the lowest level, it's just creating the right dev nodes and mounting the correct drivers into the container.

I still have some work to do on app profiles because that involves creating a custom profile for the container and mounting it in, but the implementation as a whole is very simple. You can see that I also added InfiniBand support using this same logic. All of this is done outside of the daemon, client side for spec generation, and we can use a standard runc or any runtime that supports device nodes and mounts.
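[Editor's note: to make "client side for spec generation" concrete, here is a minimal sketch, assuming containerd's oci.SpecOpts model; the device path, major/minor numbers, and library path are illustrative assumptions, not what this PR hardcodes.]

package gpu

import (
	"context"

	"github.com/containerd/containerd/containers"
	"github.com/containerd/containerd/oci"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// withGPUDevice is a hypothetical oci.SpecOpts. It assumes the default
// containerd-generated spec, where s.Linux and s.Linux.Resources are non-nil.
func withGPUDevice(_ context.Context, _ oci.Client, _ *containers.Container, s *specs.Spec) error {
	major, minor := int64(195), int64(0)
	// Have the runtime create the character device node in the container.
	s.Linux.Devices = append(s.Linux.Devices, specs.LinuxDevice{
		Path: "/dev/nvidia0", Type: "c", Major: major, Minor: minor,
	})
	// Whitelist the node in the devices cgroup so open() is not denied.
	s.Linux.Resources.Devices = append(s.Linux.Resources.Devices, specs.LinuxDeviceCgroup{
		Allow: true, Type: "c", Major: &major, Minor: &minor, Access: "rwm",
	})
	// Bind-mount the matching user-level driver library from the host.
	s.Mounts = append(s.Mounts, specs.Mount{
		Destination: "/usr/lib/x86_64-linux-gnu/libcuda.so.390.48",
		Source:      "/usr/lib/x86_64-linux-gnu/libcuda.so.390.48",
		Type:        "bind",
		Options:     []string{"rbind", "ro"},
	})
	return nil
}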

BTW, what complexities did you run into when you tried putting it in the kubelet?

@flx42
Contributor

flx42 commented May 7, 2018

I think the existing implementations kinda skew your thinking around adding GPUs to a container and make it a little more complex than it has to be. This is not their fault, they had to code around not having options in the Docker API, but at the lowest level, it's just creating the right dev nodes and mounting the correct drivers into the container.

Ah, that's not very nice :)
If it was this simple, we wouldn't need our own custom prestart hook for the general use case.
Some points:

  • You can't assume a single installation point for our user-level driver libraries; what you have will probably only work for a subset of Ubuntu 16.04 users. They can be scattered across the system, and you can have leftovers from previous installs.
  • NVIDIA kernel modules might not be loaded, or partially loaded.
  • Device files might not exist on the host when runc is called.
  • /dev/nvidia-uvm-tools does not exist for older driver versions. More generally, the list of user-space components to mount will vary between versions.
  • The device ordering is different between the PCI order (reported by nvidia-smi) and the device minor.
    It's often the opposite: device 0 might be /dev/nvidia7. If you have heterogeneous GPUs, it does matter.
    You have to use CUDA+NVML to figure out the proper device ordering.
  • We need to fix up our driver stack since it's not aware of namespaces; you might have seen the ModifyDeviceFiles part.
  • We hide some procfs entries when doing isolation.
  • We need to check compatibility between the image and the runtime. For instance, an application compiled against CUDA 9.1 will require driver 387; on an older driver the CUDA initialization will fail. Since this can happen after the (lengthy) CPU-side initialization, we want to prevent this as early as possible.
  • Another example is device isolation for OpenGL EGL (I think that's what you are referring to for "app profiles"). It requires a driver >= 390.
  • We need to refresh the container's ldcache inside the container, to avoid relying on LD_LIBRARY_PATH (which is often cleared by many programs as we know).

And as I mentioned on the other thread, there will be more and more driver features to support:

  • Support for Vulkan. It will probably require writing a custom ICD to the filesystem, and writing an application profile (like EGL).
  • Support for GPU sharing with MPS
  • Support for nvlink
  • Support for vGPU
  • Support for fully-containerized drivers; this requires additional logic in the runtime.

I think that InfiniBand is even more complicated. But @3XX0 is the expert here.

In the case of @vishh (GKE), they control the driver install too, so they can have a simplified approach that's tied to the way they install the driver. But for enabling all users, we can't make this assumption.

@crosbymichael
Member Author

@flx42 I meant the hook approach for everything, not the underlying implementation. Sorry if I made it sound like your implementation was overly complex. When we start getting code into containerd or Docker, all the configuration that happens inside the hooks shouldn't be needed anymore. I think splitting configuration vs runtime actions is a win for both projects. Again, really sorry that I didn't explain better.

Most of the things you listed here I'm still working on, but they are almost all configuration-level issues that can be handled before the container is started. The only thing that would HAVE to run inside the container's rootfs is an ldcache refresh. runc already has options for masking proc entries, etc.
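[Editor's note: a one-line sketch of that runc-level masking via the OCI spec; the procfs path is an illustrative assumption.]

// Mask a procfs entry via the spec instead of inside a hook.
s.Linux.MaskedPaths = append(s.Linux.MaskedPaths, "/proc/driver/nvidia")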

@flx42
Contributor

flx42 commented May 7, 2018

When we start getting code into containerd or docker, all the configuration that happens inside the hooks shouldn't be needed anymore.

Except when it can't, like for the libnetwork hook, right?

Most of the things you listed here I'm still working on but they are almost all configuration level issues that can be handled before the container is started.

Did you consider using our CLI too? We have nvidia-container-cli list:

$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia1
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.48
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.390.48
/usr/lib/x86_64-linux-gnu/libcuda.so.390.48

You can also use the command-line options to only get a subset:

$ nvidia-container-cli list --device=0
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia1

$ nvidia-container-cli list --binaries
/usr/bin/nvidia-smi
/usr/bin/nvidia-debugdump
/usr/bin/nvidia-persistenced
/usr/bin/nvidia-cuda-mps-control
/usr/bin/nvidia-cuda-mps-server
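[Editor's note: a rough sketch of how that list output could drive client-side spec generation; this integration is an assumption, not code from this PR. /dev/* lines become device entries, everything else becomes a read-only bind mount.]

package gpu

import (
	"os/exec"
	"strings"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// applyNvidiaList folds `nvidia-container-cli list` output into a spec.
// Device major/minor numbers would still need a stat() on the host; they
// are omitted to keep the sketch short. Assumes s.Linux is non-nil.
func applyNvidiaList(s *specs.Spec, device string) error {
	out, err := exec.Command("nvidia-container-cli", "list", "--device="+device).Output()
	if err != nil {
		return err
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.HasPrefix(line, "/dev/") {
			s.Linux.Devices = append(s.Linux.Devices, specs.LinuxDevice{Path: line, Type: "c"})
			continue
		}
		s.Mounts = append(s.Mounts, specs.Mount{
			Destination: line, Source: line, Type: "bind",
			Options: []string{"rbind", "ro"},
		})
	}
	return nil
}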

@vishh

vishh commented May 8, 2018

@crosbymichael
@flx42 gave a detailed overview of what I meant by "complexity". I did not grok that you are just making configs simpler in this PR, so you can ignore my previous comment.
I agree that designing container support for GPUs to work on traditional systems (VMs or desktop environments) is more challenging. In the k8s world, I have been suggesting making simplifying assumptions on how nodes are expected to be set up for consuming GPUs, to avoid the plethora of possible configurations we would otherwise have to deal with. That may not be applicable to containerd since it is expected to be widely adopted on traditional systems too, IIUC.
One more thing that is relevant to this thread is the lack of ability to run exec operations between the Create and Start phases of a container to update a container sandbox. My hunch is that exec could be used to handle all the runtime aspects of consuming GPUs and other devices, instead of binary OCI hooks.
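[Editor's note: a sketch of that proposed create → exec → start flow with the containerd Go client. Whether exec is permitted on a merely-created task is exactly the open question in the replies below.]

package gpu

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/cio"
	specs "github.com/opencontainers/runtime-spec/specs-go"
)

// createExecStart sketches the proposed flow: create the task, exec a
// setup step inside the created (but not yet started) sandbox, then start
// the container's init process.
func createExecStart(ctx context.Context, container containerd.Container) error {
	task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio))
	if err != nil {
		return err
	}
	// The task is in the Created state here: the sandbox exists, but the
	// init process has not been started.
	setup, err := task.Exec(ctx, "gpu-setup",
		&specs.Process{Args: []string{"ldconfig"}, Cwd: "/"},
		cio.NewCreator(cio.WithStdio))
	if err != nil {
		return err
	}
	exitCh, err := setup.Wait(ctx)
	if err != nil {
		return err
	}
	if err := setup.Start(ctx); err != nil {
		return err
	}
	<-exitCh // setup finished; now start init
	return task.Start(ctx)
}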

@crosbymichael
Member Author

@vishh ya, perfect idea. If that does not work, it's probably a containerd bug and not a runc issue, because I remember us specifically allowing create + exec + delete as a workflow. I'll make an issue to follow up on that and get it working.

@vishh

vishh commented May 8, 2018

@mrunalp mentioned that we pivot root as part of create. I was originally surprised to hear that and then later learnt that the Spec does not define any specific behavior around that.
Glad to hear that you can help sort that out.

@crosbymichael
Member Author

Oh, that is the case of getting extra binaries in on exec. We could look at execveat or think of another solution.

@vishh

vishh commented May 8, 2018

@crosbymichael sounds good, keep me in the loop. @flx42 what's your take on using exec vs binary hooks?

@crosbymichael
Member Author

@vishh After chatting with @flx42 some tonight, I'm trying to identify what actually needs to be run inside the container vs what can happen on the host. I'll still have to implement some of this, but the only thing at the moment that has to be run inside the container is reloading the ldcache (which we could probably do in a chroot).

The ModifyDeviceFiles can be written in the container's bundle, or in one file under the runtime and bind-mounted to /proc via the spec. No need to make a tmpfs mount for each container and write a file. The kernel modules can be loaded on the host before the container is created; same with verifying driver versions, which can be done at container create time.
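[Editor's note: a sketch of that bind-mount-over-procfs idea via the spec; both paths are illustrative assumptions.]

// Bind-mount a pre-rendered params file over the procfs entry.
s.Mounts = append(s.Mounts, specs.Mount{
	Destination: "/proc/driver/nvidia/params",
	Source:      "/run/containerd/nvidia/params",
	Type:        "bind",
	Options:     []string{"rbind", "ro"},
})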

@flx42
Contributor

flx42 commented May 8, 2018

We should not focus too much on the current state of our driver; it might change. It's not impossible that in the future device isolation will not be through cgroups; it's currently an implementation detail. It could be done by echoing your own PID to a special file in our procfs folder, doing a special ioctl call, etc. I also can't guarantee that a single params file in our procfs will be fine for all containers.

I understand your approach, but that would mean it needs to be duplicated by each program building an OCI runtime spec, whereas a prestart hook is already a portable feature that must be supported by all OCI-compliant runtimes. When support for a feature is such a moving target, I think it does make sense to use prestart hooks; a hook is easier to upgrade than the full runtime or the container supervisor. Supporting all of these features will add many dependencies to the runtime/supervisor; that's why we believe a hook is a good fit.
Or maybe we should do a spec hook? :)

I think you will be able to do fewer things after the create. If you did the pivot_root, you can't do operations that require access to both the runtime and the container namespaces simultaneously. Your privileges will also be dropped, right?
In advanced cases, I think you will need to cross the runtime spec <-> runtime boundary, for instance:

  • You need to interact with the host network namespace AND the container network namespace.
  • The destination of a mount depends on the content of the filesystem (environment variables, annotations, but also the layout of the rootfs). Or going even further, the content of the filesystem might require you to not do a mount.
  • Inside your exec you don't have sufficient privileges (LSMs, userns) to perform the operations that you need, and for reasons not fully under your control you might not always be able to do that on the host.
  • If you're spawning VMs with containers inside them, prestart hooks are supposed to be run inside the VM; there is nothing meaningful you can do on the host side.

@vishh

vishh commented May 8, 2018 via email

@vikaschoudhary16

It seems possible to not use pivot_root for create:

// WithNoPivotRoot instructs the runtime not to use pivot_root
func WithNoPivotRoot(_ context.Context, _ *Client, info *TaskInfo) error {

https://github.com/opencontainers/runc/blob/63e6708c74c1cc46091ec92ea9df5aca4d82e803/libcontainer/rootfs_linux.go#L102-L107

Though I still need to understand how exec will work as an alternative to hooks, I guess a solution that is independent of the runtime is more portable and less complex for end users. The hooks approach looks more consistent across different runtimes.
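[Editor's note: a usage sketch of that option with the containerd client; `container` and `ctx` are assumed from context.]

// Ask runc to use MS_MOVE instead of pivot_root for this task.
task, err := container.NewTask(ctx, cio.NewCreator(cio.WithStdio), containerd.WithNoPivotRoot)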

@crosbymichael crosbymichael changed the title Add nvidia gpu and infiniband support Add nvidia gpu support May 8, 2018
@crosbymichael
Member Author

crosbymichael commented May 8, 2018

@vishh actually it shouldn't be an issue. We can mount in ldconfig at create time, and exec processes can run as different users and with different capability sets. We can run ldconfig on exec and either leave it as a mount or umount it after we are done.

To show what I mean, *specs.Process is what is used to start the exec. Capabilities and LSMs are applied on the process, not the container. https://github.com/opencontainers/runc/pull/1788/files#diff-604f4192deea5adf9c159e5f88f64630R40
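[Editor's note: a hypothetical exec spec along those lines; the user and capability values are illustrative assumptions, not from this PR.]

// The exec's own user and capability set, independent of the init process.
p := &specs.Process{
	Args: []string{"ldconfig"},
	Cwd:  "/",
	User: specs.User{UID: 0, GID: 0},
	Capabilities: &specs.LinuxCapabilities{
		Bounding:  []string{"CAP_SYS_ADMIN"},
		Effective: []string{"CAP_SYS_ADMIN"},
		Permitted: []string{"CAP_SYS_ADMIN"},
	},
}
setup, err := task.Exec(ctx, "ldcache", p, cio.NewCreator(cio.WithStdio))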

@flx42
Contributor

flx42 commented May 8, 2018

Is it fair to say that gpus now require customization to not just the container sandbox, but also the container runtime itself which the existing hooks happen to not differentiate between?

Sorry, I don't get what you are saying here. Could you explain?

This allows many different commands to be used as OCI hooks.  It allows
these commands to template out different args and env vars so that
normal commands can accept the OCI spec State payload over stdin.

Signed-off-by: Michael Crosby <[email protected]>
@vishh

vishh commented May 8, 2018

@flx42

Sorry, I don't get what you are saying here. Could you explain?

I updated my original comment to be a bit more specific.

@crosbymichael That sounds promising. I'd love to hear what @flx42 thinks about this approach since I'd like to use such a model in k8s land as well if possible.

@flx42
Contributor

flx42 commented May 8, 2018

Thanks @vishh

Ideally, host customization (loading drivers) will be handled out of band from container runtimes

Of course that's possible. We already have those pieces in the hook and in our library/CLI, and they can be disabled. And if everything is already set up correctly, our code path in the library will be a no-op.
But again, it's not part of the OCI runtime spec. If all you have is the OCI runtime spec and you don't have the luxury of making simplifying assumptions about the state of the host, you have to put all the logic in the prestart hook.
Today you can set load-kmods = false in /etc/nvidia-container-runtime/config.toml to make sure we won't try to load kernel modules.

runtime isolation customization (namespace, cgroups changes if any) will be handled via runtime plugin

Runtime plugin? You mean a OCI hook? :)

container sandbox customization (ldconfig, CUDA version check, etc.) will be handled via exec plugins.

Ok for ldconfig, too late for the CUDA version check.

@crosbymichael
Member Author

@vishh the other option we have been discussing is using the low-level nvidia-container-cli. I have a branch here if you want to check it out:

https://github.com/containerd/containerd/compare/master...crosbymichael:nvidia-no-hook?expand=1

The general idea is that it uses the lowest-level libnvidia-container library and CLI, still as a hook, but replaces the current container hook with a generic one.

I agree with everything you said in your updated comment. I wish we could make it work that way, and that was my original goal, but I'm not sure anymore. We need more support.

@crosbymichael
Member Author

I updated this PR to point to my branch with the nvidia-container-cli tool and a generic hook within containerd to support having generic CLI binaries as OCI hooks. It templates out information for args and env so that we can keep the nvidia support client-side configuration only and not require any daemon-level code.
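[Editor's note: to give a sense of what that templating produces, a sketch of the prestart hook entry the client could generate. The containerd binary path and the exact nvidia-container-cli arguments here are assumptions, not necessarily what this PR emits.]

s.Hooks = &specs.Hooks{
	Prestart: []specs.Hook{{
		Path: "/usr/local/bin/containerd",
		Args: []string{
			// {{pid}} and {{rootfs}} are expanded by the oci-hook command
			// from the OCI State payload it reads on stdin.
			"containerd", "oci-hook", "--",
			"nvidia-container-cli", "configure",
			"--device=0", "--pid={{pid}}", "{{rootfs}}",
		},
	}},
}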

Comment thread cmd/containerd/command/oci-hook.go Outdated

var ociHook = cli.Command{
Name: "oci-hook",
Usage: "provides a base for OCI runtime hooks that allow arguements to be templated.",
Contributor


typo :)

Comment thread container_test.go
Args: []string{
"containerd",
"oci-hook", "--",
psPath, "--pid", "{{pid}}",
Contributor


Pretty neat, I like the idea :)

"bundle": t.bundle,
"rootfs": t.rootfs,
"pid": t.pid,
"annotation": t.annotation,
Contributor


Shouldn't we pass the status too?
Semi-related: NVIDIA/nvidia-container-runtime#14
Useful to make sure the hook is called on the right container state.

Comment thread contrib/nvidia/nvidia.go Outdated
Contributor


nit: our official vocabulary is "UUID", even if it's actually GPU- prefix + a UUID.

This adds nvidia gpu support via the libnvidia-container project and
`nvidia-container-cli`.

Signed-off-by: Michael Crosby <[email protected]>
var (
ctx = newTemplateContext(state)
args = []string(context.Args())
env = os.Environ()
Contributor

@flx42 flx42 May 9, 2018


Thought: should we also allow the environment to be modified?
For instance: containerd oci-hook --env ROOTFS={{rootfs}} -- myhook

edit: Ah wait, I didn't see what's below. But I don't quite get how it works for the env, maybe it's already possible.

Member Author


It does allow it. If you look a couple of lines below, we apply the template to the environment the same way as args.

Contributor


Ok, I get it, you have to pass the environment to the whole hook wrapper and the templating will be applied, sorry for the noise :)

@crosbymichael
Member Author

@containerd/containerd-maintainers please take another look at this. The implementation changed from the first time this PR was submitted.

Comment thread contrib/nvidia/nvidia.go
)

// WithGPUs adds NVIDIA gpu support to a container
func WithGPUs(opts ...Opts) oci.SpecOpts {
Member


Can we have Opts for specifying the path of the containerd binary?

Member Author


What would be the reason?

Member


The binary can be distributed with a custom name, e.g. docker-containerd, other-3rdparty-prefixed-containerd, or containerd-v1.2.

Member Author


ok, will update
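[Editor's note: for context, a client-side usage sketch of WithGPUs. The nvidia.WithDevices and nvidia.WithAllCapabilities option names follow contrib/nvidia as added by this PR, but treat them as illustrative; client, ctx, and image are assumed from context.]

// Generate the container spec with NVIDIA GPU 0 exposed.
container, err := client.NewContainer(ctx, "cuda",
	containerd.WithNewSpec(
		oci.WithImageConfig(image),
		nvidia.WithGPUs(nvidia.WithDevices(0), nvidia.WithAllCapabilities),
	),
)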

},
cli.IntFlag{
Name: "gpus",
Usage: "add gpus to the container",
Member


How about "add gpus to container, currently only nvidia gpus are supported" for clarity?

Member Author


No. Let's keep it generic so we don't have to change it back later.

Comment thread container_test.go
Args: []string{
"containerd",
"oci-hook", "--",
psPath, "--pid", "{{pid}}",


LGTM. Could we get OCI hooks supported in containerd ASAP?

Member Author


There is nothing stopping you today from adding hooks to the OCI runtime spec in containerd. This added command is just a helper that makes it a little bit easier on hook authors.

Member

@mikebrow mikebrow left a comment


LGTM
Nice!

Member

@estesp estesp left a comment


LGTM
