Add nvidia gpu support#2330
Conversation
Force-pushed 78ed5f3 to aa61428
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #2330      +/-   ##
==========================================
- Coverage   44.98%   44.85%   -0.14%
==========================================
  Files          92       93       +1
  Lines        9340     9368      +28
==========================================
  Hits         4202     4202
- Misses       4459     4487      +28
  Partials      679      679
```

Continue to review full report at Codecov.
Hey @crosbymichael, can you explain why support for GPUs is being added to containerd directly?
@vishh If you look at the implementation, it's not really a deep integration where you can say that GPU support was added to containerd directly; it's all about how the spec is generated. I think the existing implementations kinda skew your thinking around adding gpus to a container and make it a little more complex than it has to be. This is not their fault, they had to code around not having options in the Docker API, but at the lowest level it's just creating the right dev nodes and mounting the correct drivers into the container. I still have some work to do on app profiles, because that involves creating a custom profile for the container and mounting it in, but the implementation as a whole is very simple. You can see that I also added infiniband support using this same logic. All of this is done outside of the daemon, client side for spec generation, and we can use a standard runc or any runtime that supports device nodes and mounts. BTW, what complexities did you run into when you tried putting it in the kubelet?
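The client-side spec edit described here (create the right dev nodes, bind-mount the drivers) can be sketched roughly as below. The types are simplified stand-ins for the OCI runtime-spec structs, and the device numbers and host paths are typical illustrative values, not taken from this PR:

```go
package main

import "fmt"

// Simplified stand-ins for the OCI runtime-spec types
// (specs.LinuxDevice, specs.Mount, specs.Spec); illustrative only.
type Device struct {
	Path  string
	Type  string
	Major int64
	Minor int64
}

type Mount struct {
	Destination, Source, Type string
	Options                   []string
}

type Spec struct {
	Devices []Device
	Mounts  []Mount
}

// withNvidiaDevices shows the shape of the client-side change: add the
// NVIDIA dev nodes and bind-mount the driver libraries read-only.
// Major/minor 195 is the usual NVIDIA char-device major; the mount
// source path is hypothetical.
func withNvidiaDevices(s *Spec) {
	s.Devices = append(s.Devices,
		Device{Path: "/dev/nvidiactl", Type: "c", Major: 195, Minor: 255},
		Device{Path: "/dev/nvidia0", Type: "c", Major: 195, Minor: 0},
	)
	s.Mounts = append(s.Mounts, Mount{
		Destination: "/usr/lib/nvidia",
		Source:      "/usr/lib/nvidia-current", // hypothetical host path
		Type:        "bind",
		Options:     []string{"ro", "rbind"},
	})
}

func main() {
	var s Spec
	withNvidiaDevices(&s)
	fmt.Println(len(s.Devices), len(s.Mounts)) // prints: 2 1
}
```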
Ah, that's not very nice :)
And as I mentioned on the other thread, there will be more and more driver features to support:
I think that InfiniBand is even more complicated, but @3XX0 is the expert here. In the case of @vishh (GKE), they control the driver install too, so they can have a simplified approach that's tied to the way they install the driver. But for enabling all users, we can't make this assumption.
@flx42 I meant the hook approach for everything, not the underlying implementation. Sorry if I made it sound like your implementation was overly complex. When we start getting code into containerd or docker, all the configuration that happens inside the hooks shouldn't be needed anymore. I think splitting configuration vs runtime actions is a win for both projects. Again, really sorry that I didn't explain better. Most of the things you listed here I'm still working on, but they are almost all configuration-level issues that can be handled before the container is started. The only thing that would HAVE to run inside the container's rootfs is an ldcache refresh. runc already has options for masking proc entries, etc.
Except when it can't, like for the libnetwork hook, right?
Did you consider using our CLI too? We have `nvidia-container-cli`. You can also use its command-line options to only get a subset:
@crosbymichael
@vishh ya, perfect idea. If that does not work, it's probably a containerd bug and not a runc issue, because I remember us specifically allowing create + exec + delete as a workflow. I'll make an issue to follow up on that and get it working.
@mrunalp mentioned that we pivot root as part of create. I was originally surprised to hear that and then later learnt that the Spec does not define any specific behavior around that.
Oh, that is the case of getting extra binaries in on exec. We could look at execveat or think of another solution.
@crosbymichael sounds good, keep me in the loop. @flx42 what's your take on using exec vs binary hooks?
@vishh After chatting with @flx42 some tonight, I'm trying to identify what actually needs to be run inside the container vs what can happen on the host. I'll still have to implement some of this, but the only thing at the moment that has to run inside the container is reloading the ldcache (which we could probably do in a chroot).
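As an illustration of the ldcache-refresh step: glibc's `ldconfig` supports a `-r` flag that treats a given directory as the root, which is one way to rebuild the container's `ld.so.cache` from the host without entering a full chroot. This is a sketch of the idea, not code from the PR, and the rootfs path is hypothetical; the command is only constructed here, not run:

```go
package main

import (
	"fmt"
	"os/exec"
)

// refreshLdCache builds the command that rebuilds a container's
// ld.so.cache from the host. glibc ldconfig's -r flag uses the given
// directory as the root, avoiding a full chroot. The caller would
// invoke cmd.Run() with appropriate privileges.
func refreshLdCache(rootfs string) *exec.Cmd {
	return exec.Command("ldconfig", "-r", rootfs)
}

func main() {
	cmd := refreshLdCache("/run/bundle/rootfs") // hypothetical bundle path
	fmt.Println(cmd.Args)
}
```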
We should not focus too much on the current state of our driver, it might change. It's not impossible that in the future device isolation will not be through cgroups; it's currently an implementation detail. It could be done by echoing your own PID to a special file from our procfs folder, doing a special

I understand your approach, but that would mean it needs to be duplicated by each program building an OCI runtime spec, whereas a prestart hook is already a feature that is portable and must be supported by all OCI-compliant runtimes. When support of a feature is such a moving target, I think it does make sense to use prestart hooks: it's easier to upgrade a hook than the full runtime or the container supervisor. Supporting all of these features will add many dependencies to the runtime/supervisor; that's why we believe a hook is a good fit. I think you will be able to do fewer things after the
@flx42
What if create did not do a pivot root, and we had exec run with additional privileges similar to the hook?
Based on your comment in #2330 (comment), you are using hooks as a way to customize the host, runtime features like isolation, and the application/container sandbox itself. Ideally:
- host customization (loading drivers) would be handled out of band from container runtimes,
- runtime isolation customization (namespace or cgroups changes, if any) would be handled via runtime plugins, and
- container sandbox customization (ldconfig, CUDA version check, etc.) would be handled via exec plugins.
It seems possible to not use containerd/container_opts_unix.go lines 226 to 227 (at 1291671). Though I still need to understand how exec will work as an alternative to hooks, I guess a solution which is independent of the runtime is more portable and less complex for end users. The hooks approach looks more consistent across different runtimes.
@vishh actually it shouldn't be an issue. We can mount in To show what I mean,
Sorry, I don't get what you are saying here. Could you explain?
This allows many different commands to be used as OCI hooks. It allows these commands to template out different args and env vars so that normal commands can accept the OCI spec State payload over stdin. Signed-off-by: Michael Crosby <[email protected]>
I updated my original comment to be a bit more specific. @crosbymichael That sounds promising. I'd love to hear what @flx42 thinks about this approach, since I'd like to use such a model in k8s land as well if possible.
Thanks @vishh

Of course that's possible. We already have those pieces in the hook and in our library/CLI, and they can be disabled. And if everything is already set up correctly, our code path in the library will be a no-op.

Runtime plugin? You mean an OCI hook? :)

Ok for
@vishh the other option we have been discussing is using the low-level nvidia-container-cli. The general idea is that it uses the lowest-level libnvidia-container pieces. I agree with everything you said in your updated comment; I wish we could make it work that way, and that was my original goal, but I'm not sure anymore. We need more support.
I updated this PR to point to my branch with the nvidia-container-cli tool and a generic hook within containerd to support having generic CLI binaries as OCI hooks. It templates out information for args and env so that we can keep the nvidia support client-side configuration only and not require any daemon-level code.
```go
var ociHook = cli.Command{
	Name:  "oci-hook",
	Usage: "provides a base for OCI runtime hooks that allow arguments to be templated.",
```

```go
Args: []string{
	"containerd",
	"oci-hook", "--",
	psPath, "--pid", "{{pid}}",
```
Pretty neat, I like the idea :)
```go
"bundle":     t.bundle,
"rootfs":     t.rootfs,
"pid":        t.pid,
"annotation": t.annotation,
```
Shouldn't we pass the status too?
Semi-related: NVIDIA/nvidia-container-runtime#14
Useful to make sure the hook is called on the right container state.
You might want to reference our README: https://github.com/nvidia/nvidia-container-runtime#supported-driver-capabilities
nit: our official vocabulary is "UUID", even if it's actually a `GPU-` prefix plus a UUID.
This adds nvidia gpu support via the libnvidia-container project and `nvidia-container-cli`. Signed-off-by: Michael Crosby <[email protected]>
```go
var (
	ctx  = newTemplateContext(state)
	args = []string(context.Args())
	env  = os.Environ()
```
Thought: should we also allow the environment to be modified?

For instance: `containerd oci-hook --env ROOTFS={{rootfs}} -- myhook`

Edit: ah wait, I didn't see what's below. But I don't quite get how it works for the env; maybe it's already possible.
It does allow it: if you look a couple of lines below, we apply the template the same way as for args.
Ok, I get it, you have to pass the environment to the whole hook wrapper and the templating will be applied, sorry for the noise :)
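A minimal illustration of what "templating args and env the same way" means. This is an illustrative sketch using plain string substitution, not the PR's actual implementation (which may use Go's text/template):

```go
package main

import (
	"fmt"
	"strings"
)

// render applies simple {{field}} substitutions to a slice of strings,
// so the same pass can template both args and env entries.
func render(in []string, fields map[string]string) []string {
	out := make([]string, len(in))
	for i, s := range in {
		for k, v := range fields {
			s = strings.ReplaceAll(s, "{{"+k+"}}", v)
		}
		out[i] = s
	}
	return out
}

func main() {
	fields := map[string]string{"pid": "1234", "rootfs": "/run/bundle/rootfs"}
	args := render([]string{"--pid", "{{pid}}"}, fields)
	env := render([]string{"ROOTFS={{rootfs}}"}, fields)
	fmt.Println(args, env) // prints: [--pid 1234] [ROOTFS=/run/bundle/rootfs]
}
```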
@containerd/containerd-maintainers please take another look at this. The implementation changed from the first time this PR was submitted.
```go
// WithGPUs adds NVIDIA gpu support to a container
func WithGPUs(opts ...Opts) oci.SpecOpts {
```
Can we have an Opt for specifying the path of the containerd binary?
What would be the reason?
The binary can be distributed with a custom name, e.g. docker-containerd, other-3rdparty-prefixed-containerd, or containerd-v1.2.
```go
},
cli.IntFlag{
	Name:  "gpus",
	Usage: "add gpus to the container",
```
How about "add gpus to container, currently only nvidia gpus are supported" for clarity?
No. Let's keep it generic so we don't have to change it back later.
```go
Args: []string{
	"containerd",
	"oci-hook", "--",
	psPath, "--pid", "{{pid}}",
```
LGTM. Could we get OCI hooks supported in containerd ASAP?
There is nothing stopping you today from adding hooks to the OCI runtime spec in containerd. This added function is just a helper that makes it a little bit easier on hook authors.
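A sketch of what "adding hooks to the OCI runtime spec" yourself can look like client side. The types below are simplified stand-ins for the runtime-spec hook structures (`specs.Hook` and the spec's hooks section), and the paths and names are illustrative:

```go
package main

import "fmt"

// Simplified stand-ins for specs.Hook and the spec's hooks section.
type hook struct {
	Path string
	Args []string
}

type spec struct {
	Prestart []hook
}

// withPrestartHook mirrors what a spec option can already do today:
// append a prestart hook directly to the runtime spec, with no daemon
// changes needed.
func withPrestartHook(s *spec, path string, args ...string) {
	s.Prestart = append(s.Prestart, hook{Path: path, Args: args})
}

func main() {
	var s spec
	// Hypothetical hook binary path and args.
	withPrestartHook(&s, "/usr/local/bin/myhook", "myhook", "--flag")
	fmt.Println(len(s.Prestart), s.Prestart[0].Path)
}
```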
Update:

This PR now uses the nvidia-container-cli tool and a generic hook added to containerd.

Usage:

Signed-off-by: Michael Crosby [email protected]