Use cuda_compat drivers when available #267247
Conversation
I think propagating a different version of addOpenGLRunpath might cause conflicts in derivations that explicitly consume their own version.
Initially, I thought that all we'd need is patchelf --add-needed with the absolute path to ${cuda_compat}/lib/libcuda.so in https://github.com/NixOS/nixpkgs/blob/bb142a6838c823a2cd8235e1d0dd3641e7dcfd63/pkgs/development/compilers/cudatoolkit/redist/build-cuda-redist-package.nix. This would ensure that anybody trying to load, for example, libcudart.so ends up loading the cuda-compat driver first.
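A minimal sketch of that first idea, assuming a `cuda_compat` package argument is in scope in the redist builder and `patchelf` is available at build time (the `compat` subdirectory under the cuda_compat output is also an assumption):

```nix
# Hypothetical fragment for build-cuda-redist-package.nix: add the
# compat driver as a DT_NEEDED entry of every shared library, so that
# it is resolved before any libcuda.so found on the system at run time.
postFixup = ''
  for lib in "$out"/lib/lib*.so*; do
    patchelf --add-needed "${cuda_compat}/compat/libcuda.so" "$lib"
  done
'';
```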
Now I think that might be insufficient (how do we know people don't try to dlopen libcuda directly?) and you're right about overriding addOpenGLRunpath. Except we might want to do that on an even more global scale, i.e. we might want to change pkgs.addOpenGLRunpath whenever our nixpkgs instantiation happens to target an nvidia jetson (e.g. config.cudaSupport && hostPlatform.system == "aarch64-linux", but also maybe a special flag)
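As a sketch, that global switch could be an overlay along these lines (the predicate and the replacement hook are assumptions, not a settled design):

```nix
# Hypothetical overlay: swap out pkgs.addOpenGLRunpath whenever this
# nixpkgs instantiation targets an NVIDIA Jetson.
final: prev:
let
  targetsJetson =
    (prev.config.cudaSupport or false)
    && prev.stdenv.hostPlatform.system == "aarch64-linux";
in
{
  addOpenGLRunpath =
    if targetsJetson
    # add-cuda-compat-runpath.nix is a placeholder name for the Jetson hook
    then final.callPackage ./add-cuda-compat-runpath.nix { }
    else prev.addOpenGLRunpath;
}
```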
All in all, I think we need to ponder the available options a bit more. Thank you for starting this!
Wouldn't a simple test that cudaPackages.cuda_compat exists, as we do here, be sufficient? Basically, if there's a cuda_compat redist package available, we use it. Maybe we can add a test to see if we're not cross-compiling.
I personally think changing pkgs.addOpenGLRunpath would make sense, as what we do is mostly to "extend" the usual path /run/opengl-driver to find cuda_compat's libcuda.so. But we noticed that some packages are using e.g. the attribute addOpenGLRunpath.driverLink, so we must be careful to not change the interface of the derivation.
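For illustration, "extending" the hook while keeping its interface stable might look like this (the `extraDriverPaths` substitution is a hypothetical name, not an existing interface):

```nix
# Hypothetical variant of addOpenGLRunpath: the setup hook would add
# both the usual impure driver path and cuda_compat's directory, while
# the driverLink attribute consumed by other packages stays unchanged.
addOpenGLRunpath' = makeSetupHook {
  name = "add-opengl-runpath";
  substitutions = {
    driverLink = "/run/opengl-driver";                        # unchanged interface
    extraDriverPaths = "${cudaPackages.cuda_compat}/compat";  # assumed name and path
  };
} ./setup-hook.sh;
```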
> Wouldn't a simple test that cudaPackages.cuda_compat exists, as we do here, be sufficient?
Yes and no, e.g. cf. #266475. I'd say, we probably don't want the definition of addOpenGLRunpath to depend on the internals of the cudaPackages set, but it's OK if it depends on a top-level attribute like config. I haven't thought about this much though
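Concretely, the guard would key off a top-level flag rather than the package set; something like the following, where the exact flag is purely illustrative:

```nix
# Hypothetical: decide at the top level, from config alone, whether the
# compat driver should be wired in; cudaPackages is never consulted, so
# addOpenGLRunpath's definition stays decoupled from the CUDA package set.
useCudaCompat =
  (config.cudaSupport or false)
  && stdenv.hostPlatform.system == "aarch64-linux";
```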
> some packages are using e.g. the attribute addOpenGLRunpath.driverLink, so we must be careful to not change the interface of the derivation.
Yes
Actually, no, we do not want […]. What we really want is […]. Potential blockers: […]
@yannham I just rebased the branch on the latest staging to fix the merge conflicts, so you'll want to rebase as well.
*Force-pushed from 7f232fe to ac5faa6*
There's a bunch of places, currently in […]
@SomeoneSerge thanks a lot for your input! I'm not drawing any conclusion here, and you guys are the experts, but I wonder: how is the OpenGL driver different from cuda_compat? Because I believe your recommendation/alternative way of reaching all CUDA packages could apply to linking to the OpenGL driver as well. That is, instead of using […], I can only guess, but I suppose it wasn't done this way because: […]

Are those assumptions somehow remotely correct? And if they are, is there anything differentiating […]?

As a side question, if you think patching build inputs is still the way to go, it sounds like a bigger change: if this PR works as it is with a very minimal change, would that be reasonable as a first step, and then open an issue (or whatever tracking method) to use the right approach as a follow-up?
In fact, what I'm saying might not make a lot of sense, because I thought […]

Also, there is a kind of bootstrapping issue, because […]
*Force-pushed from 904f0b6 to ac5faa6*
Indeed, they are similar, which is why normally we deploy both impurely. The exact restrictions that OpenGL/libgbm imposes are different. The […]

There will be a few programs that use […]
Yes, it does seem like we have to deploy […]

As suggested before, I would try reverting the […]
So I tried the proposed approach, but it doesn't work (the binary ignores cuda_compat and shows the actual version of the driver, 11.4). I can't see any […]

I personally don't see an obvious way of enabling cuda_compat without doing something impure, such as what we do for OpenGL drivers...
*Force-pushed from 30b2433 to 85adf60*
The last version is attempting to do what @SomeoneSerge suggested, by duplicating the logic of […]

Unfortunately, I get the following error: […]
Additional update: the build works, and the binary runs without any […]
I used […]
The error looks like cmake trying to enable_language(cuda) and failing some sanity checks... I think I had similar issues when nvcc couldn't find cudart. Could you gist the full log?

PS: sorry, I'm slow to respond lately
One more thing: addOpenGLRunpath assigns the impure path a higher priority, so it's important that our postFixup happens last or we might load the old driver. We also need to ensure that we're prepending the compat driver to the runpath, not appending. Try inspecting `patchelf --print-rpath` on the resulting binaries
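A sketch of a fixup satisfying both constraints, running after addOpenGLRunpath and prepending rather than appending (the `compat` subdirectory and phase wiring are assumptions):

```nix
# Hypothetical postFixup: runs after addOpenGLRunpath and puts the
# compat driver *in front of* the existing runpath, so it wins over
# the impure /run/opengl-driver entry.
postFixup = ''
  for f in "$out"/bin/* "$out"/lib/lib*.so*; do
    old="$(patchelf --print-rpath "$f")"
    # ''${old:+:} adds the separating colon only when a runpath existed,
    # avoiding a dangling empty entry.
    patchelf --set-rpath "${cuda_compat}/compat''${old:+:}$old" "$f"
  done
'';
```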
Here is the full CMake log. I also tried with […]

I also attached the environment (serialized with […]). As mentioned above, I tried to use the exact exported […]

Besides trying to […]
Damn! I just realized that we probably can't have the compat libcuda.so in DT_NEEDED, because then we couldn't, e.g., load the reverse dependencies inside the nix sandbox. This is because libcuda from compat has the libnvrm* libraries in its own DT_NEEDED, but they're impure. E.g. we couldn't import torch if it loads libcublas, and that loads libcuda, and that fails to find libnvrm*.

Unless that's all wrong, I guess we want to just modify the runpath...
Another thought: in jetpack-nixos, […]. Could we do the same thing with the one from the compat library, and patch its libnvrm* dependencies directly with the right paths? It might be made more complicated by the fact that the […]

Another solution, fully on the jetpack-nixos side: we can control what goes in the […]:

```diff
  hardware.opengl.package = pkgs.nvidia-jetpack.l4t-3d-core;
  hardware.opengl.extraPackages = with pkgs.nvidia-jetpack; [
+   # l4t-core provides the libnvrm_gpu.so and libnvrm_mem.so, which are
+   # otherwise missing when using cuda_compat for the time being. Ideally,
+   # they should probably be directly set in a DT_NEEDED entry, as is the
+   # case currently with the `libcuda.so` provided by l4t-cuda
+   l4t-core
    l4t-cuda
    l4t-nvsci # cuda may use nvsci
    l4t-gbm
    l4t-wayland
```

I wonder if doing something as dumb as including […]
> We'd have to get hold of the jetpack
Indeed. If we deploy a jetpack-nixos, it probably only makes sense to deploy the compat cuda driver, since it offers pretty much a superset of the features in l4t-cuda's libcuda.so. However, I think we'd still want to (try to) link cuda_compat directly to the nixpkgs apps, because we'd want them to work on vanilla/Ubuntu Jetsons as well
*Force-pushed from 38f913d to 9004a3c*
*Force-pushed from 9004a3c to 2d3d5ab*
I rebased; now I just need to add a guard so that this hook isn't added when we're not on a Jetson
Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for ubuntu to also prefer the compat library over the normal one
Here's how it goes (well, it's all assumptions): we always want to use the newest possible cuda_compat. We're going to link one directly through DT_RUNPATH at build time, which we expect to be newer than libcuda in the jetpack Ubuntu installation. However, a year or a few down the line, we might want to reuse a package built from an old nixpkgs revision. It is conceivable that we'd be running a much newer kernel module, one for which the cuda_compat in the runpath would be too old. Then we just deploy a newer cuda_compat on our jetpack-nixos and make the app use it. We could do that via LD_LIBRARY_PATH/LD_PRELOAD, or we could try to give the impure runpath higher priority. The latter might hypothetically break because it could rely on a newer libc/libstdc++, but in practice nvidia seems to always link old versions. LD_* are always an option
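In module terms, that later escape hatch could be as small as the following (the `cuda_compat` attribute under `nvidia-jetpack` is an assumed name):

```nix
# Hypothetical jetpack-nixos configuration: ship a newer compat libcuda
# through the impure /run/opengl-driver path, so binaries built from an
# older nixpkgs pick it up without a rebuild; LD_LIBRARY_PATH/LD_PRELOAD
# remain available as per-process overrides.
hardware.opengl.extraPackages = [
  pkgs.nvidia-jetpack.cuda_compat
];
```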
All in all, I think I agree with you. The driver is provided impurely in /run/opengl-driver, and as cuda_compat is supposed to be a drop-in replacement for the driver, I don't see a good reason why it should be treated differently.
That being said, I think right now jetpack-nixos is lagging behind (based on NixOS 22.11), so this PR would still be a temporary (TM) way to get cuda_compat working there without waiting for a newer jetpack? But, I would be tempted to get rid of this once #256324 and associated PRs all make it into a NixOS release, and jetpack-nixos supports that release.
> this PR would still be a temporary (TM) way to
I wonder if we still misunderstood each other. To be clear, I hope that this PR's changes will stay and that we won't have to remove them: cuda_compat would essentially work the same way as ROCm drivers do (which, afaiu, we link directly, and which we don't need tools like nvidia-docker2 for). I only wonder about special cases where the user might just hypothetically load a different libcuda
> Just a thought: maybe we don't want the directly-linked cuda_compat to take precedence over the impure path. Rather, we want the impure path to also contain a cuda_compat. I.e. we want jetpack-nixos to deploy the compat libcuda, and we want the wrappers for ubuntu to also prefer the compat library over the normal one
I interpreted this, at least on jetpack-nixos, as "let's put cuda_compat into /run/opengl-driver because it's more flexible: jetpack-nixos is now responsible for selecting cuda_compat or the normal drivers, and it's easy to dynamically change by toggling what we put in /run/opengl-driver without resorting to LD_". In the latter case, would this PR still be needed? It seems to me that the answer is no, but maybe I'm missing something
*Force-pushed from 2d3d5ab to b685883*
Some nvidia devices, such as the Jetson family, support the Nvidia compatibility package (cuda_compat), which allows running executables built against a higher CUDA major version on a system with an older CUDA driver. On such platforms, the consensus among CUDA maintainers is that there is no downside in always enabling it by default. This commit links to the relevant cuda_compat shared libraries by patching the CUDA core packages' runpaths when cuda_compat is available, in the same way as we currently do for OpenGL drivers.
*Force-pushed from 188a95b to d6c198a*
This branch has been rebased, fixed, and tested on Jetson. It's good to go.
Ofborg mentions […]
Argh. Thanks for noticing, @SomeoneSerge. Maybe the guard […]
Please @ConnorBaker, do not merge such PRs when ofborg-eval has not finished. Maybe the breakage was avoidable on master here...
This pull request has been mentioned on NixOS Discourse. There might be relevant details there: […]
Description of changes
Some nvidia devices, such as the Jetson family, support the Nvidia compatibility package (`cudaPackages.cuda_compat`), which allows running executables built against a higher CUDA major version on a system with an older CUDA driver. On such platforms, the consensus among CUDA maintainers is that there is no downside in always enabling it by default.

This PR links to the relevant cuda_compat shared libraries by patching the executables' runpaths when cuda_compat is available, in the same way as we already do for OpenGL drivers.
We initially tried to upgrade and reuse `add-patch-opengl-driver`, because the script does exactly what we want, except that we would like to add several paths, and not just one, to the `runpath` of the concerned executables.

However, this derivation is used in many places (both the result, but also attrs such as `driverLink`), making it harder to reuse or to make more generic without causing a mass rebuild and without breaking existing packages. Instead, we just override the `driverLink` attr, which is small, localized, and what we want.

TODO before merging
Things done

- Is sandboxing enabled in `nix.conf`? (See Nix manual)
  - `sandbox = relaxed`
  - `sandbox = true`
- Tested with `nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"`. Note: all changes have to be committed, also see nixpkgs-review usage
- Tested basic functionality of built binary files (usually in `./result/bin/`)