rocmPackages: 6.0.2 -> 6.3.3, and various ROCm build fixes and new packages#367695

Merged
prusnak merged 11 commits into NixOS:staging from LunNova:rocm-update
Mar 24, 2025

Conversation

@LunNova
Member

@LunNova LunNova commented Dec 23, 2024

Fixes #337159
Fixes #383836
Fixes #379354

Bump to 6.3.3 for rocmPackages_6 package set and associated updates in packages which depend on changed or newly introduced ROCm packages.

Upstream PRs/issues Raised

TODO List

  • Fix rocmcxx GCC prefix
  • Contemplate trying to make a normal Nix style CC wrapper work again instead of this sysroot style mess and then don't because I spent 2 weeks on it already (please someone fix this)
  • Expand GPU targets list for *blas libraries
  • Maybe? expand GPU targets list for CK
    • CK seems to be ~untested on anything other than MI200/MI300 series so might be safer not to
    • Trying this out we can reduce the list if breakage is reported.
  • Hack cuda backend out of triton 3.2 so we can build torch for ROCm without deps on unfree cudart
  • Get compression of offload and msgpack working for hipblaslt. 10GB derivation is not ok.
  • Remove debug info / dontStrip settings
  • Clean up the triton mess in rocm-modules/6/default.nix
  • Turn traces into TODO items in this list
  • Upstream patches
  • Resurrect binary compatibility patches for new COMGR (gfx1036 -> uses gfx1030 if 1036 not available)
    • Confirm patches are working correctly with "new" unbundler path which we have enabled
  • Make use of working LLVM packages .override to simplify LLVM (not done: overrideScope isn't present and is needed)
  • Import minimal set of pytorch changes to build with rocm 6.3 from https://github.com/LunNova/ml.nix/blob/main/pytorch-rocm.nix
  • Allow better build parallelism by creating -minimal versions of some of the huge packages built for no gfx arches (too difficult, not doing in this PR)
  • Clean up hacks related to build parallelism
  • Document clang-ocl, rocm-thunk going away.
  • Fix migraph packages
  • Remove Tensile parallelism patches
  • Convert in-tree patches to fetchpatch usage where possible

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 25.05 Release Notes (or backporting 24.11 and 25.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@github-actions github-actions bot added the 6.topic: rocm ROCm is an Advanced Micro Devices software stack for graphics processing unit programming. label Dec 23, 2024
@LunNova LunNova mentioned this pull request Dec 23, 2024
34 tasks
@github-actions github-actions bot added 10.rebuild-darwin: 11-100 This PR causes between 11 and 100 packages to rebuild on Darwin. 10.rebuild-linux: 101-500 This PR causes between 101 and 500 packages to rebuild on Linux. labels Dec 23, 2024
@LunNova LunNova force-pushed the rocm-update branch 2 times, most recently from 76e05f1 to c2abb37 Compare December 23, 2024 18:06
@Shawn8901
Contributor

I don't know much about the ROCm stack; I'm mainly here because I use btop with ROCm support enabled.

I get the following error when compiling rocm-smi on a recent nixpkgs master with this PR applied.

rocm-smi> [ 24%] Building CXX object oam/CMakeFiles/oam.dir/__/src/rocm_smi_power_mon.cc.o
rocm-smi> [ 27%] Building CXX object oam/CMakeFiles/oam.dir/__/src/rocm_smi_utils.cc.o
rocm-smi> clang++clang++: : warning: warning: -Wl,-z,noexecstack: 'linker' input unused [-Wunused-command-line-argument]-Wl,-z,noexecstack: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> 
rocm-smi> clang++clang++: : warning: warning: -Wl,-znoexecheap: 'linker' input unused [-Wunused-command-line-argument]-Wl,-znoexecheap: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> 
rocm-smi> clang++clang++: : warning: warning: -Wl,-z,relro: 'linker' input unused [-Wunused-command-line-argument]-Wl,-z,relro: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> 
rocm-smi> clang++clang++: : warning: warning: -Wl,-z,now: 'linker' input unused [-Wunused-command-line-argument]-Wl,-z,now: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> 
rocm-smi> clang++: warning: -Wl,-z,noexecstack: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> clang++: warning: -Wl,-znoexecheap: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> clang++: warning: -Wl,-z,relro: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> clang++: warning: -Wl,-z,now: 'linker' input unused [-Wunused-command-line-argument]
rocm-smi> warning: warning: warning: unknown warning option '-Wtrampolines' [-Wunknown-warning-option]unknown warning option '-Wtrampolines' [-Wunknown-warning-option]unknown warning option '-Wtrampolines' [-Wunknown-warning-option]
rocm-smi> 
rocm-smi> 
rocm-smi> /build/source/src/rocm_smi_power_mon.cc:44:10: fatal error: 'cassert' file not found
rocm-smi>    44 | #include <cassert>
rocm-smi>       |          ^~~~~~~~~
rocm-smi> /build/source/src/rocm_smi_monitor.cc:46:10: fatal error: 'algorithm' file not found
rocm-smi>    46 | #include <algorithm>
rocm-smi>       |          ^~~~~~~~~~~
rocm-smi> /build/source/src/rocm_smi_utils.cc:54:10: fatal error: 'algorithm' file not found
rocm-smi>    54 | #include <algorithm>
rocm-smi>       |          ^~~~~~~~~~~
rocm-smi> 1 warning and 1 error generated.
rocm-smi> make[2]: *** [oam/CMakeFiles/oam.dir/build.make:121: oam/CMakeFiles/oam.dir/__/src/rocm_smi_power_mon.cc.o] Error 1
rocm-smi> 1 warning and 1 error generated.
rocm-smi> make[2]: *** [oam/CMakeFiles/oam.dir/build.make:107: oam/CMakeFiles/oam.dir/__/src/rocm_smi_monitor.cc.o] Error 1
rocm-smi> 1 warning and 1 error generated.
rocm-smi> make[2]: *** [oam/CMakeFiles/oam.dir/build.make:135: oam/CMakeFiles/oam.dir/__/src/rocm_smi_utils.cc.o] Error 1
rocm-smi> make[1]: *** [CMakeFiles/Makefile2:226: oam/CMakeFiles/oam.dir/all] Error 2
rocm-smi> make: *** [Makefile:156: all] Error 2

@LunNova
Member Author

LunNova commented Dec 23, 2024

@Shawn8901 Should be fixed now, was broken when I first opened the PR.

@GZGavinZhao
Contributor

One thing that I've wanted to do for a long time is to completely remove the ROCm LLVM as a stdenv, which should solve a lot of these weird compilation errors. Solus's ROCm stack does this, and we compile all non-HIP code with GCC. This way, the entire ROCm LLVM can be compacted into a single derivation, and the complexity of packaging and updating ROCm LLVM is drastically reduced.

That is, you should be able to use just the default stdenv with GCC to compile non-HIP code and tell CMake/HIPCC to use the ROCm LLVM only when compiling HIP code. You can achieve this entirely through environment variables. It doesn't make sense that because a portion of the codebase contains HIP code, any C/C++ in the codebase needs to be compiled with ROCm LLVM's C compiler.
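A minimal sketch of what this could look like in a Nix derivation, assuming the package uses CMake's HIP language support. The package name, source path, GPU architecture, and the `rocmPackages.llvm.clang` / `rocmPackages.clr` attribute paths are assumptions for illustration, not something taken from this PR. `HIPCXX` is the environment variable CMake reads to choose the HIP compiler, and `HIP_CLANG_PATH` / `ROCM_PATH` are read by the hipcc wrapper.

```nix
# Sketch only: names marked "hypothetical" or "assumed" are not from this PR.
{ pkgs ? import <nixpkgs> { } }:

let
  rocm = pkgs.rocmPackages;
in
pkgs.stdenv.mkDerivation {
  pname = "hip-example"; # hypothetical package
  version = "0.0.0";
  src = ./.; # hypothetical source tree containing a CMakeLists.txt

  nativeBuildInputs = [ pkgs.cmake ];
  buildInputs = [ rocm.clr ]; # assumed attribute path for the HIP runtime

  # Host C/C++ code keeps using the default stdenv compiler (GCC on Linux).
  # Only the HIP compiler is pointed at the ROCm toolchain.
  env = {
    # Read by CMake on first configure to select CMAKE_HIP_COMPILER.
    HIPCXX = "${rocm.llvm.clang}/bin/clang++"; # assumed attribute path
    # Read by the hipcc wrapper to locate clang and the ROCm install.
    HIP_CLANG_PATH = "${rocm.llvm.clang}/bin";
    ROCM_PATH = "${rocm.clr}";
  };

  cmakeFlags = [
    # Example target; pick the architectures you actually need.
    "-DCMAKE_HIP_ARCHITECTURES=gfx1030"
  ];
}
```

Projects that set `CMAKE_CXX_COMPILER` to hipcc directly, instead of enabling HIP as a separate CMake language, would likely still need patches or extra flags.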

@GZGavinZhao
Contributor

Ok I just noticed the "Contemplate trying to make a normal Nix style CC wrapper work again" section, so it seems like you've already experienced the pain of the ROCm LLVM 😅 I will give my idea a try in the next few days and get back.

@LunNova
Member Author

LunNova commented Dec 23, 2024

It looks like upstream are moving away from a separate hipcc and using clang (now amdclang++ or amdclang) for the entire build with -x hip --offload-arch ... as extra args for HIP files.

Maintaining a separate HIP only compiler might require maintaining significant cmakefile patches to get it to be used, but if you can work out a way to do this that isn't maintenance hell that's great.

@GZGavinZhao
Contributor

Resurrect binary compatibility patches for new COMGR

Please see https://github.com/GZGavinZhao/rocm-llvm-project/commits/solus-rocm-6.2.x for the patches and https://lists.debian.org/debian-ai/2024/12/msg00042.html for more details. I hope they apply cleanly on v6.3, but if not, I think the changes are simple enough to rewrite manually.

If you need patches for other components, please see https://github.com/GZGavinZhao/<component-name>/commits/solus-rocm-6.2.x. Every patch there is used by Solus's ROCm 6.2 stack. For example, for rocm-clr, that would be https://github.com/GZGavinZhao/clr/commits/solus-rocm-6.2.x. IIRC the components that require ISA compatibility patches are Comgr, clr, and rocBLAS.

@GZGavinZhao
Contributor

Maintaining a separate HIP only compiler might require maintaining significant cmakefile patches to get it to be used, but if you can work out a way to do this that isn't maintenance hell that's great.

Solus does this and we didn't have to use any patches. Most of the work done was figuring out the environment variables to tell CMake and/or HIPCC what our intended HIP compiler is. The only thing I'm worrying about is locating sysroots due to non-standard installation prefix, but other than that Solus's experience shows that this is definitely doable.

@github-actions github-actions bot added 6.topic: python Python is a high-level, general-purpose programming language. 10.rebuild-darwin: 101-500 This PR causes between 101 and 500 packages to rebuild on Darwin. and removed 10.rebuild-darwin: 11-100 This PR causes between 11 and 100 packages to rebuild on Darwin. labels Dec 24, 2024
@LunNova LunNova force-pushed the rocm-update branch 2 times, most recently from 22f00e1 to c05b8cb Compare December 24, 2024 17:21
@henryrgithub

henryrgithub commented Dec 24, 2024

I'm having trouble getting this to build. I get

Failed Tests (2):
MLIR :: Dialect/SPIRV/IR/availability.mlir
MLIR :: Dialect/SPIRV/IR/target-env.mlir

during the triton-llvm-19.1.0-rc1 test phase. RX 6800 XT, x86_64-linux on NixOS. No overlays or config or anything, just trying to create a devshell with python312Packages.torch. Here is a (nearly) minimal reproducible flake:

{
  description = "Rocm 6 py312 torch";

  inputs.nixpkgs.url = "github:LunNova/nixpkgs/rocm-update";

  outputs = { self, nixpkgs }:
    let
      supportedSystems = [ "x86_64-linux" ];
      forEachSupportedSystem = f: nixpkgs.lib.genAttrs supportedSystems (system: f {
        pkgs = import nixpkgs {
          inherit system;
        };
      });
    in
    {
      devShells = forEachSupportedSystem ({ pkgs }: {
        default =
        let
          pythonPackages = pkgs.python312Packages;
        in
        pkgs.mkShell {
          venvDir = ".venv";
          packages = with pythonPackages; [
            torch
          ];
        };
      });
    };
}

@alapshin
Contributor

I tried to use this PR via overlay

rocmPackages = inputs.nixpkgs-rocm.legacyPackages."${final.system}".rocmPackages;

But got a collision when building ollama

    ollama = {
      enable = true;
      acceleration = "rocm";
      rocmOverrideGfx = "10.3.0";
    };

error: builder for '/nix/store/56lg7l3hbwvyf9b026xdyk0zrf24fyw8-rocm-path.drv' failed with exit code 25;
       last 1 log lines:
       > error: collision between `/nix/store/iyc18np8p0c4abw3xq0h65f0qil3qjp7-rocm-clang/llvm/bin/ld.lld' and `/nix/store/wxpnkzvsydgnybkknkipjcdp0pd1m05i-clr-6.3.1/llvm/bin/ld.lld'

But overall it seems to be building without errors. When not using ollama I was able to rebuild my system without errors.
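For anyone wanting to reproduce this setup, the two fragments above can be combined into a single NixOS module, roughly as sketched below. This assumes a flake input named `nixpkgs-rocm` pointing at the PR branch and that `inputs` is passed to modules via `specialArgs`; both are assumptions about the surrounding configuration.

```nix
# NixOS module sketch. Assumes `inputs.nixpkgs-rocm` is a flake input such as
# "github:LunNova/nixpkgs/rocm-update" and that `inputs` reaches modules
# through specialArgs (an assumption about the surrounding flake).
{ inputs, ... }:
{
  nixpkgs.overlays = [
    (final: prev: {
      # Swap the whole ROCm package set for the one from the PR branch.
      rocmPackages = inputs.nixpkgs-rocm.legacyPackages.${final.system}.rocmPackages;
    })
  ];

  services.ollama = {
    enable = true;
    acceleration = "rocm";
    # Report the GPU as gfx1030 (10.3.0) to the ROCm runtime.
    rocmOverrideGfx = "10.3.0";
  };
}
```

Note that taking `rocmPackages` from a different nixpkgs revision mixes two package sets, so dependencies shared between them may end up built twice.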

@GZGavinZhao
Contributor

I got a separate HIP compiler working and successfully compiled rocsparse (because it's the quickest one to build 😅). Currently, this compiler still uses the GCC toolchains (e.g. glibc and libstdc++). Ideally I want it to use the LLVM toolchains (llvm-libc, compiler-rt, and libc++) so it's fully self-contained and eliminates the chance of clashing with whatever stdenv derivations choose to use, so I'm working on that. This should hopefully resolve the llama-cpp clashing error @alapshin mentioned.

@LunNova
Member Author

LunNova commented Dec 26, 2024

ROCm's standard toolchain is clang + GNU libs, including libstdc++. A few packages don't compile with libc++ without patches:
  • hipblaslt will fail with llvm/llvm-project#98734; no PR has been raised to work around it.
  • rocmlir will fail due to a missing const on operator<; a PR has been raised to fix it: ROCm/rocMLIR#1708.

The clashing error occurs because I added a /llvm link to clr for use by other ROCm packages, and ollama already adds one to its ROCm env internally. It can be resolved by dropping the /llvm link from ollama.
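As a rough illustration of the fix, a joined ROCm environment could rely on the /llvm link that clr now carries and skip adding a second one. This is a sketch under assumptions, not ollama's actual expression: the list of libraries is illustrative.

```nix
# Sketch only: not the real ollama package expression.
{ pkgs ? import <nixpkgs> { } }:

pkgs.symlinkJoin {
  name = "rocm-path";
  paths = with pkgs.rocmPackages; [
    clr # per the comment above, this already provides /llvm
    rocblas # illustrative library selection
    hipblas
  ];
  # Deliberately no extra entry that also creates /llvm here: two inputs
  # providing /llvm/bin/ld.lld is what triggers the collision error.
}
```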

@LunNova
Member Author

LunNova commented Dec 26, 2024

@LunNova
Member Author

LunNova commented Dec 26, 2024

Added exact GPU targets as pkgs.rocmPackages_6.gfx908, gfx1030, etc.
It's possible to replace rocmPackages_6 with one of these more specific package sets in an overlay.
This should help reduce rebuild times while testing changes or when targeting a specific device.
I haven't worked out a way to compose a rocblas that supports all gfx targets from multiple rocblases that each support different ones, so there isn't anything fancy going on with these: they just set GPU_TARGETS or equivalent.
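A minimal sketch of such an overlay, assuming the `gfx1030` attribute exists under `rocmPackages_6` as described above and that `rocmPackages` is an alias that follows `rocmPackages_6`:

```nix
# Sketch: pin the ROCm package set to a single GPU target via an overlay.
let
  gfxOverlay = final: prev: {
    # Replace the general-purpose set with the gfx1030-only variant.
    rocmPackages_6 = prev.rocmPackages_6.gfx1030;
  };
  pkgs = import <nixpkgs> { overlays = [ gfxOverlay ]; };
in
# Building this should only compile rocblas kernels for gfx1030.
pkgs.rocmPackages_6.rocblas
```

Using `prev` on the right-hand side avoids the overlay referring to its own result.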

@aviallon
Contributor

Thank you everyone for your awesome work!
Do we know when this will get into master?

@arunoruto
Contributor

Thank you everyone for your awesome work! Do we know when this will get into master?

You can track it here: https://nixpk.gs/pr-tracker.html?pr=367695

puyral added a commit to puyral/nixomagus that referenced this pull request Mar 29, 2025
This will really take effect once
NixOS/nixpkgs#367695 lands in unstable
@aviallon
Contributor

Now in nixos-unstable-small !!!

@pshirshov
Contributor

Ergh, zluda is broken again:

error: builder for '/nix/store/xgrg5navl2lcs6mjkpym2539svafx859-zluda-4-unstable-2025-01-28.drv' failed with exit code 101;
       last 25 log lines:
       >
       >
       >     No build type selected.  You need to pass -DCMAKE_BUILD_TYPE=<type> in
       >     order to configure LLVM.
       >
       >     Available options are:
       >
       >       * -DCMAKE_BUILD_TYPE=Release - For an optimized build with no assertions or debug info.
       >       * -DCMAKE_BUILD_TYPE=Debug - For an unoptimized build with assertions and debug info.
       >       * -DCMAKE_BUILD_TYPE=RelWithDebInfo - For an optimized build with no assertions but with debug info.
       >       * -DCMAKE_BUILD_TYPE=MinSizeRel - For a build optimized for size instead of speed.
       >
       >     Learn more about these options in our documentation at
       >     https://llvm.org/docs/CMake.html#cmake-build-type
       >
       >
       >
       >
       >   thread 'main' panicked at /build/zluda-4-unstable-2025-01-28-vendor/cmake-0.1.51/src/lib.rs:1100:5:
       >
       >   command did not execute successfully, got: exit status: 1
       >
       >   build script failed, must exit now
       >   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
       > warning: build failed, waiting for other jobs to finish...
       For full logs, run 'nix log /nix/store/xgrg5navl2lcs6mjkpym2539svafx859-zluda-4-unstable-2025-01-28.drv'.

@aviallon
Contributor

Ergh, zluda is broken again:

Are you on nixos-unstable-small?
It is not yet on nixos-unstable.

@pshirshov
Contributor

On master.

@arunoruto
Contributor

According to nixpk.gs, it should be available on nixos-unstable now!

Labels

  • 6.topic: closure size (The final size of a derivation, including its dependencies)
  • 6.topic: nixos (Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS)
  • 6.topic: python (Python is a high-level, general-purpose programming language.)
  • 6.topic: rocm (ROCm is an Advanced Micro Devices software stack for graphics processing unit programming.)
  • 8.has: changelog (This PR adds or changes release notes)
  • 8.has: clean-up (This PR removes packages or removes other cruft)
  • 8.has: documentation (This PR adds or changes documentation)
  • 8.has: package (new) (This PR adds a new package)
  • 10.rebuild-darwin: 101-500 (This PR causes between 101 and 500 packages to rebuild on Darwin.)
  • 10.rebuild-linux: 501-1000 (This PR causes many rebuilds on Linux and should normally target the staging branches.)
  • 10.rebuild-linux: 501+ (This PR causes many rebuilds on Linux and should normally target the staging branches.)
  • 12.approvals: 2 (This PR was reviewed and approved by two persons.)
  • 12.approved-by: package-maintainer (This PR was reviewed and approved by a maintainer listed in any of the changed packages.)
