Skip to content

ML CI: Linux aarch64#39666

Merged
adamjstewart merged 22 commits intospack:developfrom
adamjstewart:ci/ml-linux-aarch64
Oct 28, 2024
Merged

ML CI: Linux aarch64#39666
adamjstewart merged 22 commits intospack:developfrom
adamjstewart:ci/ml-linux-aarch64

Conversation

@adamjstewart
Copy link
Copy Markdown
Member

Reboot of #34300

Hoping there will be fewer build issues this time around.

@spackbot-app spackbot-app bot added core PR affects Spack core functionality gitlab Issues related to gitlab integration labels Aug 28, 2023
@adamjstewart

This comment was marked as outdated.

@adamjstewart

This comment was marked as outdated.

@adamjstewart

This comment was marked as outdated.

@adamjstewart
Copy link
Copy Markdown
Member Author

@kwryankrattiger that doesn't seem to have made anything worse, although it doesn't fix any of the above errors.

@adamjstewart
Copy link
Copy Markdown
Member Author

Requires spack/gitlab-runners#49 to build a new image with a modern GCC version.

@adamjstewart
Copy link
Copy Markdown
Member Author

Remaining aarch64 issues are mostly known issues:

Can anyone look at the llvm-amdgpu build issue on the ROCm stack? @haampie @renjithravindrankannath @srekolam

@adamjstewart adamjstewart changed the title [WIP] ML CI: Linux aarch64 ML CI: Linux aarch64 Apr 18, 2024
@adamjstewart
Copy link
Copy Markdown
Member Author

Nope, I'm the one who was missing something. I was looking in cmake/Modules instead of llvm/cmake/modules 😅

@aweits want to submit a PR? If not I can get to it later tonight or tomorrow.

@adamjstewart
Copy link
Copy Markdown
Member Author

@spackbot run pipeline

@spackbot-app
Copy link
Copy Markdown

spackbot-app bot commented Oct 4, 2024

I've started that pipeline for you!

@adamjstewart
Copy link
Copy Markdown
Member Author

@aweits looks like llvm-amdgpu has a new build error now

@adamjstewart
Copy link
Copy Markdown
Member Author

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

@tgamblin
Copy link
Copy Markdown
Member

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

I'd be thrilled to see it but I don't think it exists!

@adamjstewart
Copy link
Copy Markdown
Member Author

We could comment out JAX for now. Once #46346 is complete the issue will likely resolve itself. But that PR is pretty far from working.

@aweits
Copy link
Copy Markdown
Contributor

aweits commented Oct 21, 2024

i should be able to get some time to poke at this again today...

@aweits
Copy link
Copy Markdown
Contributor

aweits commented Oct 21, 2024

here's the tl;dr on a fix for arm/llvm-amdgpu - will put appropriate conditionals on this and submit a pr:

[aweits@gg-00 spack]$ git diff
diff --git a/var/spack/repos/builtin/packages/llvm-amdgpu/package.py b/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
index 3e39a4445ba..c342098588f 100644
--- a/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
+++ b/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
@@ -222,7 +222,7 @@ def cmake_args(self):
             self.define("LIBCXXABI_ENABLE_STATIC", "ON"),
             self.define("LIBCXXABI_INSTALL_STATIC_LIBRARY", "OFF"),
             self.define("LLVM_ENABLE_RTTI", "ON"),
-            self.define("LLVM_TARGETS_TO_BUILD", "AMDGPU;X86"),
+            self.define("LLVM_TARGETS_TO_BUILD", "AMDGPU;AArch64"),
             self.define("LLVM_AMDGPU_ALLOW_NPI_TARGETS", "ON"),
             self.define("PACKAGE_VENDOR", "AMD"),
             self.define("CLANG_ENABLE_AMDCLANG", "ON"),

@aweits
Copy link
Copy Markdown
Contributor

aweits commented Oct 22, 2024

got jax building - again, will open pr with appropriate conditionals

diff --git a/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch b/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch
new file mode 100644
index 00000000000..e96cc32e263
--- /dev/null
+++ b/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch
@@ -0,0 +1,100 @@
+From 8fce7378ed8ce994107568449806cd99274ab22b Mon Sep 17 00:00:00 2001
+From: Andrew Elble <[email protected]>
+Date: Mon, 21 Oct 2024 19:42:31 -0400
+Subject: [PATCH] patchit
+
+---
+ ...ch-for-Abseil-to-fix-build-on-Jetson.patch | 68 +++++++++++++++++++
+ third_party/xla/workspace.bzl                 |  1 +
+ 2 files changed, 69 insertions(+)
+ create mode 100644 third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+
+diff --git a/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch b/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+new file mode 100644
+index 000000000000..5138a045082b
+--- /dev/null
++++ b/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+@@ -0,0 +1,68 @@
++From 40da87a0476436ca1da2eafe08935787a05e9a61 Mon Sep 17 00:00:00 2001
++From: David Dunleavy <[email protected]>
++Date: Mon, 5 Aug 2024 11:42:53 -0700
++Subject: [PATCH] Add patch for Abseil to fix build on Jetson
++
++Patches in https://github.com/abseil/abseil-cpp/commit/372124e6af36a540e74a2ec31d79d7297a831f98
++
++PiperOrigin-RevId: 659627531
++---
++ .../tsl/third_party/absl/nvidia_jetson.patch  | 35 +++++++++++++++++++
++ .../tsl/third_party/absl/workspace.bzl        |  1 +
++ 2 files changed, 36 insertions(+)
++ create mode 100644 third_party/tsl/third_party/absl/nvidia_jetson.patch
++
++diff --git a/third_party/tsl/third_party/absl/nvidia_jetson.patch b/third_party/tsl/third_party/absl/nvidia_jetson.patch
++new file mode 100644
++index 000000000000..5328c3a0d605
++--- /dev/null
+++++ b/third_party/tsl/third_party/absl/nvidia_jetson.patch
++@@ -0,0 +1,35 @@
+++From 372124e6af36a540e74a2ec31d79d7297a831f98 Mon Sep 17 00:00:00 2001
+++From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Bastien?= <[email protected]>
+++Date: Thu, 1 Aug 2024 12:38:52 -0700
+++Subject: [PATCH] PR #1732: Fix build on NVIDIA Jetson board. Fix #1665
+++
+++Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1732
+++
+++Fix build on NVIDIA Jetson board. Fix #1665
+++
+++This patch is already used by the spark project.
+++I'm fixing this as this break the build of Tensorflow and JAX on Jetson board.
+++Merge 7db2d2ab9fbed1f0fabad10a6ec73533ba71bfff into 6b8ebb35c0414ef5a2b6fd4a0f59057e41beaff9
+++
+++Merging this change closes #1732
+++
+++COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1732 from nouiz:fix_neon_on_jetson 7db2d2ab9fbed1f0fabad10a6ec73533ba71bfff
+++PiperOrigin-RevId: 658501520
+++Change-Id: If502ede4efc8c877fb3fed227eca6dc7622dd181
+++---
+++ absl/base/config.h | 2 +-
+++ 1 file changed, 1 insertion(+), 1 deletion(-)
+++
+++diff --git a/absl/base/config.h b/absl/base/config.h
+++index 97c9a22a109..ab1e9860a91 100644
+++--- a/absl/base/config.h
++++++ b/absl/base/config.h
+++@@ -926,7 +926,7 @@ static_assert(ABSL_INTERNAL_INLINE_NAMESPACE_STR[0] != 'h' ||
+++ // https://llvm.org/docs/CompileCudaWithLLVM.html#detecting-clang-vs-nvcc-from-code
+++ #ifdef ABSL_INTERNAL_HAVE_ARM_NEON
+++ #error ABSL_INTERNAL_HAVE_ARM_NEON cannot be directly set
+++-#elif defined(__ARM_NEON) && !defined(__CUDA_ARCH__)
++++#elif defined(__ARM_NEON) && !(defined(__NVCC__) && defined(__CUDACC__))
+++ #define ABSL_INTERNAL_HAVE_ARM_NEON 1
+++ #endif
+++ 
++diff --git a/third_party/tsl/third_party/absl/workspace.bzl b/third_party/tsl/third_party/absl/workspace.bzl
++index 06f75166ce4b..9565a82c3319 100644
++--- a/third_party/tsl/third_party/absl/workspace.bzl
+++++ b/third_party/tsl/third_party/absl/workspace.bzl
++@@ -44,4 +44,5 @@ def repo():
++         system_link_files = SYS_LINKS,
++         strip_prefix = "abseil-cpp-{commit}".format(commit = ABSL_COMMIT),
++         urls = tf_mirror_urls("https://github.com/abseil/abseil-cpp/archive/{commit}.tar.gz".format(commit = ABSL_COMMIT)),
+++        patch_file = ["//third_party/absl:nvidia_jetson.patch"],
++     )
++-- 
++2.31.1
++
+diff --git a/third_party/xla/workspace.bzl b/third_party/xla/workspace.bzl
+index af52e7671507..70481bc970a5 100644
+--- a/third_party/xla/workspace.bzl
++++ b/third_party/xla/workspace.bzl
+@@ -29,6 +29,7 @@ def repo():
+         name = "xla",
+         sha256 = XLA_SHA256,
+         strip_prefix = "xla-{commit}".format(commit = XLA_COMMIT),
++      patch_file = ["//third_party/xla:0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch"],
+         urls = tf_mirror_urls("https://github.com/openxla/xla/archive/{commit}.tar.gz".format(commit = XLA_COMMIT)),
+     )
+ 
+-- 
+2.31.1
+
diff --git a/var/spack/repos/builtin/packages/py-jaxlib/package.py b/var/spack/repos/builtin/packages/py-jaxlib/package.py
index 09eb522c56a..ce7f217e8fe 100644
--- a/var/spack/repos/builtin/packages/py-jaxlib/package.py
+++ b/var/spack/repos/builtin/packages/py-jaxlib/package.py
@@ -127,7 +127,7 @@ class PyJaxlib(PythonPackage, CudaPackage, ROCmPackage):
         sha256="4dfb9f32d4eeb0a0fb3a6f4124c4170e3fe49511f1b768cd634c78d489962275",
         when="@:0.4.25",
     )
-
+    patch("jaxxlatsl.patch")
     conflicts(
         "cuda_arch=none",
         when="+cuda",

@adamjstewart
Copy link
Copy Markdown
Member Author

You can assign me as a reviewer on these and I will merge them

@aweits
Copy link
Copy Markdown
Contributor

aweits commented Oct 25, 2024

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

I'd be thrilled to see it but I don't think it exists!

I think we're probably best off at least separating the efforts and get this in for aarch64/cuda. Thoughts?

@adamjstewart
Copy link
Copy Markdown
Member Author

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

@aweits
Copy link
Copy Markdown
Contributor

aweits commented Oct 25, 2024

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

I'm certainly willing to help with this.

@adamjstewart
Copy link
Copy Markdown
Member Author

rocminfo is next. At this point, we basically need to hack every single package in the ROCm stack. I also want to make sure we're contributing these changes upstream. Let me remove all of the ROCm stuff from this PR and we can tackle ROCm another day.

@adamjstewart adamjstewart merged commit 32ce278 into spack:develop Oct 28, 2024
@adamjstewart adamjstewart deleted the ci/ml-linux-aarch64 branch October 28, 2024 09:30
@afzpatel
Copy link
Copy Markdown
Contributor

afzpatel commented Nov 1, 2024

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

Hi @adamjstewart, ARM architecture based build is not supported in ROCm as of today. If there is any change in plan it will be announced through ROCm release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conflicts core PR affects Spack core functionality dependencies gitlab Issues related to gitlab integration python update-package

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants