ML CI: Linux aarch64 by adamjstewart · Pull Request #39666 · spack/spack

adamjstewart · 2023-08-28T20:54:34Z

Reboot of #34300

Hoping there will be fewer build issues this time around.

share/spack/gitlab/cloud_pipelines/.gitlab-ci.yml

adamjstewart · 2024-02-17T11:11:18Z

@kwryankrattiger that doesn't seem to have made anything worse, although it doesn't fix any of the above errors.

var/spack/repos/builtin/packages/rust/package.py

var/spack/repos/builtin/packages/py-onnx/package.py

adamjstewart · 2024-04-10T17:45:13Z

Requires spack/gitlab-runners#49 to build a new image with a modern GCC version.

adamjstewart · 2024-04-17T20:30:12Z

Remaining aarch64 issues are mostly known issues:

PyTorch: Fails to compile with nvidia-cuda-toolkit-12.4.0 pytorch/pytorch#122169
TensorFlow: [AARCH64] Building TF 2.15.0 from sources failed with undefined __Int8x8_t tensorflow/tensorflow#62490

Can anyone look at the llvm-amdgpu build issue on the ROCm stack? @haampie @renjithravindrankannath @srekolam

var/spack/repos/builtin/packages/py-torch/package.py

adamjstewart · 2024-10-03T13:01:36Z

Nope, I'm the one who was missing something. I was looking in cmake/Modules instead of llvm/cmake/modules 😅

@aweits want to submit a PR? If not I can get to it later tonight or tomorrow.

adamjstewart · 2024-10-04T12:19:18Z

@spackbot run pipeline

spackbot-app · 2024-10-04T12:19:29Z

I've started that pipeline for you!

adamjstewart · 2024-10-13T11:19:52Z

@aweits looks like llvm-amdgpu has a new build error now

adamjstewart · 2024-10-21T06:54:34Z

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

tgamblin · 2024-10-21T06:58:30Z

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

I'd be thrilled to see it but I don't think it exists!

adamjstewart · 2024-10-21T07:03:12Z

We could comment out JAX for now. Once #46346 is complete the issue will likely resolve itself. But that PR is pretty far from working.

aweits · 2024-10-21T12:43:13Z

i should be able to get some time to poke at this again today...

aweits · 2024-10-21T21:58:41Z

here's the tl;dr on a fix for arm/llvm-amdgpu - will put appropriate conditionals on this and submit a pr:

[aweits@gg-00 spack]$ git diff
diff --git a/var/spack/repos/builtin/packages/llvm-amdgpu/package.py b/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
index 3e39a4445ba..c342098588f 100644
--- a/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
+++ b/var/spack/repos/builtin/packages/llvm-amdgpu/package.py
@@ -222,7 +222,7 @@ def cmake_args(self):
             self.define("LIBCXXABI_ENABLE_STATIC", "ON"),
             self.define("LIBCXXABI_INSTALL_STATIC_LIBRARY", "OFF"),
             self.define("LLVM_ENABLE_RTTI", "ON"),
-            self.define("LLVM_TARGETS_TO_BUILD", "AMDGPU;X86"),
+            self.define("LLVM_TARGETS_TO_BUILD", "AMDGPU;AArch64"),
             self.define("LLVM_AMDGPU_ALLOW_NPI_TARGETS", "ON"),
             self.define("PACKAGE_VENDOR", "AMD"),
             self.define("CLANG_ENABLE_AMDCLANG", "ON"),

aweits · 2024-10-22T00:36:57Z

got jax building - again, will open pr with appropriate conditionals

diff --git a/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch b/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch
new file mode 100644
index 00000000000..e96cc32e263
--- /dev/null
+++ b/var/spack/repos/builtin/packages/py-jaxlib/jaxxlatsl.patch
@@ -0,0 +1,100 @@
+From 8fce7378ed8ce994107568449806cd99274ab22b Mon Sep 17 00:00:00 2001
+From: Andrew Elble <[email protected]>
+Date: Mon, 21 Oct 2024 19:42:31 -0400
+Subject: [PATCH] patchit
+
+---
+ ...ch-for-Abseil-to-fix-build-on-Jetson.patch | 68 +++++++++++++++++++
+ third_party/xla/workspace.bzl                 |  1 +
+ 2 files changed, 69 insertions(+)
+ create mode 100644 third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+
+diff --git a/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch b/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+new file mode 100644
+index 000000000000..5138a045082b
+--- /dev/null
++++ b/third_party/xla/0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch
+@@ -0,0 +1,68 @@
++From 40da87a0476436ca1da2eafe08935787a05e9a61 Mon Sep 17 00:00:00 2001
++From: David Dunleavy <[email protected]>
++Date: Mon, 5 Aug 2024 11:42:53 -0700
++Subject: [PATCH] Add patch for Abseil to fix build on Jetson
++
++Patches in https://github.com/abseil/abseil-cpp/commit/372124e6af36a540e74a2ec31d79d7297a831f98
++
++PiperOrigin-RevId: 659627531
++---
++ .../tsl/third_party/absl/nvidia_jetson.patch  | 35 +++++++++++++++++++
++ .../tsl/third_party/absl/workspace.bzl        |  1 +
++ 2 files changed, 36 insertions(+)
++ create mode 100644 third_party/tsl/third_party/absl/nvidia_jetson.patch
++
++diff --git a/third_party/tsl/third_party/absl/nvidia_jetson.patch b/third_party/tsl/third_party/absl/nvidia_jetson.patch
++new file mode 100644
++index 000000000000..5328c3a0d605
++--- /dev/null
+++++ b/third_party/tsl/third_party/absl/nvidia_jetson.patch
++@@ -0,0 +1,35 @@
+++From 372124e6af36a540e74a2ec31d79d7297a831f98 Mon Sep 17 00:00:00 2001
+++From: =?UTF-8?q?Fr=C3=A9d=C3=A9ric=20Bastien?= <[email protected]>
+++Date: Thu, 1 Aug 2024 12:38:52 -0700
+++Subject: [PATCH] PR #1732: Fix build on NVIDIA Jetson board. Fix #1665
+++
+++Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1732
+++
+++Fix build on NVIDIA Jetson board. Fix #1665
+++
+++This patch is already used by the spark project.
+++I'm fixing this as this break the build of Tensorflow and JAX on Jetson board.
+++Merge 7db2d2ab9fbed1f0fabad10a6ec73533ba71bfff into 6b8ebb35c0414ef5a2b6fd4a0f59057e41beaff9
+++
+++Merging this change closes #1732
+++
+++COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1732 from nouiz:fix_neon_on_jetson 7db2d2ab9fbed1f0fabad10a6ec73533ba71bfff
+++PiperOrigin-RevId: 658501520
+++Change-Id: If502ede4efc8c877fb3fed227eca6dc7622dd181
+++---
+++ absl/base/config.h | 2 +-
+++ 1 file changed, 1 insertion(+), 1 deletion(-)
+++
+++diff --git a/absl/base/config.h b/absl/base/config.h
+++index 97c9a22a109..ab1e9860a91 100644
+++--- a/absl/base/config.h
++++++ b/absl/base/config.h
+++@@ -926,7 +926,7 @@ static_assert(ABSL_INTERNAL_INLINE_NAMESPACE_STR[0] != 'h' ||
+++ // https://llvm.org/docs/CompileCudaWithLLVM.html#detecting-clang-vs-nvcc-from-code
+++ #ifdef ABSL_INTERNAL_HAVE_ARM_NEON
+++ #error ABSL_INTERNAL_HAVE_ARM_NEON cannot be directly set
+++-#elif defined(__ARM_NEON) && !defined(__CUDA_ARCH__)
++++#elif defined(__ARM_NEON) && !(defined(__NVCC__) && defined(__CUDACC__))
+++ #define ABSL_INTERNAL_HAVE_ARM_NEON 1
+++ #endif
+++ 
++diff --git a/third_party/tsl/third_party/absl/workspace.bzl b/third_party/tsl/third_party/absl/workspace.bzl
++index 06f75166ce4b..9565a82c3319 100644
++--- a/third_party/tsl/third_party/absl/workspace.bzl
+++++ b/third_party/tsl/third_party/absl/workspace.bzl
++@@ -44,4 +44,5 @@ def repo():
++         system_link_files = SYS_LINKS,
++         strip_prefix = "abseil-cpp-{commit}".format(commit = ABSL_COMMIT),
++         urls = tf_mirror_urls("https://github.com/abseil/abseil-cpp/archive/{commit}.tar.gz".format(commit = ABSL_COMMIT)),
+++        patch_file = ["//third_party/absl:nvidia_jetson.patch"],
++     )
++-- 
++2.31.1
++
+diff --git a/third_party/xla/workspace.bzl b/third_party/xla/workspace.bzl
+index af52e7671507..70481bc970a5 100644
+--- a/third_party/xla/workspace.bzl
++++ b/third_party/xla/workspace.bzl
+@@ -29,6 +29,7 @@ def repo():
+         name = "xla",
+         sha256 = XLA_SHA256,
+         strip_prefix = "xla-{commit}".format(commit = XLA_COMMIT),
++      patch_file = ["//third_party/xla:0001-Add-patch-for-Abseil-to-fix-build-on-Jetson.patch"],
+         urls = tf_mirror_urls("https://github.com/openxla/xla/archive/{commit}.tar.gz".format(commit = XLA_COMMIT)),
+     )
+ 
+-- 
+2.31.1
+
diff --git a/var/spack/repos/builtin/packages/py-jaxlib/package.py b/var/spack/repos/builtin/packages/py-jaxlib/package.py
index 09eb522c56a..ce7f217e8fe 100644
--- a/var/spack/repos/builtin/packages/py-jaxlib/package.py
+++ b/var/spack/repos/builtin/packages/py-jaxlib/package.py
@@ -127,7 +127,7 @@ class PyJaxlib(PythonPackage, CudaPackage, ROCmPackage):
         sha256="4dfb9f32d4eeb0a0fb3a6f4124c4170e3fe49511f1b768cd634c78d489962275",
         when="@:0.4.25",
     )
-
+    patch("jaxxlatsl.patch")
     conflicts(
         "cuda_arch=none",
         when="+cuda",

adamjstewart · 2024-10-22T08:43:26Z

You can assign me as a reviewer on these and I will merge them

…6694) spack#39666 (comment)

aweits · 2024-10-25T17:38:42Z

Todd thinks we should just give up on aarch64 + ROCm, so the JAX build is higher priority.

I'd be thrilled to see it but I don't think it exists!

I think we're probably best off at least separating the efforts and get this in for aarch64/cuda. Thoughts?

adamjstewart · 2024-10-25T17:41:02Z

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

aweits · 2024-10-25T18:24:58Z

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

I'm certainly willing to help with this.

adamjstewart · 2024-10-26T10:00:32Z

rocminfo is next. At this point, we basically need to hack every single package in the ROCm stack. I also want to make sure we're contributing these changes upstream. Let me remove all of the ROCm stuff from this PR and we can tackle ROCm another day.

afzpatel · 2024-11-01T16:00:16Z

@afzpatel helped add our x86_64 ROCm pipeline and may be willing to work on an aarch64 ROCm pipeline in a separate PR. I also want to get to ppc64le soon.

Hi @adamjstewart, ARM architecture based build is not supported in ROCm as of today. If there is any change in plan it will be announced through ROCm release notes.

spackbot-app bot added core PR affects Spack core functionality gitlab Issues related to gitlab integration labels Aug 28, 2023

adamjstewart mentioned this pull request Aug 28, 2023

Installation issue: LLVM on aarch64 #34954

Closed

4 tasks

adamjstewart requested review from kwryankrattiger and scottwittenburg August 29, 2023 15:08

This comment was marked as outdated.

Sign in to view

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 9f0b88b to de947c8 Compare September 7, 2023 17:00

adamjstewart requested a review from alalazo September 24, 2023 14:36

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from de947c8 to 6c786a5 Compare October 20, 2023 12:53

This comment was marked as outdated.

Sign in to view

adamjstewart mentioned this pull request Oct 28, 2023

PyTorch: build with external XNNPACK #40757

Closed

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 6c786a5 to 96a18b2 Compare January 12, 2024 09:52

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 96a18b2 to 71a8ed0 Compare January 29, 2024 17:07

kwryankrattiger reviewed Feb 14, 2024

View reviewed changes

share/spack/gitlab/cloud_pipelines/.gitlab-ci.yml Show resolved Hide resolved

kwryankrattiger reviewed Feb 15, 2024

View reviewed changes

share/spack/gitlab/cloud_pipelines/.gitlab-ci.yml Show resolved Hide resolved

kwryankrattiger reviewed Feb 15, 2024

View reviewed changes

share/spack/gitlab/cloud_pipelines/.gitlab-ci.yml Outdated Show resolved Hide resolved

kwryankrattiger reviewed Feb 15, 2024

View reviewed changes

share/spack/gitlab/cloud_pipelines/.gitlab-ci.yml Show resolved Hide resolved

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 160c192 to 4184de6 Compare April 10, 2024 09:36

spackbot-app bot added conflicts python update-package labels Apr 10, 2024

adamjstewart commented Apr 10, 2024

View reviewed changes

var/spack/repos/builtin/packages/rust/package.py Show resolved Hide resolved

adamjstewart commented Apr 10, 2024

View reviewed changes

var/spack/repos/builtin/packages/py-onnx/package.py Show resolved Hide resolved

spackbot-app bot added the dependencies label Apr 17, 2024

adamjstewart changed the title ~~[WIP] ML CI: Linux aarch64~~ ML CI: Linux aarch64 Apr 18, 2024

adamjstewart commented Apr 18, 2024

View reviewed changes

var/spack/repos/builtin/packages/py-torch/package.py Outdated Show resolved Hide resolved

aweits mentioned this pull request Oct 3, 2024

llvm-amdgpu: apply patch from https://github.com/llvm/llvm-project/pull/80071 #46753

Merged

tldahlgren assigned kwryankrattiger Oct 4, 2024

This was referenced Oct 22, 2024

llvm-amdgpu: support building on aarch64 #47124

Merged

py-jaxlib: backport fix for abseil-cpp on aarch64 #47125

Merged

arezaii pushed a commit to arezaii/spack that referenced this pull request Oct 22, 2024

cuda: add conflict for aarch64 with gcc-13.2.0 and [email protected] (spack#4…

4119c3a

…6694) spack#39666 (comment)

aweits mentioned this pull request Oct 25, 2024

hsa-rocr-dev: allow building on aarch64 #47216

Closed

adamjstewart added 2 commits October 26, 2024 12:05

Remove ROCm

457621d

Merge branch 'develop' into ci/ml-linux-aarch64

c2a0c46

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 248abea to c2a0c46 Compare October 26, 2024 10:05

alalazo approved these changes Oct 28, 2024

View reviewed changes

adamjstewart merged commit 32ce278 into spack:develop Oct 28, 2024

adamjstewart deleted the ci/ml-linux-aarch64 branch October 28, 2024 09:30

This was referenced Nov 25, 2024

maintenance/merge from upstream 2024 11 25 FNALssi/spack#467

Merged

maintenance/v0.23.0 fermi/fix duplication style FNALssi/spack#468

Closed

Conversation

adamjstewart commented Aug 28, 2023

Uh oh!

This comment was marked as outdated.

This comment was marked as outdated.

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

adamjstewart commented Feb 17, 2024

Uh oh!

Uh oh!

Uh oh!

adamjstewart commented Apr 10, 2024

Uh oh!

adamjstewart commented Apr 17, 2024

Uh oh!

Uh oh!

adamjstewart commented Oct 3, 2024

Uh oh!

adamjstewart commented Oct 4, 2024

Uh oh!

spackbot-app bot commented Oct 4, 2024

Uh oh!

adamjstewart commented Oct 13, 2024

Uh oh!

adamjstewart commented Oct 21, 2024

Uh oh!

tgamblin commented Oct 21, 2024

Uh oh!

adamjstewart commented Oct 21, 2024

Uh oh!

aweits commented Oct 21, 2024

Uh oh!

aweits commented Oct 21, 2024

Uh oh!

aweits commented Oct 22, 2024

Uh oh!

adamjstewart commented Oct 22, 2024

Uh oh!

aweits commented Oct 25, 2024

Uh oh!

adamjstewart commented Oct 25, 2024

Uh oh!

aweits commented Oct 25, 2024

Uh oh!

adamjstewart commented Oct 26, 2024

Uh oh!

afzpatel commented Nov 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants