[WIP] ML CI: Linux aarch64 by adamjstewart · Pull Request #34300 · spack/spack

adamjstewart · 2022-12-04T02:08:51Z

Copied our existing CI from x86_64 to aarch64

adamjstewart · 2022-12-04T03:45:53Z

Any idea why the compiler isn't working? I naively copied the x86_64 version and just replaced x86_64 with aarch64; I'm not sure what other changes need to be made.

adamjstewart · 2022-12-09T02:25:53Z

@scottwittenburg any thoughts on this one?

scottwittenburg · 2022-12-09T16:35:16Z

I think again here, you want to specify compiler and arch as part of a spec matrix. It seems by default the concretizer picked linux-amzn2-haswell, when you wanted aarch64.

Also, I noticed at least one of the pipeline generation jobs had the x86_64 tag instead of aarch64. I see you've tried to add the aarch64 tag, in a way similar to how some of the other pipelines for that arch do it, but somehow it's not working in your case. I'll let you know if I discover the cause of this, but in the meantime you could try extending the existing .pr-generate-aarch64 and .protected-generate-aarch64 definitions.
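For readers following along, the mismatch described above can be spotted mechanically from the arch string. The following is only a minimal sketch: the `platform-os-target` layout is taken from strings like `linux-amzn2-haswell` in this thread, while `parse_arch`, `target_family`, and the two target tables are hypothetical names introduced here and are deliberately incomplete. Spack itself relies on its own microarchitecture database rather than anything like this.

```python
from typing import NamedTuple


class Arch(NamedTuple):
    platform: str
    os: str
    target: str


def parse_arch(triplet: str) -> Arch:
    """Split a 'platform-os-target' string such as 'linux-amzn2-aarch64'."""
    parts = triplet.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected platform-os-target, got {triplet!r}")
    return Arch(*parts)


# Hypothetical, incomplete lookup tables used only for this sketch.
X86_64_TARGETS = {"x86_64", "haswell", "skylake", "zen2"}
AARCH64_TARGETS = {"aarch64", "graviton2", "neoverse_n1"}


def target_family(target: str) -> str:
    """Return the CPU family for a target name, or 'unknown'."""
    if target in X86_64_TARGETS:
        return "x86_64"
    if target in AARCH64_TARGETS:
        return "aarch64"
    return "unknown"


if __name__ == "__main__":
    picked = parse_arch("linux-amzn2-haswell")
    wanted = "aarch64"
    if target_family(picked.target) != wanted:
        print(f"concretized for {picked.target}, but the stack wants {wanted}")
```

A check like this only reports the symptom. The fix suggested above is still to pin the compiler and arch in the spec matrix.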

adamjstewart · 2022-12-10T03:52:00Z

It's working! Now to fix some package build issues...

adamjstewart · 2022-12-10T03:54:12Z

First up: intel-oneapi-tbb

==> [2022-12-09-23:30:50.352912] 'bash' 'l_tbb_oneapi_p_2021.7.1.15005_offline.sh' '-s' '-a' '-s' '--action' 'install' '--eula' 'accept' '--install-dir' '/home/software/spack/[padded-to-384-chars]/morepadding/linux-amzn2-aarch64/gcc-7.3.1/intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu'
/tmp/root/spack-stage/spack-stage-intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu/spack-src/l_tbb_oneapi_p_2021.7.1.15005_offline/install.sh: line 34: /tmp/root/spack-stage/spack-stage-intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu/spack-src/l_tbb_oneapi_p_2021.7.1.15005_offline/bootstrapper: cannot execute binary file

@rscohn2 does intel-oneapi-tbb not work on aarch64? If so I can add a conflict.

adamjstewart · 2022-12-10T04:22:00Z

Next: valgrind

/usr/lib/gcc/aarch64-redhat-linux/7/libgcc.a(lse-init.o): In function `init_have_lse_atomics':
(.text.startup+0xc): undefined reference to `getauxval'
(.text.startup+0xc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `getauxval'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:1112: memcheck-arm64-linux] Error 1

Tried adding a newer version of valgrind in #34436; if we're lucky, that may help. If not, I'll just disable it in the PyTorch package. I'm guessing that's the only thing that needs it.

adamjstewart · 2022-12-10T04:43:17Z

py-tensorboard-data-server issues fixed in #34437 and tensorflow/tensorboard#6101

rscohn2 · 2022-12-10T13:20:49Z

> @rscohn2 does intel-oneapi-tbb not work on aarch64? If so I can add a conflict.

None of the packages will work so I added a conflict in the base package: #34441

adamjstewart · 2022-12-14T03:33:39Z

Next round of build failures. LLVM:

/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp:63:28: error: aggregate 'lldb_private::process_linux::NativeRegisterContextLinux::CreateHostNativeRegisterContextLinux(const lldb_private::ArchSpec&, lldb_private::process_linux::NativeThreadLinux&)::user_sve_header sve_header' has incomplete type and cannot be defined
     struct user_sve_header sve_header;
                            ^~~~~~~~~~
/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp: In member function 'virtual lldb_private::Status lldb_private::process_linux::NativeRegisterContextLinux_arm64::WriteRegister(const lldb_private::RegisterInfo*, const lldb_private::RegisterValue&)':
/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp:376:13: error: 'sve_vl_valid' was not declared in this scope
         if (sve_vl_valid(vg_value * 8)) {
             ^~~~~~~~~~~~

@haampie @trws ever seen this one before?

adamjstewart · 2022-12-14T03:39:30Z

PyTorch:

  cc1: error: invalid feature modifier in '-march=armv8.2-a+dotprod'

Looks like the same issue as google/XNNPACK#1551 and https://discuss.pytorch.org/t/installation-from-source-on-aarch64-meet-march-armv8-2-a-fp16-dotprod-failure/162336/2. Solution is to use a newer compiler. Not really sure how to get a newer GCC on these images though...

Update: easier solution is just to disable XNNPACK
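To tell whether a given image is affected, one option is to ask the compiler directly whether it accepts the flag. The following is a rough sketch rather than anything Spack or PyTorch does: `compiler_accepts_flag` is a hypothetical helper, and it assumes a GCC-compatible driver that understands `-x c -E -` (preprocess C from standard input).

```python
import shutil
import subprocess


def compiler_accepts_flag(compiler: str, flag: str) -> bool:
    """Return True if `compiler` preprocesses an empty C input with `flag` set."""
    exe = shutil.which(compiler)
    if exe is None:
        # Compiler is not on PATH, so it certainly cannot accept the flag.
        return False
    result = subprocess.run(
        [exe, flag, "-x", "c", "-E", "-"],
        input=b"",  # empty translation unit on stdin
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


if __name__ == "__main__":
    flag = "-march=armv8.2-a+dotprod"
    print(f"gcc accepts {flag}: {compiler_accepts_flag('gcc', flag)}")
```

If this prints `False` for the GCC on the image, the options are the two mentioned above: a newer compiler, or building without XNNPACK.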

adamjstewart · 2022-12-23T20:22:19Z

Ping @haampie @trws. LLVM is the only thing holding up this new pipeline at the moment. I could disable TF on aarch64, but I'd rather test it if possible.

trws · 2022-12-24T05:19:39Z

Nope, that’s a new one. All I know is that it’s an SVE-related issue. I’ve seen a couple of issues due to arch detection, but this looks like a header version problem. Maybe whatever is providing that header is too old?

…
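One way to test the "header is too old" theory is to look for the missing names in the header that should provide them. This is only a sketch: `header_defines` is a hypothetical helper, and the path `/usr/include/asm/ptrace.h` is an assumption about where the kernel's arm64 ptrace definitions are installed on the image, not something confirmed in this thread.

```python
from pathlib import Path


def header_defines(header: Path, symbol: str) -> bool:
    """Return True if `symbol` appears anywhere in the text of `header`."""
    try:
        text = header.read_text(errors="replace")
    except OSError:
        # Missing or unreadable header counts as "not defined".
        return False
    return symbol in text


if __name__ == "__main__":
    # Assumed location of the header; adjust for the image being inspected.
    header = Path("/usr/include/asm/ptrace.h")
    for name in ("struct user_sve_header", "sve_vl_valid"):
        print(f"{name}: {'found' if header_defines(header, name) else 'missing'}")
```

A plain substring search can be fooled by comments, so a "found" result is weaker evidence than a "missing" one.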

adamjstewart · 2023-01-16T17:36:32Z

I could have sworn PyTorch was building before...

New build issue:

  /home/software/spack/[padded-to-384-chars]/linux-amzn2-aarch64/gcc-7.3.1/cuda-11.8.0-wiyn3mcz4hhvd6rx6ysqbbbv45rwi6y3/lib64/libcurand.so: undefined reference to `expf@GLIBC_2.27'
  /home/software/spack/[padded-to-384-chars]/linux-amzn2-aarch64/gcc-7.3.1/cuda-11.8.0-wiyn3mcz4hhvd6rx6ysqbbbv45rwi6y3/lib64/libcurand.so: undefined reference to `logf@GLIBC_2.27'
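The `@GLIBC_2.27` suffix is a symbol version: `libcurand.so` wants the `expf` and `logf` that first appeared with that glibc release, so the linker error points to an older glibc on the image. A small sketch for comparing the two versions follows. `required_glibc` and `host_glibc` are hypothetical helper names, and `platform.libc_ver()` reports the glibc that the Python interpreter itself is linked against, which is assumed here to be the system one.

```python
import platform
import re


def required_glibc(message: str):
    """Extract (major, minor) from a '...@GLIBC_X.Y' linker message, or None."""
    match = re.search(r"@GLIBC_(\d+)\.(\d+)", message)
    return (int(match.group(1)), int(match.group(2))) if match else None


def host_glibc():
    """Return the host glibc version as (major, minor), or None if not glibc."""
    lib, version = platform.libc_ver()
    match = re.match(r"(\d+)\.(\d+)", version)
    if lib != "glibc" or match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


if __name__ == "__main__":
    line = "libcurand.so: undefined reference to `expf@GLIBC_2.27'"
    need, have = required_glibc(line), host_glibc()
    if need and have and have < need:
        print(f"host glibc {have} is older than required {need}")
    else:
        print(f"required: {need}, host: {have}")
```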

adamjstewart · 2023-01-17T06:02:58Z

@spackbot run pipeline

spackbot-app · 2023-01-17T06:03:04Z

I've started that pipeline for you!

adamjstewart · 2023-02-02T22:25:59Z

@spackbot run pipeline

spackbot-app · 2023-02-02T22:26:07Z

I've started that pipeline for you!

adamjstewart · 2023-03-11T17:08:53Z

This is outdated after the CI refactor. I'll create a new PR when I get a chance.

spackbot-app bot added the core (PR affects Spack core functionality) and gitlab (Issues related to gitlab integration) labels Dec 4, 2022

adamjstewart requested review from scottwittenburg and zackgalbreath December 4, 2022 03:44

adamjstewart changed the title ~~ML CI: Linux aarch64~~ [WIP] ML CI: Linux aarch64 Dec 6, 2022

adamjstewart mentioned this pull request Dec 10, 2022

py-tensorboard-data-server: add Linux aarch64 support #34437

Merged

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 5cfe9b6 to 7f66edc Compare December 10, 2022 05:00

adamjstewart force-pushed the ci/ml-linux-aarch64 branch 2 times, most recently from 3268381 to 8d1759b Compare December 12, 2022 22:11

adamjstewart mentioned this pull request Dec 17, 2022

py-horovod: patch no longer applies #34593

Merged

adamjstewart force-pushed the ci/ml-linux-aarch64 branch 3 times, most recently from 0650c39 to 1a2c300 Compare December 22, 2022 17:37

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 1a2c300 to 076ce16 Compare January 15, 2023 18:25

adamjstewart mentioned this pull request Jan 15, 2023

Installation issue: LLVM on aarch64 #34954

Closed

4 tasks

adamjstewart added 13 commits January 30, 2023 14:35

ML CI: Linux aarch64

78872e6

Rename everything

a361cdc

Don't specify compiler

e25d738

More explicit target/compiler

378e1f0

Update all stacks

648e9cc

Extend aarch64 versions

c4ad840

Valgrind doesn't work, don't specify compiler

a889b2b

Uncomment valgrind changes

44d5366

No valgrind

4c68031

No XNNPACK

9bcc84c

Faster TF

f0442ed

Skip TF since LLVM doesn't build

635d90f

Keras also depends on TF depends on LLVM

341b5bb

adamjstewart force-pushed the ci/ml-linux-aarch64 branch from 6cf9c15 to 341b5bb Compare January 30, 2023 21:35

adamjstewart closed this Mar 11, 2023

adamjstewart deleted the ci/ml-linux-aarch64 branch March 11, 2023 17:08

adamjstewart mentioned this pull request Aug 28, 2023

ML CI: Linux aarch64 #39666

Merged
