[WIP] ML CI: Linux aarch64#34300

Closed
adamjstewart wants to merge 13 commits into spack:develop from adamjstewart:ci/ml-linux-aarch64

Conversation

@adamjstewart
Member

Copied our existing CI from x86_64 to aarch64

@spackbot-app spackbot-app bot added the core (PR affects Spack core functionality) and gitlab (Issues related to gitlab integration) labels Dec 4, 2022
@adamjstewart
Member Author

Any idea why the compiler isn't working? I naively copied the x86_64 version and just replaced x86_64 with aarch64; I'm not sure what other changes need to be made.

@adamjstewart adamjstewart changed the title ML CI: Linux aarch64 [WIP] ML CI: Linux aarch64 Dec 6, 2022
@adamjstewart
Member Author

@scottwittenburg any thoughts on this one?

@scottwittenburg
Contributor

I think, again, you want to specify the compiler and arch as part of a spec matrix here. By default the concretizer seems to have picked linux-amzn2-haswell, when you wanted aarch64.

Also, I noticed at least one of the pipeline generation jobs had the x86_64 tag instead of aarch64. I see you've tried to add the aarch64 tag, in a way similar to how some of the other pipelines for that arch do it, but somehow it's not working in your case. I'll let you know if I discover the cause of this, but in the meantime you could try extending the existing .pr-generate-aarch64 and .protected-generate-aarch64 definitions.
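
To illustrate the spec-matrix suggestion: a matrix takes one entry from each row and combines them, so every package gets the compiler and target pinned rather than leaving the choice to the concretizer. The sketch below only mimics that cross product in plain Python; it is not Spack's implementation. The package names are hypothetical, and the compiler and target are the ones that show up in this thread's build logs.

```python
from itertools import product


def expand_matrix(rows):
    """Return one spec string per combination, taking one entry from each row."""
    return [" ".join(parts) for parts in product(*rows)]


# Hypothetical package list, for illustration only.
packages = ["py-torch", "py-tensorflow"]
# Compiler and target as they appear in the install paths of the build logs.
compilers = ["%gcc@7.3.1"]
targets = ["target=aarch64"]

specs = expand_matrix([packages, compilers, targets])
for spec in specs:
    print(spec)
```

With the target constraint attached to every spec, a default such as linux-amzn2-haswell can no longer be selected for these packages.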

@adamjstewart
Member Author

It's working! Now to fix some package build issues...

@adamjstewart
Member Author

First up: intel-oneapi-tbb

==> [2022-12-09-23:30:50.352912] 'bash' 'l_tbb_oneapi_p_2021.7.1.15005_offline.sh' '-s' '-a' '-s' '--action' 'install' '--eula' 'accept' '--install-dir' '/home/software/spack/[padded-to-384-chars]/morepadding/linux-amzn2-aarch64/gcc-7.3.1/intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu'
/tmp/root/spack-stage/spack-stage-intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu/spack-src/l_tbb_oneapi_p_2021.7.1.15005_offline/install.sh: line 34: /tmp/root/spack-stage/spack-stage-intel-oneapi-tbb-2021.7.1-upblbuqdyzjtki4kqvfqhwqu2huhs2tu/spack-src/l_tbb_oneapi_p_2021.7.1.15005_offline/bootstrapper: cannot execute binary file

@rscohn2 does intel-oneapi-tbb not work on aarch64? If so I can add a conflict.
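
A "cannot execute binary file" error usually means the binary was built for a different architecture than the host. One way to check, sketched here under that assumption, is to read the `e_machine` field from the file's ELF header. The helper names are hypothetical; the field offsets and machine codes come from the ELF specification.

```python
import struct

# e_machine values from the ELF specification.
ELF_MACHINES = {62: "x86_64", 183: "aarch64"}


def elf_machine(header: bytes) -> str:
    """Return the architecture named in an ELF header.

    Returns "not-elf" for non-ELF input and "unknown (N)" for other machines.
    """
    if len(header) < 20 or header[:4] != b"\x7fELF":
        return "not-elf"
    # Byte 5 (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")


def file_machine(path: str) -> str:
    """Read the first 20 bytes of a file and report its ELF architecture."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

Calling `file_machine` on the extracted `bootstrapper` and comparing the result with the host architecture would show whether the installer ships an x86_64-only binary.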

@adamjstewart
Member Author

Next: valgrind

/usr/lib/gcc/aarch64-redhat-linux/7/libgcc.a(lse-init.o): In function `init_have_lse_atomics':
(.text.startup+0xc): undefined reference to `getauxval'
(.text.startup+0xc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `getauxval'
collect2: error: ld returned 1 exit status
make[3]: *** [Makefile:1112: memcheck-arm64-linux] Error 1

Tried adding a newer version of valgrind in #34436; if we're lucky, that may help. If not, I'll just disable it in the PyTorch package. I'm guessing that's the only thing that needs it.

@adamjstewart
Member Author

py-tensorboard-data-server issues fixed in #34437 and tensorflow/tensorboard#6101

@rscohn2
Member

rscohn2 commented Dec 10, 2022

> @rscohn2 does intel-oneapi-tbb not work on aarch64? If so I can add a conflict.

None of the packages will work so I added a conflict in the base package: #34441

@adamjstewart adamjstewart force-pushed the ci/ml-linux-aarch64 branch 2 times, most recently from 3268381 to 8d1759b Compare December 12, 2022 22:11
@adamjstewart
Member Author

Next round of build failures. LLVM:

/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp:63:28: error: aggregate 'lldb_private::process_linux::NativeRegisterContextLinux::CreateHostNativeRegisterContextLinux(const lldb_private::ArchSpec&, lldb_private::process_linux::NativeThreadLinux&)::user_sve_header sve_header' has incomplete type and cannot be defined
     struct user_sve_header sve_header;
                            ^~~~~~~~~~
/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp: In member function 'virtual lldb_private::Status lldb_private::process_linux::NativeRegisterContextLinux_arm64::WriteRegister(const lldb_private::RegisterInfo*, const lldb_private::RegisterValue&)':
/tmp/root/spack-stage/spack-stage-llvm-14.0.6-5raql5vuvka6tcsadwnnzh6elturcyel/spack-src/lldb/source/Plugins/Process/Linux/NativeRegisterContextLinux_arm64.cpp:376:13: error: 'sve_vl_valid' was not declared in this scope
         if (sve_vl_valid(vg_value * 8)) {
             ^~~~~~~~~~~~

@haampie @trws ever seen this one before?

@adamjstewart
Member Author

adamjstewart commented Dec 14, 2022

PyTorch:

  cc1: error: invalid feature modifier in '-march=armv8.2-a+dotprod'

Looks like the same issue as google/XNNPACK#1551 and https://discuss.pytorch.org/t/installation-from-source-on-aarch64-meet-march-armv8-2-a-fp16-dotprod-failure/162336/2. Solution is to use a newer compiler. Not really sure how to get a newer GCC on these images though...

Update: easier solution is just to disable XNNPACK
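
For reference, whether a given compiler understands the flag can be probed directly by asking it to compile an empty translation unit with that flag. This is a minimal sketch, assuming a GCC-compatible command line and a POSIX `/dev/null`; the function name is hypothetical.

```python
import subprocess


def compiler_accepts(cc: str, flag: str) -> bool:
    """Return True if `cc` compiles an empty C translation unit with `flag`."""
    try:
        result = subprocess.run(
            # Read C source from stdin ("-") and discard the object file.
            [cc, flag, "-x", "c", "-c", "-", "-o", "/dev/null"],
            input=b"",
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # Compiler not installed or not executable.
        return False
    return result.returncode == 0
```

For example, `compiler_accepts("gcc", "-march=armv8.2-a+dotprod")` should come back false on a compiler that reports the "invalid feature modifier" error above.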

@adamjstewart adamjstewart force-pushed the ci/ml-linux-aarch64 branch 3 times, most recently from 0650c39 to 1a2c300 Compare December 22, 2022 17:37
@adamjstewart
Member Author

Ping @haampie @trws. LLVM is the only thing holding up this new pipeline at the moment. I could disable TF on aarch64, but I'd rather test it if possible.

@trws
Contributor

trws commented Dec 24, 2022 via email

@adamjstewart
Member Author

I could have sworn PyTorch was building before...

New build issue:

  /home/software/spack/[padded-to-384-chars]/linux-amzn2-aarch64/gcc-7.3.1/cuda-11.8.0-wiyn3mcz4hhvd6rx6ysqbbbv45rwi6y3/lib64/libcurand.so: undefined reference to `expf@GLIBC_2.27'
  /home/software/spack/[padded-to-384-chars]/linux-amzn2-aarch64/gcc-7.3.1/cuda-11.8.0-wiyn3mcz4hhvd6rx6ysqbbbv45rwi6y3/lib64/libcurand.so: undefined reference to `logf@GLIBC_2.27'
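
The `@GLIBC_2.27` suffix means libcurand.so was linked against glibc 2.27 or newer, so an image with an older glibc cannot resolve those symbols. The sketch below pulls the required version out of linker output like the above and compares it with an available version. The helper names are hypothetical, and the `(2, 26)` value is an assumption about the image, not something stated in this thread.

```python
import re


def required_glibc(linker_output: str) -> tuple:
    """Return the highest GLIBC symbol version mentioned, as a tuple of ints."""
    versions = [
        tuple(int(part) for part in match.split("."))
        for match in re.findall(r"@GLIBC_(\d+(?:\.\d+)*)", linker_output)
    ]
    # An empty tuple means no versioned GLIBC symbol was found.
    return max(versions, default=())


def glibc_too_old(linker_output: str, available: tuple) -> bool:
    """Return True if the output asks for a newer glibc than `available`."""
    needed = required_glibc(linker_output)
    return bool(needed) and available < needed


log = """
libcurand.so: undefined reference to `expf@GLIBC_2.27'
libcurand.so: undefined reference to `logf@GLIBC_2.27'
"""
print(required_glibc(log))          # highest version requested by the library
print(glibc_too_old(log, (2, 26)))  # (2, 26) is an assumed system glibc version
```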

@adamjstewart
Member Author

@spackbot run pipeline

@spackbot-app

spackbot-app bot commented Jan 17, 2023

I've started that pipeline for you!

@adamjstewart
Member Author

@spackbot run pipeline

@spackbot-app

spackbot-app bot commented Feb 2, 2023

I've started that pipeline for you!

@adamjstewart
Member Author

This is outdated after the CI refactor. I'll create a new PR when I get a chance.

@adamjstewart adamjstewart deleted the ci/ml-linux-aarch64 branch March 11, 2023 17:08
Labels

core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration)

Projects

Status: Done

4 participants