Add aws-pcluster[-aarch64] stacks #37627
Conversation
These stacks build the packages defined in https://github.com/spack/spack-configs/tree/main/AWS/parallelcluster. They use a custom container from https://github.com/spack/gitlab-runners, which includes the ParallelCluster software needed to link and build, as well as an upstream Spack installation with a current GCC and its dependencies. Intel and ARM software is installed and used during the build stage, but removed from the buildcache before the signing stage. The files `configs/linux/{arch}/ci.yaml` select the providers needed to build for specific architectures (icelake, skylake, neoverse_{n,v}1).
```yaml
- - /bin/bash "${SPACK_ARTIFACTS_ROOT}/postinstall.sh" -fg
  - spack config --scope site add "packages:all:target:\"target=${SPACK_TARGET_ARCH}\""
- signing-job:
    before_script:
```
I have checked that the rebuild-index job will succeed after deleting a few packages in the signing job. But will there be consequences when downloading?
- `libunistring` cflags for Intel are now supported in https://github.com/spack/spack-configs/tree/main/AWS/parallelcluster
- `gromacs` and `hdf5` for `wrf` currently cannot link in the container.
Removing packages to investigate offline
Same 'libimf.so not found' error in the cmake build of package `git`.
scottwittenburg
left a comment
Looking at this, I'm noticing that you've created 4 new pipelines, associated with only 2 new stacks. Can it be re-organized to add one pipeline (generate/build pair) per stack? We try to avoid allowing multiple jobs to write binaries for the same hash in parallel, as the race conditions that can result often cause checksum mismatches between the generated binaries and associated meta-data files.
But let me know if I've just missed something.
As suggested by @scottwittenburg .
Thanks for the review. I made one stack per pipeline per your suggestions. I guess there would be potential for race conditions otherwise, as some of the build targets overlap.
@spackbot re-run pipeline

I've started that pipeline for you!
```yaml
- $optimized_libs
```
```yaml
mirrors: { "mirror": "s3://spack-binaries/develop/aws-pcluster-aarch64" }
```
I will change the mirror names to match the pipelines in another commit after the pipelines have run, though.
kwryankrattiger
left a comment
Some comments. Due to the deadline, maybe some of these can be a follow-on.
```yaml
- - curl -LfsS "https://github.com/JuliaBinaryWrappers/GNUMake_jll.jl/releases/download/GNUMake-v4.3.0+1/GNUMake.v4.3.0.x86_64-linux-gnu.tar.gz" -o gmake.tar.gz
  - printf "fef1f59e56d2d11e6d700ba22d3444b6e583c663d6883fd0a4f63ab8bd280f0f gmake.tar.gz" | sha256sum --check --strict --quiet
  - tar -xzf gmake.tar.gz -C /usr bin/make 2> /dev/null
tags: ["x86_64_v4"]
```
It seems to me like icelake and skylake could be reduced to x86_64_v4.
The stack spack.yamls can specify the SPACK_TARGET_ARCH under their own variables.
You are right. I did not have a separate stack spack.yaml when I made these files.
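For reference, a stack-level spack.yaml could carry the target in its own variables so that the runner tag only needs the broader x86_64_v4 label. This is a sketch, not the actual file from this PR: `SPACK_TARGET_ARCH` is the variable consumed by the quoted `spack config` command, while the surrounding structure is illustrative.

```yaml
# Hypothetical stack spack.yaml fragment: pin the microarchitecture per
# stack via a job variable instead of per-architecture runner tags.
spack:
  # ...specs, definitions, mirrors elided...
  ci:
    pipeline-gen:
    - build-job:
        variables:
          SPACK_TARGET_ARCH: icelake
        tags: ["x86_64_v4"]
```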
```yaml
- gettext
```
```yaml
- compiler_target:
  - '%[email protected] target=x86_64_v3'
```
This doesn't seem consistent with the icelake configs above. Am I missing something?
The architecture used is defined by the cluster head node. But clusters may have different architecture compute nodes, so I want the compilers to be as broad as possible. If I made this _v4 and added a zen2 compute node, I would not be able to use the compiler for the compute node.
If that is a concern, then I don't understand why this stack is specifically building for icelake.
```yaml
ci:
  pipeline-gen:
  - build-job:
```
It seems like this part could be moved into share/spack/gitlab/cloud_pipelines/configs/pcluster/ci.yaml and included or added to the generate jobs configs.
Yes, I can do that.
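A shared file along the suggested lines might look like this. This is a sketch assuming Spack's usual cloud-pipelines config layering; the path is the one proposed above, and the commands are the ones quoted from the stack files.

```yaml
# Hypothetical share/spack/gitlab/cloud_pipelines/configs/pcluster/ci.yaml,
# factoring the common ParallelCluster before_script out of the stacks.
ci:
  pipeline-gen:
  - build-job:
      before_script:
      - - /bin/bash "${SPACK_ARTIFACTS_ROOT}/postinstall.sh" -fg
        - spack config --scope site add "packages:all:target:\"target=${SPACK_TARGET_ARCH}\""
```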
```yaml
view: false

definitions:
- compiler_specs:
```
Since this stuff is identical between the stacks, it may be worthwhile to merge the aarch64 stacks (and likewise the x86_64_v4 ones) and use a matrix spec and submappings to assign the runner tags appropriately.
```yaml
ci:
  pipeline-gen:
  - match_behavior: first
    submapping:
    - match:
      - target=neoverse_n1
      build-job:
        tags: ["graviton2"]
    - match:
      - target=neoverse_v1
      build-job:
        tags: ["graviton3"]
```

I am not sure if both neoverse_n1 and neoverse_v1 can be concretized on graviton3 runners, but I am pretty sure icelake and skylake can be, so I would think the same backwards compatibility might apply to ARM architectures.
Let me take a look at this after the deadline. It seems I need to learn some more syntax.
I wonder what's going on in this early-stage …
```yaml
before_script:
# Do not distribute Intel & ARM binaries
- - for i in $(aws s3 ls --recursive ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/ | grep intel-oneapi | awk '{print $4}' | sed -e 's?^.*build_cache/??g'); do aws s3 rm ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/$i; done
  - for i in $(aws s3 ls --recursive ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/ | grep armpl | awk '{print $4}' | sed -e 's?^.*build_cache/??g'); do aws s3 rm ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/$i; done
```
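The grep/awk/sed filter inside those loops can be sanity-checked locally against sample `aws s3 ls --recursive` output. The object keys below are fabricated; only the column layout (date, time, size, key) matters.

```shell
# Fabricated sample of `aws s3 ls --recursive` output.
sample='2023-05-01 12:00:00 1234 develop/aws-pcluster/build_cache/linux-amzn2-x86_64_v4-gcc-12.2.0-intel-oneapi-compilers-2023.1.0-abcdef.spack
2023-05-01 12:00:01 5678 develop/aws-pcluster/build_cache/linux-amzn2-x86_64_v4-gcc-12.2.0-zlib-1.2.13-123456.spack'

# Same filter as the signing job: keep intel-oneapi entries, take the key
# column, and strip everything up to and including build_cache/.
keys=$(printf '%s\n' "$sample" | grep intel-oneapi | awk '{print $4}' | sed -e 's?^.*build_cache/??g')
printf '%s\n' "$keys"
```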
Do I understand this correctly? You are deleting from the mirror some (possibly large) subset of the binaries just built, in order to avoid re-distribution? I thought it was going to be ok to leave these in the stack-specific mirrors, as long as we don't include them at the root. The way it is here, isn't it going to force every matching spec to be rebuilt from source on every pipeline?
No, I am only deleting software that's distributed as binaries by Intel/ARM anyway. This is only oneAPI and ArmPL software. On the one hand we are on the safe side regarding re-distribution, and on the other hand the installation "from source" downloads pre-built binaries anyway.
What I should have said here, but I just didn't think of at the time, is that some time ago we made it an error (see here) for dependencies in a pipeline rebuild job to be installed from source. The definition of "from source" there doesn't consider that it might be downloading pre-built binaries, it's either from a spack buildcache or it's "from source".
I'm also curious if you know whether this is something to worry about: https://gitlab.spack.io/spack/spack/-/jobs/7020733#L67
I am afraid it's still using the "old" container which built …
Same issue with the "old" container with the wrong gcc target. I will re-run the pipeline after it completes. Not sure how long the containers are cached, but I pushed the "new" one 8h ago. I hope this is sufficient now.
Not sure how this works in GitHub, but it seems the request for these changes is blocking the PR. @scottwittenburg I already pushed 1 pipeline per stack. Do I need to do anything else to get it resolved?
I know you ran into a problem with spackbot causing duplicated pipelines, but normally if you push your PR branch while a pipeline is running, gitlab is supposed to notice and cancel the previous one running on the same ref. Usually it's fairly reliable, but as @alalazo pointed out in slack, there are some gitlab bugs filed around it. I'd go ahead and push your branch if you think a bunch of specs are going to change hashes and have to rebuild anyway. But I'll leave the decision up to you.
Sorry I wasn't more clear about it, but I was hoping you would end up with two pipelines for two stacks, not add two more pipelines; but I get that this was easier to accomplish. The way I think it should be done (but let me know if you agree @kwryankrattiger) is that the specs which can be concretized/built on the same image should be combined into a single stack, and then mapping rules defined in yaml configs should take care of making sure the correct arch, etc, is used for each. But I understand it's not so clear how to achieve that, and you're under a tight deadline at this point. So maybe you can just file an issue to clean up these new stacks/pipelines and re-examine what's going on here when time is not so tight. Maybe in the issue, you can link to this comment for details. Once everything builds, I'll be ok with merging this and revisiting it in a subsequent PR.
share/spack/gitlab/cloud_pipelines/stacks/aws-pcluster-neoverse_n1/spack.yaml
@scottwittenburg can we merge? I opened an issue to address merging the stacks.