
Add aws-pcluster[-aarch64] stacks #37627

Merged
scottwittenburg merged 15 commits into spack:develop from stephenmsachs:aws-pcluster-stacks
May 17, 2023

Conversation

@stephenmsachs
Contributor

These stacks build packages defined in
https://github.com/spack/spack-configs/tree/main/AWS/parallelcluster

They use a custom container from https://github.com/spack/gitlab-runners which
includes the necessary ParallelCluster software to link and build against, as well
as an upstream spack installation with a current GCC and its dependencies.

Intel and ARM software is installed and used during the build stage but removed
from the buildcache before the signing stage.

Files `configs/linux/{arch}/ci.yaml` select the necessary providers in order to
build for specific architectures (icelake, skylake, neoverse_{n,v}1).
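For illustration only, a per-arch config of that shape might look like the sketch below. The path, tag, and variable names here are assumptions, not the actual contents of this PR's files:

```yaml
# Hypothetical sketch of a configs/linux/icelake/ci.yaml; the real files
# in this PR may differ in both keys and values.
ci:
  pipeline-gen:
  - build-job:
      tags: ["icelake"]
      variables:
        SPACK_TARGET_ARCH: "icelake"
```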

Stephen Sachs added 2 commits May 12, 2023 10:23
@spackbot-app bot added the `core` (PR affects Spack core functionality) and `gitlab` (Issues related to gitlab integration) labels May 12, 2023
- - /bin/bash "${SPACK_ARTIFACTS_ROOT}/postinstall.sh" -fg
- spack config --scope site add "packages:all:target:\"target=${SPACK_TARGET_ARCH}\""
- signing-job:
before_script:
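As a side note on the `spack config --scope site add` line above: `spack config add` takes a colon-separated path with the final quoted token as the value, so (assuming `SPACK_TARGET_ARCH=neoverse_v1`) the mechanical result in the site scope should be roughly the sketch below. This is an unverified illustration, not taken from these pipelines:

```yaml
# Approximate site-scope packages.yaml after the `spack config add`
# above, assuming SPACK_TARGET_ARCH=neoverse_v1 (illustrative only).
packages:
  all:
    target: ["target=neoverse_v1"]
```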
Contributor Author


I have checked that the rebuild-index job will succeed after deleting a few packages in the signing job. But will there be consequences when downloading?

Contributor

@scottwittenburg left a comment


Looking at this, I'm noticing that you've created 4 new pipelines, associated with only 2 new stacks. Can it be re-organized to add one pipeline (generate/build pair) per stack? We try to avoid allowing multiple jobs to write binaries for the same hash in parallel, as the race conditions that can result often cause checksum mismatches between the generated binaries and associated meta-data files.

But let me know if I've just missed something.

@stephenmsachs
Contributor Author

Looking at this, I'm noticing that you've created 4 new pipelines, associated with only 2 new stacks. Can it be re-organized to add one pipeline (generate/build pair) per stack? We try to avoid allowing multiple jobs to write binaries for the same hash in parallel, as the race conditions that can result often cause checksum mismatches between the generated binaries and associated meta-data files.

But let me know if I've just missed something.

Thanks for the review. I made it one stack per pipeline per your suggestion. I guess there would be potential for race conditions otherwise, as some of the build targets overlap.

@stephenmsachs
Contributor Author

@spackbot re-run pipeline

@spackbot-app

spackbot-app bot commented May 17, 2023

I've started that pipeline for you!

- $optimized_libs


mirrors: { "mirror": "s3://spack-binaries/develop/aws-pcluster-aarch64" }
Contributor Author


I will change the mirror names to match the pipelines in another commit after the pipelines have run, though.

Contributor

@kwryankrattiger left a comment


Some comments. Given the deadline, maybe some of these can be a follow-on.

- - curl -LfsS "https://github.com/JuliaBinaryWrappers/GNUMake_jll.jl/releases/download/GNUMake-v4.3.0+1/GNUMake.v4.3.0.x86_64-linux-gnu.tar.gz" -o gmake.tar.gz
- printf "fef1f59e56d2d11e6d700ba22d3444b6e583c663d6883fd0a4f63ab8bd280f0f gmake.tar.gz" | sha256sum --check --strict --quiet
- tar -xzf gmake.tar.gz -C /usr bin/make 2> /dev/null
tags: ["x86_64_v4"]
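The fetch-verify-unpack pattern above can be exercised offline with a scratch file; everything below (file names, contents) is illustrative, not from the job:

```shell
# Illustrative only: verify a scratch file the same way the job
# verifies gmake.tar.gz before unpacking it.
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/payload"

# Record the expected digest, then re-check it with the same flags as
# the job: fail on any malformed or mismatched checksum line.
expected=$(sha256sum "$tmpdir/payload" | awk '{print $1}')
if printf '%s  payload\n' "$expected" | (cd "$tmpdir" && sha256sum --check --strict --quiet); then
  result="checksum OK"
else
  result="checksum FAILED"
fi
echo "$result"
rm -r "$tmpdir"
```

Pinning the digest this way (as the job does) makes the download tamper-evident even over an untrusted connection.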
Contributor


It seems to me like icelake and skylake could be reduced to x86_64_v4.

The stack spack.yamls can specify the SPACK_TARGET_ARCH under their own variables.

Contributor Author


You are right. I did not have a separate stack spack.yaml when I made these files.

- gettext

- compiler_target:
- '%[email protected] target=x86_64_v3'
Contributor


This doesn't seem consistent with the icelake configs above. Am I missing something?

Contributor Author


The architecture used is defined by the cluster head node. But clusters may have different architecture compute nodes, so I want the compilers to be as broad as possible. If I made this _v4 and added a zen2 compute node, I would not be able to use the compiler for the compute node.
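The compatibility argument can be made concrete with a toy model: a binary built for target T runs on node N only if N supports every feature T assumes. The feature lists below are simplified stand-ins, not real archspec data:

```shell
# Toy model of target compatibility; feature sets are made up.
v3="avx avx2 fma"
icelake="avx avx2 fma avx512f avx512vnni"
zen2="avx avx2 fma sha"

# can_run "<binary target features>" "<node features>": succeed iff
# every feature the binary assumes is present on the node.
can_run() {
  for f in $1; do
    case " $2 " in *" $f "*) ;; *) return 1 ;; esac
  done
}

can_run "$v3" "$zen2" && echo "x86_64_v3 binaries run on zen2"
can_run "$icelake" "$zen2" || echo "icelake binaries do not run on zen2"
```

This is why a compiler pinned to the broad x86_64_v3 level stays usable if, say, a zen2 compute node joins the cluster, while an icelake-targeted one would not.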

Contributor


If that is a concern, then I don't understand why this stack is specifically building for icelake.


ci:
pipeline-gen:
- build-job:
Contributor


It seems like this part could be moved into share/spack/gitlab/cloud_pipelines/configs/pcluster/ci.yaml and included or added to the generate jobs configs.

Contributor Author


Yes, I can do that.

view: false

definitions:
- compiler_specs:
Contributor


Since this stuff is identical between the stacks, it may be worthwhile to merge the aarch64 stacks and the x86_64_v4 stacks, and use a matrix spec and submappings to assign the runner tags appropriately:

ci:
  pipeline-gen:
  - match_behavior: first
    submapping:
    - match:
      - target=neoverse_n1
      build-job:
        tags: ["graviton2"]
    - match:
      - target=neoverse_v1
      build-job:
        tags: ["graviton3"]

I am not sure if both neoverse_n1 and neoverse_v1 can be concretized on graviton3 runners, but I am pretty sure icelake and skylake can be, so I would think the same backwards compatibility might apply to ARM architectures.

Contributor Author


Let me take a look at this after the deadline. It seems I need to learn some more syntax.

@scottwittenburg
Contributor

I wonder what's going on in this early-stage pkgconf job. It seems to find no dependencies in the buildcache and have to install them all from source.

before_script:
# Do not distribute Intel & ARM binaries
- - for i in $(aws s3 ls --recursive ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/ | grep intel-oneapi | awk '{print $4}' | sed -e 's?^.*build_cache/??g'); do aws s3 rm ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/$i; done
- for i in $(aws s3 ls --recursive ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/ | grep armpl | awk '{print $4}' | sed -e 's?^.*build_cache/??g'); do aws s3 rm ${SPACK_REMOTE_MIRROR_OVERRIDE}/build_cache/$i; done
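The grep/awk/sed key extraction in these loops can be exercised offline on a fake listing; the bucket paths and hashes below are made up, not real mirror contents:

```shell
# Fake `aws s3 ls --recursive` output: date, time, size, key.
listing='2023-05-17 10:00:00 123456 develop/aws-pcluster/build_cache/linux-amzn2-x86_64_v4-gcc-intel-oneapi-mkl-abc123.spack
2023-05-17 10:00:01 789 develop/aws-pcluster/build_cache/linux-amzn2-x86_64_v4-gcc-zlib-def456.spack'

# Keep only intel-oneapi entries, take the key column ($4), and strip
# everything up to and including "build_cache/" -- same filter as the job.
extracted=$(printf '%s\n' "$listing" \
  | grep intel-oneapi \
  | awk '{print $4}' \
  | sed -e 's?^.*build_cache/??g')
echo "$extracted"
```

Only the intel-oneapi key survives the filter; the zlib entry is left alone, which matches the job's intent of removing just the vendor binaries.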
Contributor


Do I understand this correctly? You are deleting from the mirror some (possibly large) subset of the binaries just built, in order to avoid re-distribution? I thought it was going to be ok to leave these in the stack-specific mirrors, as long as we don't include them at the root. The way it is here, isn't it going to force every matching spec to be rebuilt from source on every pipeline?

Contributor Author


No, I am only deleting software that is distributed as binaries by Intel/ARM anyway; this is only oneAPI and ArmPL software. On the one hand we are on the safe side regarding re-distribution, and on the other hand the installation "from source" downloads pre-built binaries anyway.

Contributor


What I should have said here, but I just didn't think of at the time, is that some time ago we made it an error (see here) for dependencies in a pipeline rebuild job to be installed from source. The definition of "from source" there doesn't consider that it might be downloading pre-built binaries, it's either from a spack buildcache or it's "from source".

@scottwittenburg
Contributor

I'm also curious if you know whether this is something to worry about: https://gitlab.spack.io/spack/spack/-/jobs/7020733#L67

@stephenmsachs
Contributor Author

I wonder what's going on in this early-stage pkgconf job. It seems to find no dependencies in the buildcache and have to install them all from source.

I am afraid it's still using the "old" container which built [email protected] target=neoverse_n1 by mistake. If you pull the container now it has [email protected] target=aarch64 as is required in the job.

@stephenmsachs
Contributor Author

I'm also curious if you know whether this is something to worry about: https://gitlab.spack.io/spack/spack/-/jobs/7020733#L67

Same issue with the "old" container with the wrong gcc target. I will re-run the pipeline after it completes. Not sure how long the containers are cached, but I pushed the "new" one 8h ago. I hope this is sufficient now.

@stephenmsachs
Contributor Author

Looking at this, I'm noticing that you've created 4 new pipelines, associated with only 2 new stacks. Can it be re-organized to add one pipeline (generate/build pair) per stack? We try to avoid allowing multiple jobs to write binaries for the same hash in parallel, as the race conditions that can result often cause checksum mismatches between the generated binaries and associated meta-data files.

But let me know if I've just missed something.

Not sure how this works in GitHub, but it seems the request for these changes is blocking the PR. @scottwittenburg I already pushed one pipeline per stack. Do I need to do anything else to get it resolved?

@scottwittenburg
Contributor

I will re-run the pipeline after it completed.

I know you ran into a problem with spackbot causing duplicated pipelines, but normally if you push your PR branch while a pipeline is running, gitlab is supposed to notice and cancel the previous one running on the same ref. Usually it's fairly reliable, but as @alalazo pointed out in slack, there are some gitlab bugs filed around it. I'd go ahead and push your branch if you think a bunch of specs are going to change hashes and have to rebuild anyway. But I'll leave the decision up to you.

@scottwittenburg self-requested a review May 17, 2023 17:57
@scottwittenburg
Contributor

I already pushed one pipeline per stack. Do I need to do anything else to get it resolved?

Sorry I wasn't more clear about it, but I was hoping you would end up with two pipelines for two stacks, not add two more pipelines; I get that this was easier to accomplish, though. The way I think it should be done (but let me know if you agree @kwryankrattiger) is that the specs which can be concretized/built on the same image should be combined into a single stack, and then mapping rules defined in yaml configs should take care of making sure the correct arch, etc., is used for each.

But I understand it's not so clear how to achieve that, and you're under a tight deadline at this point. So maybe you can just file an issue to clean up these new stacks/pipelines and re-examine what's going on here when time is not so tight. Maybe in the issue, you can link to this comment for details.

Once everything builds, I'll be ok with merging this and revisiting it in a subsequent PR.

@stephenmsachs
Contributor Author

@scottwittenburg can we merge? I opened an issue to address merging the stacks.

@scottwittenburg merged commit 125c20b into spack:develop May 17, 2023
RikkiButler20 pushed a commit to RikkiButler20/spack that referenced this pull request May 23, 2023