Draft: Add ppc64le pipelines for ML builds #39174
nicholas-sly wants to merge 44 commits into spack:develop from
Conversation
I'd rather name everything "ppc64le" to be consistent with
scottwittenburg left a comment
Thanks for contributing some new stacks @nicholas-sly! I've added a few comments; addressing the one about the missing docker image may let your stacks get into the actual pipeline-generation step on GitLab.
Also, @eugeneswalker manages the only Power runners available to us; let's see what he has to say about their capacity to handle this extra workload.
I pinged @kwryankrattiger in my review as well, since he came up with the new scheme for organizing the gitlab ci configs.
Thanks again!
share/spack/gitlab/cloud_pipelines/stacks/ml-linux-power64le-cuda/spack.yaml
share/spack/gitlab/cloud_pipelines/stacks/ml-linux-power64le-cpu/spack.yaml
share/spack/gitlab/cloud_pipelines/stacks/ml-linux-power64le-cpu/spack.yaml
```yaml
- build-job:
    image:
      name: ecpe4s/fedora36-runner-ppc64le:2023-01-01
      entrypoint: ['']
```
I think one of these build-job sections should be removed. @kwryankrattiger can correct me if I have this backwards, but I think these are applied bottom-up, so your final job configuration will just be the first one. In that case, you can remove the second.
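If the bottom-up merge described here is accurate, a minimal sketch of the situation (image names are taken from this PR's diffs; the merge behavior is as claimed in this comment, not independently verified):

```yaml
ci:
  pipeline-gen:
  - build-job:        # under bottom-up merging, this entry "wins"
      image:
        name: ecpe4s/ubuntu20.04-runner-ppc64le:2023-01-01
        entrypoint: ['']
  - build-job:        # this entry would be overwritten, so it could be removed
      image:
        name: ecpe4s/fedora36-runner-ppc64le:2023-01-01
        entrypoint: ['']
```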
So, I assume you mean to have only one of the ubuntu or fedora jobs. I'm actually intentionally targeting both: many of these packages, if they support Linux at all, are built for Ubuntu, and if users are installing these packages locally on a Linux system, it is very likely Ubuntu. But most of the HPC systems I've interacted with are SLES- or RHEL-based. As such, I think it is useful to ensure that these packages are tested against both OSes as best we can.
If I'm just specifying it incorrectly to achieve that end, then I'm happy to change it.
I think testing both ubuntu and rhel is overkill right now. We don't even do that for x86_64. This is something worth adding someday but idk if we have the CI bandwidth to do both yet.
I've already experienced builds that succeeded on ubuntu but not on rhel8. I'm leaving it in because tests that pass on ubuntu do us no good when the majority of HPC systems are on a rhel-based OS.

It really comes down to what our goal is with the CI system. If we want to ensure that changes to these packages won't have a significant impact on most of our users who try to build these recipes on most of the systems they might use, then we need to ensure that our CI is representative of those systems and users. If we just want a build that works, even if it doesn't represent anything an end user might use, we can probably ignore ppc64le altogether.

As for CI bandwidth, most changes in Spack shouldn't even trigger this. The point (as I see it) is to test for any changes that will impact these builds. Once these pipelines are integrated, they shouldn't be triggered often, unless they need to be. I'm not sure how we judge this bandwidth, or whether such limiting factors are publicly available, but I think this is necessary for now. Obviously, once all of these systems have been decommissioned, we can remove these pipelines. Until then, they are a useful tool for getting these ML packages working on the machines we want them to build on.
share/spack/gitlab/cloud_pipelines/stacks/ml-linux-power64le-cuda/spack.yaml
@scottwittenburg @adamjstewart It seems the tests are all passing now. Let me know if you have any further changes to request. GitHub says there's one unaddressed request, but everything I'm seeing is either outdated or addressed. Thanks.
What's the reason for no ROCm?
Are there any machines that have ppc64le processors and AMD GPUs? I'm not aware of any and I'm under the impression that IBM isn't going to be making new machines. If that's not the case, we can try.
I have no idea
adamjstewart left a comment
We can figure out the Bazel stuff another day
Every pipeline needs to write into its own mirror. Concurrent pipelines writing to the same destination create race conditions that result in weird errors.
Currently the solution is to have a spack.yaml for every pair of generate/build jobs. But once we have #39939, that won't be necessary, as the spack.yaml will no longer be where we specify the mirrors for spack pipelines.
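A hedged sketch of the current per-stack pattern described above (the mirror name and URL here are placeholders for illustration, not the project's actual buildcache configuration):

```yaml
# One spack.yaml per generate/build pair, each pointing at its own
# mirror so concurrent pipelines never write to the same destination.
spack:
  mirrors:
    buildcache-destination: s3://example-bucket/ml-linux-power64le-cpu
```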
```yaml
ci:
  pipeline-gen:
  - build-job:
      image:
        name: ecpe4s/ubuntu20.04-runner-ppc64le:2023-01-01
        entrypoint: ['']
  - build-job:
      image:
        name: ecpe4s/rhel7-runner-ppc64le:2023-01-01
        entrypoint: ['']
```
Did you mean to have some submapping here, rather than just build-job? If this was working, I'm not sure how. Given that you were trying to have two spack.yaml files for the four pipelines, you may have been able to do something like this instead:
Example:

```yaml
...
- match_behavior: first
  submapping:
  - match:
    - os=ubuntu20.04
    build-job:
      image:
        name: ecpe4s/ubuntu20.04-runner-ppc64le:2023-01-01
        entrypoint: ['']
```

But that's irrelevant if you change to one stack per pipeline. If you wait until #39939 is merged, then I think you could stick with only two spack.yaml files for your four pipelines, and the above example would be how you map build jobs to docker images.
Not sure what the issue is here. I want each stack to build on both ubuntu and rhel8. In YAML, the pipeline-gen key takes a list, so I just put two elements in that list. You can see from the CI build that this properly broke the pipeline out into separate jobs, each using the appropriate container.
I am planning to wait for #39939 to be merged to avoid breaking this into multiple files only to be able to deduplicate after that PR goes through.
It looks to me like it didn't break them out properly. This job was supposed to be rhel8, but it ran on the ubuntu container.
Interesting indeed. The container does seem to indicate ubuntu for the rhel job. But then `uname -a` indicates a rhel OS; presumably that's the host, so fair enough. The `spack arch` output seems to corroborate the ubuntu OS. But then the gnuconfig dependency installation, along with the libiconv installation directories, indicates a rhel OS. I guess that's because the generate jobs ran in the correct container. I just pushed a commit that tries to be more explicit about which container image should be used for the CI jobs.
For reasons I cannot fathom, a CI job ending in -build will not accept an image key in the .gitlab-ci.yaml file. Likewise, it can "extend" another job that does take an image key, but it does not honor that same image when building. I can try modifying the spack.yaml files according to your example above, but without proper documentation I'll be doing a good bit of guess-and-check with CI parsing to try to get it right.
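For context, a minimal sketch of how `extends` normally behaves in plain GitLab CI (the job names here are hypothetical; the image is one of the containers from this PR), illustrating the inheritance this comment expected but did not observe:

```yaml
# Standard GitLab CI `extends` semantics: the child job inherits the
# parent's `image` unless the child overrides that key itself.
.ppc64le-rhel7:
  image:
    name: ecpe4s/rhel7-runner-ppc64le:2023-01-01
    entrypoint: ['']

example-build:
  extends: .ppc64le-rhel7
  script:
    - spack install
```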
Forgot about this comment, but want to resurface this:

> I think testing both ubuntu and rhel is overkill right now. We don't even do that for x86_64. This is something worth adding someday but idk if we have the CI bandwidth to do both yet.
I stand by my response to that comment: #39174 (comment)
The latest pipeline run https://gitlab.spack.io/spack/spack/-/pipelines/500642 is an example of exactly this happening.
FYI, Spack is slowly moving in the direction of modeling glibc such that packages will no longer require any system dependencies. Once this is done, Ubuntu and RHEL will be the same system, and there should be no real reason to test both.
While I appreciate your optimism in this respect, the demonstrable differences between the two OSes mean I'm going to have to see the two pipelines succeed side by side, with a significant portion of the ML packages we're interested in, before I'm willing to downsample to a single OS.
Based on Slack discussion, it seems like ppc64le is on its way out and our CI runners are limited anyway. Let's scrap this unless IBM decides to contribute to this effort.
No description provided.