
Set build_jobs dynamically in CI to avoid oversubscription #35996

Merged
alalazo merged 6 commits into spack:develop from haampie:fix/oversubscription-in-ci
Mar 13, 2023

Conversation

@haampie (Member) commented Mar 10, 2023

Instead of setting a fixed number of jobs (32) in spack,
set them dynamically based on YAV (yet another variable),
which corresponds to the Kubernetes CPU request.

This mostly matters for packages that use a lot of parallel
resources for a long time, like paraview and the like:
as it is right now, their resource request is 11 CPUs
while their usage is 3x as much, so it's likely that jobs
get killed because of oversubscription.

Note that I've increased the resource request from 11 to
16 in "huge", because 11 is not huge.

For small packages this doesn't matter much, since
the peak resource usage is often very short. But I've
also added a build_jobs limit there anyhow; if that
slows down builds too much, we can allow more
oversubscription there again later.
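The idea, roughly, is to round the millicore CPU request down to whole CPUs and use that as the jobs count. The sketch below is illustrative only; the exact variable name and rounding in the PR may differ:

```shell
# Illustrative sketch: derive a build_jobs value from a Kubernetes-style
# millicore CPU request such as "11000m". The variable name and the
# rounding here are assumptions, not taken from the actual pipeline code.
KUBERNETES_CPU_REQUEST="11000m"
millicores="${KUBERNETES_CPU_REQUEST%m}"   # strip the trailing "m"
build_jobs=$(( millicores / 1000 ))        # round down to whole CPUs
echo "spack config add config:build_jobs:${build_jobs}"
```

With a request of 11000m this yields a build_jobs value of 11, matching the "huge" resource request discussed above.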

@spackbot-app bot added the core (PR affects Spack core functionality) and gitlab (Issues related to gitlab integration) labels on Mar 10, 2023
@scottwittenburg (Contributor) left a comment


This seems like a good idea, but as you point out, while it hopefully addresses the problems we've seen with py-tensorflow, it may have some detrimental effect on other specs. But it's worth giving a try, IMO.

@scottwittenburg (Contributor) commented

Just wondering if the following is something we might consider in the future, to have more control over the -j factor for groups of specs. I'm not sure the spack config ... incantation is correct, but hopefully the idea is clear enough.

diff --git a/share/spack/gitlab/cloud_pipelines/stacks/e4s/spack.yaml b/share/spack/gitlab/cloud_pipelines/stacks/e4s/spack.yaml
index 8b961a32e0..2a632ac5b2 100644
--- a/share/spack/gitlab/cloud_pipelines/stacks/e4s/spack.yaml
+++ b/share/spack/gitlab/cloud_pipelines/stacks/e4s/spack.yaml
@@ -252,6 +252,7 @@ spack:
       - spack arch
       - cd ${SPACK_CONCRETE_ENV_DIR}
       - spack env activate --without-view .
+      - spack config add "config:build_jobs:${SPACK_CONFIG_BUILD_JOBS}"
       - spack config add "config:install_tree:projections:${SPACK_JOB_SPEC_PKG_NAME}:'morepadding/{architecture}/{compiler.name}-{compiler.version}/{name}-{version}-{hash}'"
       - mkdir -p ${SPACK_ARTIFACTS_ROOT}/user_data
       # AWS runners mount E4S public key (verification), UO runners mount public/private (signing/verification)
@@ -284,6 +285,7 @@ spack:
             CI_JOB_SIZE: huge
             KUBERNETES_CPU_REQUEST: 11000m
             KUBERNETES_MEMORY_REQUEST: 42G
+            SPACK_CONFIG_BUILD_JOBS: 12
 
       - match:
           - cuda
@@ -321,6 +323,7 @@ spack:
             CI_JOB_SIZE: large
             KUBERNETES_CPU_REQUEST: 8000m
             KUBERNETES_MEMORY_REQUEST: 12G
+            SPACK_CONFIG_BUILD_JOBS: 8
 
       - match:
           - adios2
@@ -389,6 +392,7 @@ spack:
             CI_JOB_SIZE: "medium"
             KUBERNETES_CPU_REQUEST: "2000m"
             KUBERNETES_MEMORY_REQUEST: "4G"
+            SPACK_CONFIG_BUILD_JOBS: 2
 
       - match:
           - alsa-lib
@@ -450,12 +454,14 @@ spack:
             CI_JOB_SIZE: "small"
             KUBERNETES_CPU_REQUEST: "500m"
             KUBERNETES_MEMORY_REQUEST: "500M"
+            SPACK_CONFIG_BUILD_JOBS: 1
 
       - match: ['os=ubuntu20.04']
         runner-attributes:
           tags: ["spack", "x86_64"]
           variables:
             CI_JOB_SIZE: "default"
+            SPACK_CONFIG_BUILD_JOBS: 1
 
     broken-specs-url: "s3://spack-binaries/broken-specs"

@haampie (Member, Author) commented Mar 10, 2023

If there's an alternative way of communicating the number of usable cpus to spack, that'd be helpful.

(Uh, didn't see the last comment. That'd help)

@haampie (Member, Author) commented Mar 10, 2023

Yeah, the main downside of capping the number of jobs is that it's probably pessimistic, since it's likely the majority of the jobs are on average not very parallel (including git clone, configure, cmake, etc).

@kwryankrattiger (Contributor) commented

#34272 is just about to be merged; I think it would be better to wait for that before putting this in.

@haampie (Member, Author) commented Mar 10, 2023

Okay, I dropped the static build_jobs from the environments, made it "dynamic" with spack config add, but also added more jobs to those "huge" jobs because they weren't particularly huge -- it looks like the tensorflow and paraview builds etc. pretty much saturate the number of jobs during their build, so 11 is not "huge".

@scottwittenburg scottwittenburg self-requested a review March 10, 2023 19:23

@haampie haampie force-pushed the fix/oversubscription-in-ci branch from b0a5ee6 to 66917f2 Compare March 10, 2023 19:36
@haampie haampie changed the title Fix oversubscription in CI Set build_jobs dynamically in CI to avoid oversubscription Mar 10, 2023
@kwryankrattiger (Contributor) left a comment


LGTM, hopefully this gets things to pass now!

@haampie (Member, Author) commented Mar 10, 2023

@kwryankrattiger / @scottwittenburg can you take this PR over from here?

You still gotta push git revert a0d68a38a899286ef9a589de25c4f94159f9cae2 of course ;)

@haampie (Member, Author) commented Mar 11, 2023

the resource "CPURequest" requested "16000m" is higher than limit allowed "12"
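For reference on the units in that error (generic Kubernetes arithmetic, not code from this PR): a request of "16000m" means 16000 millicores, i.e. 16 whole CPUs, which exceeds a 12-CPU limit:

```shell
# Generic Kubernetes unit arithmetic (illustration only): convert a
# millicore request to whole CPUs and compare against the cluster limit.
request="16000m"
limit_cpus=12
request_cpus=$(( ${request%m} / 1000 ))    # 16000m -> 16 CPUs
if [ "$request_cpus" -gt "$limit_cpus" ]; then
    echo "request of ${request_cpus} CPUs exceeds limit of ${limit_cpus}"
fi
```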

@kwryankrattiger (Contributor) commented

I dropped the CPU request to 12. If the job times out, then getting kube to give more than 12 cores would be another fix.

@haampie (Member, Author) commented Mar 12, 2023

Seems like it works. Wanna revert that commit and merge?

This reverts commit a0d68a3.
@alalazo alalazo merged commit 2107b6b into spack:develop Mar 13, 2023
jmcarcell pushed a commit to key4hep/spack that referenced this pull request Apr 13, 2023
@haampie haampie deleted the fix/oversubscription-in-ci branch April 18, 2023 19:28

Labels

core (PR affects Spack core functionality), gitlab (Issues related to gitlab integration), new-variant, python, update-package
