Set build_jobs dynamically in CI to avoid oversubscription #35996
alalazo merged 6 commits into spack:develop
Conversation
scottwittenburg
left a comment
This seems like a good idea, but as you point out, while it hopefully addresses the problems we've seen with py-tensorflow, it may have some detrimental effect on other specs. But it's worth giving a try, IMO.
Just wondering if below is something we might consider in the future, to have more control over the
If there's an alternative way of communicating the number of usable CPUs to Spack, that'd be helpful. (Oh, I didn't see the last comment. That'd help.)
Yeah, the main downside of capping the number of jobs is that it's probably pessimistic, since most of the jobs are likely not very parallel on average (including git clone, configure, cmake, etc.).
#34272 is just about to be merged; I think it would be better to wait for that before putting this in.
Okay, I dropped the static
Force-pushed b0a5ee6 to 66917f2
kwryankrattiger
left a comment
LGTM, hopefully this gets things to pass now!
@kwryankrattiger / @scottwittenburg can you take this PR over from here? You still gotta push
I dropped the CPU request to 12. If the job times out, then maybe getting kube to give more than 12 cores would be another fix.
Seems like it works. Wanna revert that commit and merge? |
This reverts commit a0d68a3.
Co-authored-by: Zack Galbreath <[email protected]>
Co-authored-by: Ryan Krattiger <[email protected]>
Instead of setting a fixed number of build jobs (32) in Spack,
set it dynamically based on YAV (yet another variable)
that corresponds to the Kubernetes CPU request.
This is mostly for packages that use a lot of parallel
resources for a long time, like paraview and whatnot:
as it is right now, their resource request is 11 CPUs
while their usage is 3x as much, so jobs likely
get killed because of oversubscription.
Note that I've increased the resource request from 11 to
16 in "huge", because 11 is not huge.
For small packages this doesn't matter much, since
the peak resource usage is often very short. But I've
also added a build_jobs limit there anyhow; if that
slows down builds too much, we can allow more
oversubscription there again later.
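The mapping from a Kubernetes CPU request to a build_jobs value could be sketched roughly like this. This is an illustrative sketch, not the PR's actual implementation: the helper name `jobs_from_cpu_request` and the `KUBERNETES_CPU_REQUEST` environment variable are assumptions about how the CI job would derive the value it passes to Spack (e.g. via `spack install -j` or `config:build_jobs`).

```python
import math
import os


def jobs_from_cpu_request(request: str, default: int = 4) -> int:
    """Translate a Kubernetes CPU request string (e.g. "12" or "11500m",
    where "m" means millicpus) into a parallel build-job count.

    Hypothetical helper for illustration; falls back to `default`
    when the variable is unset or unparseable.
    """
    if not request:
        return default
    try:
        if request.endswith("m"):
            # Millicpu form: round up, since a fractional core
            # can still run one job.
            cpus = math.ceil(int(request[:-1]) / 1000)
        else:
            cpus = math.ceil(float(request))
    except ValueError:
        return default
    return max(cpus, 1)


if __name__ == "__main__":
    # In a CI job, the result could be exported before invoking spack,
    # so the build's -j matches what the scheduler actually granted.
    request = os.environ.get("KUBERNETES_CPU_REQUEST", "")
    print(jobs_from_cpu_request(request))
```

The point of the rounding choices is to never hand the build more jobs than cores requested, which is exactly the oversubscription the PR is trying to avoid.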