Make -j flag less exceptional #22360
Conversation
```python
njobs = spack.config.get('config:build_jobs')
if not njobs:
    spack.config.set('config:build_jobs', 4, scope='user')
```
I thought this was redundant, because `config:build_jobs` is always populated: https://github.com/spack/spack/blob/develop/lib/spack/spack/config.py#L103-L117
Ok, this is no longer necessary since you can't expect
The `-j` flag in spack behaves differently from `make`, `ctest`, `ninja`, etc., because it caps the number of jobs at an arbitrary number, 16. It makes a lot more sense for `spack install` to use some reasonable default, and for `spack install -j <num>` to *override* that default; this is how it's done in all other build systems. In particular, I want to be able to write `spack install -j256` on some AMD EPYC system if I feel like it, and spack shouldn't silently stop me from doing that. Maybe spack was meant for HPC login nodes, but even login nodes get fat, and in some centers you are encouraged to compile on login nodes with many cores instead of on compute nodes.

Also, if I don't specify `-j` on the command line, I'm fine with spack limiting the number of build jobs to `min(number of cpus, 16)`; I can see that's a reasonable default, although the 16 is still quite peculiar and unlike other build systems. However, as it is right now, spack does a poor job of determining the number of CPUs on Linux, since it doesn't take cgroups into account. In particular this is problematic if you use distributed builds with Slurm. On an AMD EPYC machine with 256 threads, if I build a big spack environment with many tasks like `srun -c2 -n128 spack -e my_env install`, spack will happily start 128 processes, each running `make -j256`, instead of actually checking the process affinity, which reveals that only 2 threads can be used per task.

So this PR introduces `spack.util.cpus.cpus_available()`, which does the sensible thing on Linux. This should also improve the situation with Docker / Kubernetes, which also use cgroups.
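For reference, on Linux the affinity-aware CPU count can be obtained with `os.sched_getaffinity(0)`, which respects CPU masks set via cgroups, `taskset`, or Slurm, unlike `multiprocessing.cpu_count()`, which reports every CPU on the machine. A minimal sketch of what a `cpus_available()` helper could look like (not necessarily the exact implementation in this PR):

```python
import multiprocessing
import os


def cpus_available():
    """Number of CPUs usable by the current process.

    os.sched_getaffinity(0) returns the set of CPUs this process is
    allowed to run on (Linux only), so it reflects cgroup and Slurm
    limits.  Fall back to the total CPU count elsewhere.
    """
    try:
        return len(os.sched_getaffinity(0))
    except (AttributeError, NotImplementedError):
        return multiprocessing.cpu_count()
```

With this, `srun -c2 ... spack install` would see 2 CPUs per task rather than the machine-wide thread count.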
`lib/spack/spack/build_environment.py` (Outdated)
```python
config_default = config_default or spack.config.get('config:build_jobs')

if config_default is None:
    return max_cpus
```
The previous behavior was that 16 was both the config default and the code default if no config was specified. We chose this because someone ran a study that showed there is virtually no performance gains beyond 16 threads for most build jobs.
I like that this allows the user to override that default, but I think we should preserve the code default of 16 if no config specifies a value.
Suggested change:

```diff
-config_default = config_default or spack.config.get('config:build_jobs')
-if config_default is None:
-    return max_cpus
+config_default = config_default or spack.config.get('config:build_jobs', 16)
```
I didn't want to duplicate that, since it is already fixed in `spack.config.default_config`, which is used for the `_builtin` scope:
```console
$ spack python -c "spack.config.scopes()['_builtin'].get_section('config')['config']['build_jobs']"
16
```
This scope is used almost everywhere, except when `with spack.config.use_configuration(...):` is used and `spack.config.default_config` is not added as the "bottom" scope (which is the case in tests with `config_scope` fixtures; it was also the case during bootstrap, but that's a potential bug addressed in #22610).
Do you think the 16 should still be there?
Oh well, I can just add the `, 16)` default and drop the two lines below it. The only problem I see is that the `_builtin` scope is supposed to solve exactly this problem of providing a default whenever no config files are available... yet there is no guarantee that `_builtin` is an active scope (in tests it usually isn't), so what problem does it actually solve ¯\_(ツ)_/¯.
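Putting the pieces of this discussion together, the intended resolution order (explicit `-j` wins uncapped; otherwise the configured default, falling back to the code default 16, capped by the available CPUs) could be sketched like this. This is a sketch with hypothetical parameter names modeled on the diff above, not the actual spack code; `max_cpus` would in practice come from `cpus_available()`:

```python
def determine_number_of_jobs(parallel=True, command_line=None,
                             config_default=None, max_cpus=1):
    """Resolve the number of build jobs.

    command_line:   value of an explicit `-j <num>` (None if not given)
    config_default: value of config:build_jobs (None if unset)
    max_cpus:       CPUs available to this process
    """
    if not parallel:
        return 1
    if command_line is not None:
        # Explicit -j overrides everything and is never capped.
        return command_line
    if config_default is None:
        # Code default, mirroring the 16 in the _builtin scope.
        config_default = 16
    # Without -j, cap the configured default at the available CPUs.
    return min(max_cpus, config_default)
```

So `spack install -j256` gets 256 jobs even on a 2-CPU allocation, while a plain `spack install` there gets 2.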
Restarted CI :( a random job was "cancelled".
Closes #17598