
Make -j flag less exceptional#22360

Merged
becker33 merged 2 commits intospack:developfrom
haampie:fix-cpus-available-spack-part-2
Mar 30, 2021

Conversation

@haampie
Member

haampie commented Mar 17, 2021

The `-j` flag in spack behaves differently from `make`, `ctest`, `ninja`, etc., because it caps the number of jobs at an arbitrary number, 16.

It makes much more sense for `spack install` to use some reasonable default, and for `spack install -j <num>` to *override* that default. This is how all other build systems do it.

In particular, I want to be able to write `spack install -j256` on some AMD EPYC system if I feel like it, and spack shouldn't silently stop me from doing that. Maybe spack was meant for HPC login nodes, but even login nodes get fat these days, and some centers encourage you to compile on many-core login nodes instead of on compute nodes.

Also, if I don't specify `-j` on the command line, I'm fine with spack limiting the number of build jobs to `min(number of cpus, 16)` -- I can see that's a reasonable default, although the 16 is still peculiar and unlike other build systems. However, as it stands, spack does a poor job of determining the number of cpus on Linux, since it doesn't take cgroups into account. This is particularly problematic for distributed builds with slurm. On an AMD EPYC machine with 256 threads, if I build a big spack environment with many tasks like `srun -c2 -n128 spack -e my_env install`, spack will happily start 128 processes each running `make -j256`, instead of actually checking the process affinity, which reveals that only 2 threads can be used. So this PR introduces `spack.util.cpus.cpus_available()`, which does the sensible thing on Linux. This should also improve the situation with Docker / Kubernetes, which also use cgroups.

Closes #17598
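[Editor's note] A minimal sketch of what an affinity-aware CPU count can look like on Linux. This is an illustration of the idea behind `cpus_available()`, not necessarily the exact code merged in this PR:

```python
import multiprocessing
import os


def cpus_available():
    """Number of CPUs this process may actually use.

    On Linux, os.sched_getaffinity(0) reflects the affinity mask set
    by taskset, cgroups, or slurm (e.g. `srun -c2` yields 2), rather
    than the total CPU count of the machine.
    """
    try:
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity (e.g. macOS, Windows)
        # fall back to the raw CPU count.
        return multiprocessing.cpu_count()
```

Under `srun -c2`, this returns 2 instead of 256 on the EPYC example above.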

@haampie haampie force-pushed the fix-cpus-available-spack-part-2 branch from bf44c88 to 57fb11b Compare March 17, 2021 18:12
Comment on lines -106 to -108
njobs = spack.config.get('config:build_jobs')
if not njobs:
spack.config.set('config:build_jobs', 4, scope='user')

@haampie haampie force-pushed the fix-cpus-available-spack-part-2 branch from c7277b4 to 3971f24 Compare March 29, 2021 10:40
@haampie
Member Author

haampie commented Mar 29, 2021

@alalazo I've tried to address the bug introduced in #19605. The problem was that `spack.config.use_configuration(*config_scopes)` calls `clear_caches`, which deletes `InternalConfigScope`'s values.

Ok, this is no longer necessary, since you can't expect `spack.config.get('config:build_jobs')` to be not-None in tests.

@haampie haampie force-pushed the fix-cpus-available-spack-part-2 branch 2 times, most recently from 28556f6 to 877530c Compare March 30, 2021 12:35
@haampie haampie force-pushed the fix-cpus-available-spack-part-2 branch from 877530c to 1afca67 Compare March 30, 2021 12:37
Comment on lines +480 to +483
config_default = config_default or spack.config.get('config:build_jobs')

if config_default is None:
return max_cpus
Member

The previous behavior was that 16 was both the config default and the code default if no config was specified. We chose this because someone ran a study showing virtually no performance gain beyond 16 threads for most build jobs.

I like that this allows the user to override that default, but I think we should preserve the code default of 16 if no config specifies a value.

Suggested change:

- config_default = config_default or spack.config.get('config:build_jobs')
-
- if config_default is None:
-     return max_cpus
+ config_default = config_default or spack.config.get('config:build_jobs', 16)

Member Author

@haampie haampie Mar 30, 2021

I didn't want to duplicate that, since it is already fixed in `spack.config.default_config`, which is used for the `_builtin` scope:

$ spack python -c "spack.config.scopes()['_builtin'].get_section('config')['config']['build_jobs']"
16

This scope is used almost everywhere, except when `with spack.config.use_configuration(...):` is used and `spack.config.default_config` is not added as the "bottom" scope (which is the case in tests with `config_scope` fixtures, and was the case during bootstrap too, but that's a potential bug addressed in #22610).

Do you think the 16 should still be there?

Member Author

@haampie haampie Mar 30, 2021

Oh well, I can just do `, 16)` and drop the two lines below that. The only problem I see is that the `_builtin` scope tries to solve this problem of providing a default whenever no config files are available... yet there is no guarantee that `_builtin` is an active scope (in tests it usually isn't), so what problem does that solve ¯\_(ツ)_/¯

Member Author

done...
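[Editor's note] A hypothetical sketch of the precedence this thread converges on: an explicit `-j` wins outright, otherwise the config default (falling back to 16) applies, capped by the available CPUs. Parameter names and defaults here are illustrative, not the exact merged code:

```python
def determine_number_of_jobs(parallel, command_line=None,
                             config_default=16, max_cpus=16):
    # Serial-only packages always build with a single job.
    if not parallel:
        return 1
    # An explicit `spack install -j N` is taken at face value,
    # even if N exceeds the CPU count (e.g. -j256 on an EPYC box).
    if command_line is not None:
        return command_line
    # Otherwise use the config default (16 when no config file
    # provides config:build_jobs), capped by the CPUs this process
    # may actually use per its affinity mask.
    return min(max_cpus, config_default)
```

With 4 CPUs available and no flags, this yields `min(4, 16) = 4`; with `-j256` it yields 256.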

@haampie
Member Author

haampie commented Mar 30, 2021

Restarted CI :( a random job was "cancelled"

@becker33 becker33 merged commit 1db6cd5 into spack:develop Mar 30, 2021
@haampie haampie deleted the fix-cpus-available-spack-part-2 branch March 30, 2021 19:12

Development

Successfully merging this pull request may close these issues.

Make -j, --jobs flags definitive

3 participants