
Use process cpu affinity instead of hardware specs to get cpu count #17566

Closed · haampie wants to merge 1 commit into spack:develop from haampie:fix-cpus-available-spack
Conversation

haampie (Member) commented Jul 17, 2020

When using spack under Slurm and/or in containers, I found that spack does not respect the number of cores made available to the process, but instead uses as many threads as the hardware provides. This is an issue mainly in CI, where spack commands may be executed on different nodes with different core counts, and where many commands like spack install are generated or implicit (e.g. in spack ci rebuild), such that one cannot easily pass the true number of available cores.

As an example, in my case spack was running make -j72, whereas only 8 cores were available. This happens because of the following:

  • The CI job got scheduled on a machine with 72 threads according to nproc --all
  • I have configured build_jobs to a high/unlimited number, with the idea of letting spack autodetect the number of available cores. (This seems sensible to me for CI: the same config is used for different nodes, and some nodes have a large number of cores, so the default limit of 16 is too small.)
  • Our CI uses Slurm to start jobs, where one can control the number of cores per task via the SLURM_CPUS_PER_TASK variable. I had set it to 8 because I want to run multiple jobs on the same node. (Same story with Kubernetes and cpu requests, or your favorite docker setup with the --cpus flag.)
  • spack uses the Python call multiprocessing.cpu_count() to attempt to limit the number of build jobs to the number of cores available, but this returns the number of cores on the hardware level (72) instead of the number of cores available to the process (8).

Note that this is the same difference between calling $ nproc --all (returns 72) and $ nproc (returns 8). The latter is correct.

To fix this I'm using len(os.sched_getaffinity(0)), which is Linux-specific, to get the number of cores/threads actually available. According to the man page, passing 0 is equivalent to getting the cpu affinity of the current process, so this should be fine.
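As a sketch of the idea (not the exact Spack patch; the helper name is hypothetical), an affinity-aware CPU count with a portable fallback could look like this:

```python
import multiprocessing
import os


def available_cpu_count():
    """CPUs usable by the current process (hypothetical helper).

    On Linux, os.sched_getaffinity(0) reflects cgroup/taskset/Slurm
    restrictions on the current process (like `nproc`), while
    multiprocessing.cpu_count() reports the hardware thread count
    (like `nproc --all`). Other platforms lack sched_getaffinity,
    so fall back to the hardware count there.
    """
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return multiprocessing.cpu_count()
```

On an unrestricted machine both paths agree; inside a cpuset-limited container or a Slurm task, the first branch returns the smaller, correct number.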

To reproduce in Docker:

$ docker run --cpuset-cpus=0,1 spack/ubuntu-bionic install llvm+clang

it should have exactly 2 cores available for the process, but with the default config it will build with make -j16 if you have at least 16 procs.

To reproduce with Slurm:

$ srun -N1 -n1 -c2 spack install llvm+clang

it should have exactly 2 cores available for the process, but builds with -j16 in the default config if you have at least 16 procs.

@haampie haampie force-pushed the fix-cpus-available-spack branch 6 times, most recently from 71c533e to 5b877fc Compare July 17, 2020 13:34
@haampie haampie force-pushed the fix-cpus-available-spack branch from 5b877fc to d41123f Compare July 17, 2020 13:35
alalazo (Member) commented Jul 17, 2020

> In my case spack was running make -j72 whereas I only gave it 8 cores in slurm.

I wonder how "72" is possible. We have a default limit of 16 unless users modify the config (or there's a bug in the current logic somewhere).

haampie (Member, Author) commented Jul 17, 2020

You're right, I've set build_jobs: [very large number], because the same config is used for different compute nodes which have a different number of cores. I just want to make sure spack uses all resources without having to explicitly set the core count by hand.

So, with this patch, spack respects linux cgroups, which is used in slurm and container runtimes to restrict resources to certain processes. I would say not respecting that is a bug.

Of course I could just spack install -j$(nproc) [spec], but this won't work in CI where the install command is generated.

alalazo (Member) commented Jul 17, 2020

> So, with this patch, spack respects linux cgroups, which is used in slurm and container runtimes to restrict resources to certain processes. I would say not respecting that is a bug.

Sorry, I don't get the bug being resolved here. Spack spawns 72 processes because it was told to do so. If the 72 processes are hogging more than the 8 physical cpus that Spack should be restricted to, then that seems like a system configuration error to me rather than Spack's fault. Am I missing part of the issue?

haampie (Member, Author) commented Jul 17, 2020

So there are two things:

  1. I don't want to explicitly set build_jobs to the exact number of cores, so I set it to a very large value like 1024, effectively unlimited.
  2. spack sets the effective number of build jobs to min(num_cores, build_jobs).

Now the problem with 2 is that num_cores should not be derived from the hardware specification, but from the cores effectively available to the current process. It should use nproc, but it is basically using nproc --all.

E.g. in docker you have this flag to specify the cpus being made available, and you can see the difference between nproc and nproc --all

$ nproc
8
$ docker run --cpuset-cpus=0,1 --rm ubuntu:18.04 nproc
2
$ docker run --cpuset-cpus=0,1 --rm ubuntu:18.04 nproc --all
8

If I run spack in a container like the latter, I would assume the effective number of build jobs is computed as min(2, build_jobs), not min(8, build_jobs).
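To make the arithmetic concrete, here is a sketch of the capping heuristic described above (the function name is hypothetical; 16 is the default build_jobs):

```python
def effective_build_jobs(detected_cpus, build_jobs=16):
    # Spack caps the parallel build level at the number of CPUs it detects.
    return min(detected_cpus, build_jobs)


# Hardware count seen inside the --cpuset-cpus=0,1 container (nproc --all):
print(effective_build_jobs(8))   # -> 8, oversubscribes the 2 pinned cores
# Affinity-aware count (nproc):
print(effective_build_jobs(2))   # -> 2, matches the cpuset
```

The same min() is what turns build_jobs: 1024 into -j72 on a 72-thread node when the detected count ignores cgroups.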

Hopefully that makes sense.

alalazo (Member) commented Jul 18, 2020

Thanks @haampie now I understand what you're saying. Can you update the description at the top with the information in #17566 (comment)?

I personally see this issue as something in between a minor bug and implementation-defined behavior. I say that because (though not documented) Python multiprocessing calls this function:

https://github.com/python/cpython/blob/cb9879b948a19c9434316f8ab6aba9c4601a8173/Modules/posixmodule.c#L12609

which in turn calls this other function:

np = sysconf(_SC_NPROCESSORS_ONLN);     /* processors available */

which returns the number of processors "available", for some definition of "available". The Python docs are not much help: the multiprocessing docs suggest the solution in this PR to get the number of "usable" cpus per process, and the os docs say the function might not be available on all Unix systems; indeed it's present only on Linux.

I'd like to hear the opinion of other maintainers before merging this, since in my opinion it's a slightly better implementation, but:

  1. It will change the default behavior for all Linux users (people might have workflows based on the current behavior?)
  2. There are multiple ways in which you can enforce a given number of build jobs
  3. We'll be adding a platform specific implementation to our code base
  4. The default value limited to 16 should be a good one in practice

Regarding point 4, I mean that using 16 processes on a system confined to fewer cores shouldn't be that much of a problem, and seeing a speed-up beyond make -j 16 is rare. You have issues because you set build jobs to a very large number to start with. Does that make sense to you? @becker33 @tgamblin @scheibelp

haampie (Member, Author) commented Jul 20, 2020

> for some definition of "available"

I don't think this is a very useful definition of 'available': the man page talks about hot-pluggable cpus, so it's most likely something at the hardware level.

> 1. It will change the default behavior for all Linux users (people might have workflows based on the current behavior?)

I would be highly surprised if someone relied on this behaviour. If you explicitly fix the number of cores available to a process, you most likely want that to be reflected in make -j <nproc>.

> 2. There are multiple ways in which you can enforce a given number of build jobs

Not in CI where spack ci rebuild commands are generated, which down the line calls spack install. This is exactly the situation where you can control the resources used by the container from the outside, and you expect spack to respect that during spack install.

> 3. We'll be adding a platform specific implementation to our code base

Yeah, maybe it can later move to its own Python package. I would think almost all Linux users do make -j $(nproc), and nproc is based on sched_getaffinity, which is Linux-specific for sure, but it would be weird not to use it just because it is platform-specific.

> 4. The default value limited to 16 should be a good one in practice. [..] You have issues because you set build jobs to a very large number to start with.

Yes, but in CI you might still use the same node for multiple builds, where each container/build takes < 16 cores, and spack will still spin up 16 build tasks, making builds slower.

Also, "You have issues because you set build jobs to a very large number to start with" is really an issue of spack. I just want it to automatically pick the optimal number of cores as I made them available to the process, which is the easiest way to configure builds in e.g. CI, where you don't have as much control because install commands are generated or hidden as a subprocess of spack ci rebuild. Also, in cloud builds you don't know in advance how many cores your build environment has, and therefore adding it to the spack.yaml file is not dynamic enough. So, rather make spack just do the sensible thing instead of having to jump through hoops.


I would say the current implementation is a bug, so it could / should be fixed within a patch version update, right?

alalazo (Member) commented Jul 20, 2020

@haampie I'll try to summarize what I said above, since maybe it was not clear. I think this IS a slightly better implementation but since:

  1. It changes the default behavior for everybody
  2. It is platform specific
  3. There are multiple ways to work around the issue (i.e. it's not that hard to turn $ srun -N1 -n1 -c2 spack install llvm+clang into $ srun -N1 -n1 -c2 spack install -j 2 llvm+clang)

I'd like to hear opinions from other maintainers before merging it. That said, I think that the impact of this is low.

EDIT: Some arguments in the discussion are also inherently based on an assumption I'd like to hear more about:

> I just want it to automatically pick the optimal number of cores as I made them available to the process,

What is the reasoning that leads you to conclude that, with e.g. 72 physical cores available, the optimal way to build is make -j 72 rather than 4 Spack processes that each use make -j 18, or other combinations (and that is without accounting for effects due to over/undersubscribing)?

Also, can you please put in the top level description the configuration setting you need to have to reproduce the issue you are encountering on Slurm? That will make the numbers 72 and 8 understandable without going through the entire discussion.

> Also in cloud builds you don't know in advance how many cores your build environment has, and therefore adding it to the spack.yaml file is not dynamic enough... so, we rather make spack just do the sensible thing instead of having to jump through hoops.

In cloud builds we might be using VMs and that adds another layer of indirection that this PR doesn't handle anyway, right?

haampie (Member, Author) commented Jul 20, 2020

> 1. It changes the default behavior for everybody

If it's seen as a bugfix, which I would say it is, that's only a good thing. Again, it only applies to the edge case where people are using cgroups (through slurm, docker, kubernetes), and if they do, they probably know what they're doing: they don't want spack to spawn 16 jobs when they confine the process to 1 thread.

> 3. There are multiple ways to work around the issue (i.e. it's not that hard to turn $ srun -N1 -n1 -c2 spack install llvm+clang into $ srun -N1 -n1 -c2 spack install -j 2 llvm+clang)

Except when you use spack ci generate: you can't easily change the cli command, nor can you get a proper uniform config.yaml, because it's inside of a container and you want to reuse the same image everywhere, etc. In this case spack should just detect the number of cores correctly.

> What is the reasoning that leads you to the fact that having e.g. 72 physical cores available the optimal way to build

By optimal I mean: number of build tasks = number of cores available. To the extreme: if I want a single single-threaded spack build job on a node of my cluster, I don't want spack to spin up 72 tasks, which it would do now unless I force it with -j 1. Maybe I reserve a single node in my cluster with a beefy CPU and let multiple people use it for CI, where each has a few cores available; they should not be building with 16 jobs by default (they might not even know the hardware, so why specify the -j flag; just make spack be smart about it).

> In cloud builds we might be using VMs and that adds another layer of indirection that this PR doesn't handle anyway, right?

It handles that correctly. VMs do a good job of making nproc / multiprocessing.cpu_count() return the number of virtual cpus; everything else (kubernetes, docker, podman, slurm) uses cgroups to limit cores for a specific process, and spack should just handle that correctly.

alalazo (Member) commented Jul 20, 2020

To reiterate: I see this PR as a minor improvement in some heuristics (compute the number of "usable" cpus on Linux instead of the "available") that also brings some minor concerns with it so I'd like other people to weigh in before merging it.


That said, since you are advocating so strongly for this modification, I wonder if I missed some points. Do you concur that:

  1. If a user wants the number of build jobs to be limited automatically to the lower end, and starts by setting build_jobs to the highest number possible, that's counter-intuitive and amplifies the impact of this issue

  2. cgroups do not need to be respected by Spack, since they are enforced by the kernel (see the phrasing in #17566 (comment) and the top-level description, which was confusing, at least to me). What I mean is that if Spack is restricted to 2 cores on a 16-core machine, it will spawn 16 processes on 2 physical CPUs (assuming no VM is involved).

  3. If a VM is involved, you'll at most get the virtual CPU count right, but you won't have clues about the physical mapping (and here I assume we are talking about physical mapping; otherwise I see no difference from the case of 16 procs on 2 cores, as we might just as well be oversubscribing the hardware)

Also, it seems to me that after some exchange this issue boils down to use cases tied to spack ci .... This was not clear at the beginning, and I wonder if it is a symptom of a missing knob in some config file there.

haampie (Member, Author) commented Jul 20, 2020

> I wonder if I missed some points

Ah, I just discovered part of the confusion might have been by something I had missed, I'm sorry!

The whole time I was thinking this was about getting a proper, sensible default for the number of build jobs when the user does not specify the -j flag. I thought the user could overrule the default by specifying the -j flag on the command line, like spack install -j36 [spec], and it would spin up 36 build jobs.

But this is not the case, since -j just overwrites config:build_jobs, and in the end the effective number of jobs is taken as min(-j value or config:build_jobs, num procs available).

So I would agree that this PR indeed changes the defaults for everybody in a way that might be a bit of a nuisance (e.g. not being able to specify two parallel jobs in a container limited to a single thread, which might still be a useful thing to do).


Wouldn't it be better to do the following:

  • If the user does not specify -j, resort to a sensible default that respects cgroups (to avoid oversubscribing cores) and respects the build_jobs global maximum.
  • If the user does specify the -j flag, always take that exact value, do not take into account the number of cores nor config:build_jobs
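The two bullets above could be sketched roughly as follows (all names are hypothetical; this shows the proposed semantics, not Spack's actual code):

```python
def determine_number_of_jobs(parallel=True, cli_jobs=None,
                             config_jobs=16, usable_cpus=1):
    """Proposed -j semantics (sketch, names hypothetical).

    cli_jobs:    explicit -j value from the command line, or None
    config_jobs: the config:build_jobs ceiling from config.yaml
    usable_cpus: affinity-aware count, e.g. len(os.sched_getaffinity(0))
    """
    if not parallel:
        return 1
    if cli_jobs is not None:
        # An explicit -j always wins, even if it oversubscribes the cgroup.
        return cli_jobs
    # Default: respect cgroups, capped by the configured maximum.
    return min(config_jobs, usable_cpus)
```

Under these semantics, spack install -j36 in a 1-core container would still run 36 jobs (the user asked for it), while a plain spack install in a 2-core cpuset would run 2.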

@haampie haampie mentioned this pull request Jul 20, 2020
@adamjstewart adamjstewart added the feature A feature is missing in Spack label Aug 23, 2020
haampie (Member, Author) commented Mar 17, 2021

Closing this in favor of #22360

@haampie haampie closed this Mar 17, 2021