
[RFC] Fix percpu arena by using all available CPUs (not only eligible)#1939

Closed
azat wants to merge 3 commits into jemalloc:dev from
azat-archive:ncpus-total-over-eligible

Conversation

@azat
Contributor

@azat azat commented Sep 20, 2020

Prefer _SC_NPROCESSORS_ONLN over sched_getaffinity() when choosing the number of
arenas with percpu_arena enabled, since the affinity mask behind
sched_getaffinity() can change at runtime. For example, the process may be
allowed to run only on, say, CPU 8, in which case:

  • ncpus==1
  • but malloc_getcpu() will return 7

And this will trigger the following assertion in a debug build:

<jemalloc>: ../src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"

Here is a reproducer [1].

And AFAICS the eligible number of CPUs cannot be used even for
mutex detection, since the affinity mask can change later.

Refs: #1676
Refs: #808
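The mismatch above can be observed with a small standalone program (an illustrative sketch, not jemalloc code; the helper names are made up for this demo). Running it under `taskset --cpu-list 7` shows the eligible count as 1 while `sched_getcpu()` returns 7:

```c
/* Standalone demo (hypothetical, not jemalloc code): compare the number
 * of eligible CPUs from sched_getaffinity() with the CPU index from
 * sched_getcpu().  Run under e.g. `taskset --cpu-list 7` to reproduce
 * the mismatch described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Number of CPUs the process is currently allowed to run on. */
static int eligible_cpu_count(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

static void print_cpu_counts(void) {
    printf("eligible (sched_getaffinity):  %d\n", eligible_cpu_count());
    printf("online (_SC_NPROCESSORS_ONLN): %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("current CPU (sched_getcpu):    %d\n", sched_getcpu());
}
```

With percpu_arena, the arena index is derived from the current-CPU value while ncpus is derived from the eligible count, hence the assertion failure when the two disagree.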

Prefer _SC_NPROCESSORS_ONLN over sched_getaffinity() for the number of
arenas when percpu_arena is enabled.
The affinity mask behind sched_getaffinity() can change at runtime; the
process may, for example, be allowed to run on only 8 CPUs, in which case:
- ncpus==8
- malloc_getcpu() will return 0/1

Reproducer [1].

  [1]: https://gist.github.com/azat/9770f751c3e7d2da5422cc1ca26339c4

And frankly speaking, the eligible number of CPUs cannot be used even for
mutex detection, since the affinity mask can change later.

Refs: jemalloc#1676
Refs: jemalloc#808
@azat azat force-pushed the ncpus-total-over-eligible branch 2 times, most recently from 9e4ef50 to d62ffab Compare September 20, 2020 13:41
@azat azat force-pushed the ncpus-total-over-eligible branch from c7bd7ba to c9f10ca Compare September 20, 2020 15:36
azat added a commit to azat/ClickHouse that referenced this pull request Sep 30, 2020
This fix is the same as from [1], but backported to 5.2.1.

    [1]: jemalloc/jemalloc#1939
@azat azat changed the title [RFC] Use all available CPUs not only eligible to run for percpu arenas [RFC] Fix percpu arena by using all available CPUs (not only eligible) Oct 1, 2020
@azat
Contributor Author

azat commented Oct 1, 2020

Any comments on this one?

@davidtgoldblatt
Contributor

+cc @interwq -- this seems like it's in your wheelhouse.

@azat
Contributor Author

azat commented Oct 5, 2020

@interwq can you please take a look? Thanks in advance!

@interwq
Contributor

interwq commented Oct 5, 2020

Thanks for sharing the thoughts (also sorry for the delayed response). Indeed the percpu arena feature does not play well with dynamic affinity changes; we didn't really try to optimize for this case -- not only for percpu arena, but for the default number of auto arenas / scalability factor in general.

A workaround for this is to also set narenas, see the comments here: https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc.c#L1843

For example, if you set narenas to the actual number of CPUs, percpu_arena will attempt to utilize those arenas.

However, this can have bad side effects, especially for decay, which by default relies on arenas being accessed regularly to gradually return unused pages to the OS. With dynamic CPU binding changes + percpu_arena, the cached dirty pages in an arena may become "inaccessible" when the affinity for that CPU / arena is removed, i.e. no threads will trigger decay for those arenas.

I'd recommend not using percpu_arena if CPU affinity changes over the lifetime of the binary, and manually setting narenas based on available resources: higher arena count typically improves scalability, at the cost of increased fragmentation / memory usage.
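As a sketch of that workaround: besides the MALLOC_CONF environment variable, jemalloc also reads an application-provided `malloc_conf` string compiled into the binary. The option values below are illustrative; narenas should be the machine's actual total CPU count:

```c
/* Illustrative sketch of the narenas workaround above.  jemalloc reads
 * this symbol during initialization, so it must be a compile-time
 * constant; "narenas:64" stands in for the machine's actual total CPU
 * count.  Equivalent to running with:
 *   MALLOC_CONF="percpu_arena:percpu,narenas:64" <your_program> */
const char *malloc_conf = "percpu_arena:percpu,narenas:64";
```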

@azat
Contributor Author

azat commented Oct 6, 2020

@interwq thank you for the reply!

A workaround for this is to also set narenas, see the comments here:

But right now it is impossible to set narenas dynamically from the binary (without wrapping it in a shell script), or am I missing something?
How about adding one more percpu mode then, say percpu_arena:percpu_all (and maybe a new percpu_arena:percpu_eligible, deprecating percpu_arena:percpu), that would set narenas to all available CPUs?

But this will not solve the other issues (like mutex requirement detection). Actually, I thought about adding two different counters (total and eligible, like you wrote in #1947), but later I came to the conclusion that there is no point in keeping a total-CPU counter.

However, this can cause bad side effects: especially for decay

But shouldn't that be a special case that is not worth supporting (since it sounds pretty tricky, especially with dynamic affinity changes)?

@interwq
Contributor

interwq commented Oct 8, 2020

re: the narenas setting -- you can set it to the total number of CPUs (e.g. via the env var MALLOC_CONF); the arena slots will then be reserved and available for percpu_arena, effectively the percpu_all you asked for.

The case with dynamic CPU affinity is tricky to support; to be honest, we didn't really try hard to optimize for it. For the decay issue, enabling background_thread should solve it (since the dedicated jemalloc background threads will be able to purge the inactive arenas). There may be other unoptimized cases as well, like the mutex spin hint you linked.

@azat
Contributor Author

azat commented Oct 9, 2020

re: narenas setting -- you can set it to the total number of CPUs (e.g. via env var MALLOC_CONF), then the arena slots will be reserved and available for percpu_arena, effectively the percpu_all you asked.

Agree, but my point was that you cannot do this from the binary itself.

@interwq
Contributor

interwq commented Oct 23, 2020

That's true. Unfortunately it won't be easy to make it runtime-changeable, and at the moment it's not on our plan.

Closing, but feel free to ask if anything comes up.

@interwq interwq closed this Oct 23, 2020
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 17, 2021
A deterministic number of CPUs is important for percpu arena to work
correctly, since it uses the CPU index from sched_getcpu(), and if that
index is greater than the number of CPUs, bad things will happen, or an
assertion will fail in a debug build:

    <jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
    Aborted (core dumped)

The number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)

For sched_getaffinity() you may simply use taskset(1) to run the program
on a different CPU, and if that CPU is not the first one, percpu will
work incorrectly, e.g.:

    $ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online; LXD/LXC
virtualize the /sys/devices/system/cpu/online file [1], so when you run
a container with a limited limits.cpus it will bind randomly selected
CPUs to it.

  [1]: lxc/lxcfs#301

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
plays with dentries there.

So if all three of these are equal, percpu arenas should work correctly.

A small note regarding _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF: musl
uses sched_getaffinity() for both. So this will also increase the
entropy.

Also note that you can check whether percpu arena was really applied
using abort_conf:true.

Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806

v2: move malloc_cpu_count_is_deterministic() into
    malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
    allocations for readdir()
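The three-way check described in this commit message can be sketched as follows (a simplified standalone version; jemalloc's actual malloc_cpu_count_is_deterministic() differs in detail):

```c
/* Simplified standalone sketch of the determinism check described
 * above; jemalloc's actual implementation differs in detail.  The CPU
 * count is considered deterministic only when all three sources agree. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <unistd.h>

static bool cpu_count_is_deterministic(void) {
    long conf = sysconf(_SC_NPROCESSORS_CONF); /* /sys/devices/system/cpu/cpu* */
    long onln = sysconf(_SC_NPROCESSORS_ONLN); /* /sys/devices/system/cpu/online */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return false;
    long aff = CPU_COUNT(&mask);               /* eligible CPUs */
    return conf == onln && onln == aff;
}
```

When this returns false (e.g. under taskset, or inside an LXC container with a limited limits.cpus), percpu_arena cannot safely map CPU indices to arenas.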
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 17, 2021
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 18, 2021
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 21, 2021
Lapenkov pushed a commit that referenced this pull request Dec 21, 2021
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 22, 2021
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 22, 2021
@azat azat deleted the ncpus-total-over-eligible branch December 22, 2021 04:15
minaripenguin pushed a commit to minaripenguin/android_external_jemalloc-new that referenced this pull request Jan 28, 2023