[RFC] Fix percpu arena by using all available CPUs (not only eligible)#1939
[RFC] Fix percpu arena by using all available CPUs (not only eligible)#1939azat wants to merge 3 commits intojemalloc:devfrom
Conversation
Prefer _SC_NPROCESSORS_ONLN over sched_getaffinity() for number of arenas for percpu_arena enabled. Since sched_getaffinity() can be changed in runtime, or it can be allowed to run only on, say, 8 CPU, but in this case: - ncpus==8 - malloc_getcpu() will return 0/1 Reproducer [1]. [1]: https://gist.github.com/azat/9770f751c3e7d2da5422cc1ca26339c4 And frankly speaking eligible number of CPUs cannot be used even for mutex detection, since the affinity mask can be changed later. Refs: jemalloc#1676 Refs: jemalloc#808
9e4ef50 to
d62ffab
Compare
d62ffab to
c7bd7ba
Compare
…orrect ncpus v0: use 4 cpu v2: use last cpu
c7bd7ba to
c9f10ca
Compare
This fix is the same as from [1], but backported to 5.2.1.
[1]: jemalloc/jemalloc#1939
|
Any comments on this one? |
|
+cc @interwq -- this seems like it's in your wheelhouse. |
|
@interwq can you please take a look? Thanks in advance! |
|
Thanks for sharing the thoughts (also sorry for the delayed response). Indeed the percpu arena feature does not play well with dynamic affinity changes; we didn't really try to optimize for this case -- not only for percpu arena, but for the default number of auto arenas / scalability factor in general. A workaround for this is to also set for example, if you specify However, this can cause bad side effects: especially for decay, which by default relies on arenas being accessed regularly to return unused pages back to OS gradually. In the case of dynamic CPU binding changes + percpu_arena, the cached dirty pages in an arena may become "unaccessible" when the affinity for that CPU / arena is removed, i.e. no threads will be triggering decay for those arenas. I'd recommend not using percpu_arena if CPU affinity changes over the lifetime of the binary, and manually setting |
|
@interwq thank you for the reply!
But right now it is impossible to set narenas dynamically from the binary (w/o wrapping into shell script), am I missing something? But this will not solve other issues (like the mutex requirement detection), actually I thought about adding two different counters (total and eligible like you wrote in #1947, however later I came to conclusion that there is no point in keeping total CPUs counter)
But shouldn't be the special case that does not worth supporting? (since sounds pretty tricky especially with dynamic affinity changes)? |
|
re: The case with dynamic CPU affinity is tricky to support; to be honest we didn't really try hard to optimize for the case. For the decay issue, enabling background_thread should solve it (since the dedicated jemalloc background threads will be able to purge for the inactive arenas). There might also be other unoptimized cases as well, like the mutex spin hint you linked. |
Agree, but my point was that you cannot do this from the binary |
|
That's true. Unfortunately it won't be easy to make it runtime changeable, and at the moment it's not on our plan. Closing but feel free to ask if anything. |
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: #1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:
<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)
Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)
For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:
$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>
_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it
[1]: lxc/lxcfs#301
_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.
So if all three of these are equal, percpu arenas should work correctly.
And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.
Also note, that you can check is percpu arena really applied using
abort_conf:true.
Refs: jemalloc/jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
Prefer
_SC_NPROCESSORS_ONLNoversched_getaffinity()for number ofarenas for percpu_arena enabled, since
sched_getaffinity()can be changed in runtime, or it can beallowed to run only on, say, 8 CPU, but in this case:
And this will trigger the following assertion in debug build:
Here is a reproducer 1.
And AFAICS eligible number of CPUs cannot be used even for
mutex detection, since the affinity mask can be changed later.
Refs: #1676
Refs: #808