
[RFC] Fix percpu arena by using all available CPUs (not only eligible)#1939

Closed
azat wants to merge 3 commits into jemalloc:dev from
azat-archive:ncpus-total-over-eligible

Conversation

@azat
Contributor

@azat azat commented Sep 20, 2020

Prefer _SC_NPROCESSORS_ONLN over sched_getaffinity() when choosing the number of
arenas with percpu_arena enabled, since the affinity mask behind
sched_getaffinity() can change at runtime. For example, the process may be
allowed to run only on, say, CPU 8, in which case:

  • ncpus==1
  • but malloc_getcpu() will return 7

And this will trigger the following assertion in a debug build:

<jemalloc>: ../src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"

Here is a reproducer [1].

And AFAICS the eligible number of CPUs cannot be used even for
mutex detection, since the affinity mask can change later.

Refs: #1676
Refs: #808
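The mismatch above can be observed with a small standalone program (an illustrative sketch, not jemalloc code; the helper names are made up for this demo). Running it under `taskset --cpu-list 7` shows the eligible count as 1 while `sched_getcpu()` returns 7:

```c
/* Standalone demo (hypothetical, not jemalloc code): compare the number
 * of eligible CPUs from sched_getaffinity() with the CPU index from
 * sched_getcpu().  Run under e.g. `taskset --cpu-list 7` to reproduce
 * the mismatch described above. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Number of CPUs the process is currently allowed to run on. */
static int eligible_cpu_count(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

static void print_cpu_counts(void) {
    printf("eligible (sched_getaffinity):  %d\n", eligible_cpu_count());
    printf("online (_SC_NPROCESSORS_ONLN): %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("current CPU (sched_getcpu):    %d\n", sched_getcpu());
}
```

With percpu_arena, the arena index is derived from the current-CPU value while ncpus is derived from the eligible count, hence the assertion failure when the two disagree.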

Prefer _SC_NPROCESSORS_ONLN over sched_getaffinity() for the number of
arenas when percpu_arena is enabled.
The affinity mask behind sched_getaffinity() can change at runtime; the
process may, for example, be allowed to run on only 8 CPUs, in which case:
- ncpus==8
- malloc_getcpu() will return 0/1

Reproducer [1].

  [1]: https://gist.github.com/azat/9770f751c3e7d2da5422cc1ca26339c4

And frankly speaking, the eligible number of CPUs cannot be used even for
mutex detection, since the affinity mask can change later.

Refs: jemalloc#1676
Refs: jemalloc#808
@azat azat force-pushed the ncpus-total-over-eligible branch 2 times, most recently from 9e4ef50 to d62ffab Compare September 20, 2020 13:41
@azat azat force-pushed the ncpus-total-over-eligible branch from c7bd7ba to c9f10ca Compare September 20, 2020 15:36
azat added a commit to azat/ClickHouse that referenced this pull request Sep 30, 2020
This fix is the same as from [1], but backported to 5.2.1.

    [1]: jemalloc/jemalloc#1939
@azat azat changed the title [RFC] Use all available CPUs not only eligible to run for percpu arenas [RFC] Fix percpu arena by using all available CPUs (not only eligible) Oct 1, 2020
@azat
Contributor Author

azat commented Oct 1, 2020

Any comments on this one?

@davidtgoldblatt
Contributor

+cc @interwq -- this seems like it's in your wheelhouse.

@azat
Contributor Author

azat commented Oct 5, 2020

@interwq can you please take a look? Thanks in advance!

@interwq
Contributor

interwq commented Oct 5, 2020

Thanks for sharing the thoughts (also sorry for the delayed response). Indeed the percpu arena feature does not play well with dynamic affinity changes; we didn't really try to optimize for this case -- not only for percpu arena, but for the default number of auto arenas / scalability factor in general.

A workaround for this is to also set narenas, see the comments here: https://github.com/jemalloc/jemalloc/blob/dev/src/jemalloc.c#L1843

For example, if you set narenas to the actual number of CPUs, percpu_arena will attempt to utilize those arenas.

However, this can have bad side effects, especially for decay, which by default relies on arenas being accessed regularly to gradually return unused pages to the OS. With dynamic CPU binding changes + percpu_arena, the cached dirty pages in an arena may become "inaccessible" when the affinity for that CPU / arena is removed, i.e. no threads will trigger decay for those arenas.

I'd recommend not using percpu_arena if CPU affinity changes over the lifetime of the binary, and manually setting narenas based on available resources: higher arena count typically improves scalability, at the cost of increased fragmentation / memory usage.
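As a sketch of that workaround: besides the MALLOC_CONF environment variable, jemalloc also reads an application-provided `malloc_conf` string compiled into the binary. The option values below are illustrative; narenas should be the machine's actual total CPU count:

```c
/* Illustrative sketch of the narenas workaround above.  jemalloc reads
 * this symbol during initialization, so it must be a compile-time
 * constant; "narenas:64" stands in for the machine's actual total CPU
 * count.  Equivalent to running with:
 *   MALLOC_CONF="percpu_arena:percpu,narenas:64" <your_program> */
const char *malloc_conf = "percpu_arena:percpu,narenas:64";
```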

@azat
Contributor Author

azat commented Oct 6, 2020

@interwq thank you for the reply!

A workaround for this is to also set narenas, see the comments here:

But right now it is impossible to set narenas dynamically from the binary (without wrapping it in a shell script), or am I missing something?
How about adding one more percpu mode then, say percpu_arena:percpu_all (and maybe a new percpu_arena:percpu_eligible, deprecating percpu_arena:percpu), that would set narenas to all available CPUs?

But this will not solve the other issues (like mutex requirement detection). Actually, I thought about adding two different counters (total and eligible, like you wrote in #1947), but later I came to the conclusion that there is no point in keeping a total-CPU counter.

However, this can cause bad side effects: especially for decay

But shouldn't that be a special case that is not worth supporting (since it sounds pretty tricky, especially with dynamic affinity changes)?

@interwq
Contributor

interwq commented Oct 8, 2020

re: the narenas setting -- you can set it to the total number of CPUs (e.g. via the env var MALLOC_CONF); the arena slots will then be reserved and available for percpu_arena, effectively the percpu_all you asked for.

The case with dynamic CPU affinity is tricky to support; to be honest, we didn't really try hard to optimize for it. For the decay issue, enabling background_thread should solve it (since the dedicated jemalloc background threads will be able to purge the inactive arenas). There may be other unoptimized cases as well, like the mutex spin hint you linked.

@azat
Contributor Author

azat commented Oct 9, 2020

re: narenas setting -- you can set it to the total number of CPUs (e.g. via env var MALLOC_CONF), then the arena slots will be reserved and available for percpu_arena, effectively the percpu_all you asked.

Agree, but my point was that you cannot do this from the binary itself.

@interwq
Contributor

interwq commented Oct 23, 2020

That's true. Unfortunately it won't be easy to make it runtime-changeable, and at the moment it's not on our plan.

Closing, but feel free to ask if anything comes up.

@interwq interwq closed this Oct 23, 2020
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 17, 2021
A deterministic number of CPUs is important for percpu arena to work
correctly, since it uses the CPU index from sched_getcpu(), and if that
index is greater than the number of CPUs, bad things will happen, or an
assertion will fail in a debug build:

    <jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
    Aborted (core dumped)

The number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)

For sched_getaffinity() you may simply use taskset(1) to run the program
on a different CPU, and if that CPU is not the first one, percpu will
work incorrectly, e.g.:

    $ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online; LXD/LXC
virtualize the /sys/devices/system/cpu/online file [1], so when you run
a container with a limited limits.cpus it will bind randomly selected
CPUs to it.

  [1]: lxc/lxcfs#301

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
plays with dentries there.

So if all three of these are equal, percpu arenas should work correctly.

A small note regarding _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF: musl
uses sched_getaffinity() for both. So this will also increase the
entropy.

Also note that you can check whether percpu arena was really applied
using abort_conf:true.

Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806

v2: move malloc_cpu_count_is_deterministic() into
    malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
    allocations for readdir()
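The three-way check described in this commit message can be sketched as follows (a simplified standalone version; jemalloc's actual malloc_cpu_count_is_deterministic() differs in detail):

```c
/* Simplified standalone sketch of the determinism check described
 * above; jemalloc's actual implementation differs in detail.  The CPU
 * count is considered deterministic only when all three sources agree. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <unistd.h>

static bool cpu_count_is_deterministic(void) {
    long conf = sysconf(_SC_NPROCESSORS_CONF); /* /sys/devices/system/cpu/cpu* */
    long onln = sysconf(_SC_NPROCESSORS_ONLN); /* /sys/devices/system/cpu/online */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return false;
    long aff = CPU_COUNT(&mask);               /* eligible CPUs */
    return conf == onln && onln == aff;
}
```

When this returns false (e.g. under taskset, or inside an LXC container with a limited limits.cpus), percpu_arena cannot safely map CPU indices to arenas.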
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 17, 2021
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 18, 2021
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
azat added a commit to azat-archive/jemalloc that referenced this pull request Dec 21, 2021
Lapenkov pushed a commit that referenced this pull request Dec 21, 2021
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 22, 2021
azat added a commit to ClickHouse/jemalloc that referenced this pull request Dec 22, 2021
@azat azat deleted the ncpus-total-over-eligible branch December 22, 2021 04:15
minaripenguin pushed a commit to minaripenguin/android_external_jemalloc-new that referenced this pull request Jan 28, 2023