Skip to content

Disable percpu arena in case of non deterministic CPU count#2

Merged
alexey-milovidov merged 3 commits intochfrom
per-cpu-arena-check-ch
Dec 22, 2021
Merged

Disable percpu arena in case of non deterministic CPU count#2
alexey-milovidov merged 3 commits intochfrom
per-cpu-arena-check-ch

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Dec 17, 2021

Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:

<jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
Aborted (core dumped)

Number of CPUs can be obtained from the following places:

  • sched_getaffinity()
  • sysconf(_SC_NPROCESSORS_ONLN)
  • sysconf(_SC_NPROCESSORS_CONF)

For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:

$ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file 1, and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.

So if all three of these are equal, percpu arenas should work correctly.

And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.

Also note, that you can check is percpu arena really applied using
abort_conf:true.

Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806
Refs: jemalloc#2181

v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()

@alexey-milovidov
Copy link
Copy Markdown
Member

What will happen if per-cpu arenas are disabled?
Will it be slow / leak memory?

@azat
Copy link
Copy Markdown
Member Author

azat commented Dec 18, 2021

What will happen if per-cpu arenas are disabled?
Will it be slow / leak memory?

It may take more time to release the memory to the OS, since jemalloc has per-thread pools, and if it is not active nothing will be released, so it depends on the scheduler. But the difference should not be huge though (default narenas is cpu_cores*2).

background_thread can be enabled to avoid this, I've already tried this in #11086, but this was done as an alternative to percpu arenas #11084, and I think that choice was correct.
Plus right now you cannot simply enable background_thread, since because of watchdog it does not work - #23259.
Anyway it will be better than it was, before the behavior was undefined, it may go out of bound and so on, and now it simply disable percpu arena, so it will not work worse then stock jemalloc.
And this should affect only setup in some clouds/vms I guess.
I will get back to this later if any problems will be arise.

@azat azat force-pushed the per-cpu-arena-check-ch branch from 10655f3 to afce492 Compare December 22, 2021 04:04
@azat azat marked this pull request as ready for review December 22, 2021 04:05
@azat
Copy link
Copy Markdown
Member Author

azat commented Dec 22, 2021

Patch merged to upstream.

Also this PR reverts custom patches from jemalloc (93e27e4 and 3b877a1) that was a workaround for percpu arenas, so now we are using almost upstream version, except for our fork based on 5.2.1 while patch had been merged into dev branch (cafe9a3) and we also need one more patch that exists only in dev branch (a943172), so this fork is still required.

azat added 2 commits December 22, 2021 07:12
Determinitic number of CPUs is important for percpu arena to work
correctly, since it uses cpu index - sched_getcpu(), and if it will
greater then number of CPUs bad thing will happen, or assertion will be
failed in debug build:

    <jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
    Aborted (core dumped)

Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)

For the sched_getaffinity() you may simply use taskset(1) to run program
on a different cpu, and in case it will be not first, percpu will work
incorrectly, i.e.:

    $ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online, LXD/LXC
virtualize /sys/devices/system/cpu/online file [1], and so when you run
container with limited limits.cpus it will bind randomly selected CPU to
it

  [1]: lxc/lxcfs#301

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody
playing with dentries there.

So if all three of these are equal, percpu arenas should work correctly.

And a small note regardless _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF,
musl uses sched_getaffinity() for both. So this will also increase the
entropy.

Also note, that you can check is percpu arena really applied using
abort_conf:true.

Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806

v2: move malloc_cpu_count_is_deterministic() into
    malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
    allocations for readdir()
v3:
- mark cpu_count_is_deterministic static
- check only if percpu arena is enabled
- check narenas
@azat azat force-pushed the per-cpu-arena-check-ch branch from afce492 to 189c070 Compare December 22, 2021 04:12
@alexey-milovidov alexey-milovidov merged commit a140480 into ch Dec 22, 2021
@alexey-milovidov alexey-milovidov self-assigned this Dec 22, 2021
@azat azat deleted the per-cpu-arena-check-ch branch December 22, 2021 18:25
antonio2368 pushed a commit that referenced this pull request Apr 1, 2026
We tried to load `g` from `bitmap[i]` before checking it is actually a
valid load. Tweaked a loop a bit to `break` early, when we are done
scanning for bits.

Before this commit undefined behaviour sanitizer from GCC 14+ was
unhappy at `test/unit/bitmap` test with following error.

```
../include/jemalloc/internal/bitmap.h:293:5: runtime error: load of
address 0x7bb1c2e08008 with insufficient space for an object of type
'const bitmap_t'
<...>
    #0 0x62671a149954 in bitmap_ffu ../include/jemalloc/internal/bitmap.h:293
    #1 0x62671a149954 in test_bitmap_xfu_body ../test/unit/bitmap.c:275
    #2 0x62671a14b767 in test_bitmap_xfu ../test/unit/bitmap.c:323
    #3 0x62671a376ad1 in p_test_impl ../test/src/test.c:149
    #4 0x62671a377135 in p_test ../test/src/test.c:200
    jemalloc#5 0x62671a13da06 in main ../test/unit/bitmap.c:336
<...>
```
antonio2368 pushed a commit that referenced this pull request Apr 1, 2026
This commit allows to enable sanitizers with autoconf options, instead
of modifying `CFLAGS`, `CXXFLAGS` and `LDFLAGS` directly.

* `--enable-tsan` option to enable Thread Sanitizer.
* `--enable-ubsan` option to enable Undefined Behaviour Sanitizer.

End goal is to speedup development by finding problems quickly, early
and easier. Eventually, when all current issues will be fixed, we can
enable sanitizers in CI. Fortunately, there are not a lot of problems we
need to fix.

Address Sanitizer is a bit controversial, because it replaces memory
allocator, so we decided to left it out for a while.

Below are couple of examples of how tests look like under different
sanitizers at the moment.

```
$  ../configure --enable-tsan --enable-debug
<...>
asan               : 0
tsan               : 1
ubsan              : 0
$ make -j`nproc` check
<...>
  Thread T13 (tid=332043, running) created by main thread at:
    #0 pthread_create <null> (libtsan.so.0+0x61748)
    #1 thd_create ../test/src/thd.c:25 (bin_batching+0x5631ca)
    #2 stress_run ../test/unit/bin_batching.c:148
(bin_batching+0x40364c)
    #3 test_races ../test/unit/bin_batching.c:249
(bin_batching+0x403d79)
    #4 p_test_impl ../test/src/test.c:149 (bin_batching+0x562811)
    jemalloc#5 p_test_no_reentrancy ../test/src/test.c:213
(bin_batching+0x562d35)
    jemalloc#6 main ../test/unit/bin_batching.c:268 (bin_batching+0x40417e)

SUMMARY: ThreadSanitizer: data race
../include/jemalloc/internal/edata.h:498 in edata_nfree_inc
```

```
$ ../configure --enable-ubsan --enable-debug
<...>
asan               : 0
tsan               : 0
ubsan              : 1
$ make -j`nproc` check
<...>
=== test/unit/hash ===
../test/unit/hash.c:119:16: runtime error: left shift of 176 by 24
places cannot be represented in type 'int'
<...>
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants