Tweak jemalloc for more stable performance tests #11401
Conversation
 * system overhead.
 */
#if defined(__linux__)
#define JEMALLOC_PURGE_MADVISE_FREE
AFAIK, it's also Ok on FreeBSD.
#define JEMALLOC_PURGE_MADVISE_FREE
#else
#define JEMALLOC_PURGE_MADVISE_DONTNEED
#define JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS
FWIW, JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS should be disabled for non-Linux (regardless of this PR).
And there are some other issues with the shipped header config; I guess it is better to fix them separately (Linux is OK, but FreeBSD at least has some issues).
But according to CMake, jemalloc is supported only on Linux, so never mind.
Some comments from Telegram (translated from Russian): Alexander Kuzmenkov, [03.06.20 22:37] Alexey Milovidov, [03.06.20 22:56] Azat Khuzhin, [03.06.20 23:53] And this leads to the following:
It's a bit convoluted there (and worth looking at tcmalloc, which at first glance seemed cleaner), but at first sight it looks like it was exactly the muzzy->clean transition via mmap() that had the effect.
Not sure this is what's needed; maybe just try increasing muzzy_decay_ms? (since it seems to control when DONTNEED is used), or disable it entirely (set it to -1).
Either dirty->muzzy will be a no-op, or muzzy will go away, i.e. it will be dirty->clean.
So, what behaviour should we expect in production on servers with an old Linux kernel?
I want to try another approach -- disable DONTNEED_ZEROS so that MADV_FREE is used, plus increase the muzzy decay time.
AFAIR (from when I looked into it) this PR does not change behavior for older Linux kernels, since:
And actually even the jemalloc upgrade (#11163) does not change behavior much, since in the case of
Interesting, can you point at the code? (I can't find such bits there.)
Sorry, I'm out of context... do you mean that it's using
This was my misinterpretation of your comment above. It looks like DONTNEED without DONTNEED_ZEROS is a no-op, and decommit is used instead if one of them is not defined.
# muzzy_decay_ms -- use MADV_FREE when available on newer Linuxes, to avoid
# spurious latencies and additional work associated with MADV_DONTNEED.
# See https://github.com/ClickHouse/ClickHouse/issues/11121 for motivation.
set (JEMALLOC_CONFIG_MALLOC_CONF "percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:10000")
AFAICS this is the default value; that is, it should now be the same as in upstream.
Where can I find it to check? I was looking here: https://github.com/jemalloc/jemalloc/blob/ea6b3e973b477b8061e0076bb257dbd7f3faa756/include/jemalloc/internal/arena_types.h#L12 and I also saw a comment in the changelog about disabling muzzy decay by default.
Where can I find it to check?
You are right, my bad; I got this from the "documentation" (I knew I should have looked into the sources instead, sigh).
So, jemalloc/jemalloc@8e9a613 "disables" it by default, but according to the documentation:
A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. A decay time of -1 disables purging.
So with the default settings it should purge (free) muzzy pages immediately, and after the first version of this PR it uses mmap for this; maybe that slows down freeing but speeds up allocation, and hence latency goes down?
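As a side note, these options can also be experimented with without rebuilding jemalloc: it reads the MALLOC_CONF environment variable at startup, and a binary linked against jemalloc can define the malloc_conf symbol. A sketch (config fragment, assuming a jemalloc-linked binary), mirroring the JEMALLOC_CONFIG_MALLOC_CONF value from the diff:

```c
/* jemalloc picks up this symbol at startup in a binary linked against it,
 * applying the same options the CMake setting bakes in. */
const char *malloc_conf = "percpu_arena:percpu,oversize_threshold:0,"
                          "muzzy_decay_ms:10000";
```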
muzzy_decay_ms -- use MADV_FREE when available on newer Linuxes
From the documentation:
Approximate time in milliseconds from the creation of a set of unused muzzy pages until an equivalent set of unused muzzy pages is purged (i.e. converted to clean) and/or reused. Muzzy pages are defined as previously having been unused dirty pages that were subsequently purged in a manner that left them subject to the reclamation whims of the operating system (e.g. madvise(...MADV_FREE)), and therefore in an indeterminate state.
So AFAICS muzzy_decay_ms controls muzzy->clean, so it uses madvise(MADV_DONTNEED) or mmap, while dirty_decay_ms uses MADV_FREE.
No, muzzy->clean makes pages "clean", i.e. they are zero-filled and can be used again after this operation (if I understand correctly), hence it uses mmap with MAP_FIXED to clean the arena.
Indeed, and the fact that after 1b97274 it became faster is interesting; at least I don't understand it for now. Actually, there are some metrics that may help with understanding this:
It is a little bit trickier than this, since there is also
The new test results just don't make sense.
jemalloc code paths have changed, they are somewhat more frequent in profiles, and the SoftPageFaults metric variability is the same for at least some queries. By the way, the most unstable queries (cryptographic_hashes) are due to the query profiler -- half of the traces is
6c77191 to 09b9a30
Let's see what we have with muzzy decay enabled. Perf test result: no real performance or stability changes. Call graph for madvise: MADV_DONTNEED is still called a lot, and the total number of madvise calls is higher.
Experiment #3: muzzy decay 10 ms, MADV_FREE, no MADV_DONTNEED. Some queries are faster, but some are slower. Interestingly, no mmap calls from jemalloc were registered by the profiler -- not even sure how this can be; maybe it uses some weird wrapper. Ran
@akuzm AFAIK flame graphs are built automatically, and they are in the archive; is it the one you posted in #11401 (comment) (I tried playing with the SHA in the link -- it did not work)?
Nope, it's a custom thing -- all call stacks with madvise in them. I'm using a script like this to build it from the perf test output archive: https://gist.github.com/akuzm/3f06cb15c7093d823554a14f1a9a1fcd
@akuzm correct me if I'm wrong:
[1]:
In experiment 2.3, many arithmetic queries are 10-20% slower, but why? It just doesn't make any sense, as usual. We see about ten synonyms for query time, measured in terms of various clocks, all increasing.
OSWriteChars was the only suspicious metric, so I turned off the query profiler, and now there are almost no changes in performance ¯\_(ツ)_/¯ But still, a couple of arithmetic queries are slower, with the same nonsensical metrics:
Due to a bug in CI, a second performance test task just finished comparing the same commits, and now 15 arithmetic queries are faster! It's hilarious.
Statistically significant changes in query metrics for these two runs of the same commit. Baffling.
@akuzm Which machines are used for the performance tests? Virtualization? Maybe there is steal? (BTW, may be worth adding it into) I doubt that this will have an effect, but maybe worth checking with turbo boost disabled? (
No virtualization.
Sure, what exactly to add?
I'll try to check this one. By the way, we now have graphs for CPU frequency in the perf test output archive, in 'metrics/CpuFrequenceMHz_0.png' etc.; they look very noisy, but I didn't study them in detail. My current hypothesis is that changes in
steal from
Yeah, I saw this code in
Even trickier...
Merged muzzy_decay_ms = 10s.
Changelog category (leave one):
See #11121 for motivation and discussion.
master: no muzzy decay, MADV_DONTNEED.
The slowest.
Flame graph for madvise.
Experiment no. 2: muzzy decay 10 s, both MADV_FREE and MADV_DONTNEED
Result
No performance difference compared to master.
Flame graph for madvise -- the setting worked.
A couple of variations that didn't really give new results:
2.1: muzzy decay 60 s, MADV_FREE + MADV_DONTNEED
2.2: muzzy decay 10s, FREE, DONTNEED, w/new metrics
Experiment no. 3: muzzy decay 10 s, MADV_FREE, no MADV_DONTNEED
Somewhat faster: Result
Flame graph for madvise confirms that only MADV_FREE and not MADV_DONTNEED is used. Some queries are faster, but some are slower. Interestingly, no mmap calls from jemalloc were registered by the profiler -- not even sure how this can be; maybe it uses some weird wrapper.
Ran website.xml locally and looked at top -- both VIRT and RES decrease within seconds after a query finishes, so I guess there is no problem with freeing memory in this configuration.
Experiment no. 4: no muzzy decay, MADV_FREE
Results not much different from no.3 and better than master.
Flame graph for madvise -- different from no. 3, so the settings work. Not sure what it means though.
Experiment no. 5: no MADV_FREE/DONTNEED at all.
Surprisingly, it's the fastest: results
The memory is not unmapped, so RSS always grows: see the graph of MemoryResident from system.asynchronous_metrics
Open questions
- Run website.xml with clickhouse-benchmark and measure the resulting QPS. The performance test only runs queries in isolation, and many of them are single-threaded, so it is harder to notice the possibly increased concurrency overhead.
- Why are no mmap calls from jemalloc registered with our profiler? I thought that maybe it uses some custom wrapper, or blocks signals (which would block our profiler), but no, judging from the sources it's just a normal mmap.