
Tweak jemalloc for more stable performance tests#11401

Closed
akuzm wants to merge 40 commits into master from aku/tweak-jemalloc

Conversation

@akuzm
Contributor

@akuzm akuzm commented Jun 3, 2020

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

See #11121 for motivation and discussion.

master: no muzzy decay, MADV_DONTNEED.

The slowest.
Flame graph for madvise

Experiment no. 2: muzzy decay 10 s, both MADV_FREE and MADV_DONTNEED

Result
No performance difference compared to master.
Flame graph for madvise -- the setting worked.

A couple of variations that didn't really give new results:
2.1: muzzy decay 60 s, MADV_FREE + MADV_DONTNEED
2.2: muzzy decay 10 s, MADV_FREE + MADV_DONTNEED, with new metrics

Experiment no. 3: muzzy decay 10 s, MADV_FREE, no MADV_DONTNEED

Somewhat faster: Result

Flame graph for madvise confirms that only MADV_FREE and not MADV_DONTNEED is used.

Some queries are faster, but some are slower. Interestingly, the profiler registered no mmap calls from jemalloc -- not sure how that can be; maybe it uses some unusual wrapper.

Ran website.xml locally and watched top -- both VIRT and RES decrease within seconds after a query finishes, so I guess there is no problem with freeing memory in this configuration.

Experiment no. 4: no muzzy decay, MADV_FREE

Results not much different from no.3 and better than master.
Flame graph for madvise -- different from no. 3, so the settings work. Not sure what it means though.

Experiment no. 5: no MADV_FREE/DONTNEED at all.

Surprisingly, it's the fastest: results
The memory is not unmapped, so RSS always grows: see the graph of MemoryResident from system.asynchronous_metrics

Open questions

  • What is the impact of these changes on a concurrent workload -- e.g., run all queries from website.xml with clickhouse-benchmark, and measure the resulting QPS. The performance test only runs queries in isolation, and many of them are single-threaded, so it is harder to notice the possibly increased concurrency overhead.
  • Why are no mmap calls from jemalloc registered by our profiler? I thought it might use a custom wrapper, or block signals (which would block our profiler), but no -- judging from the sources it's just a normal mmap.
    • opt.retain (bool) r- is disabled by default on 64-bit Linux

@blinkov blinkov added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 3, 2020
* system overhead.
*/
#if defined(__linux__)
#define JEMALLOC_PURGE_MADVISE_FREE
Member

AFAIK, it's also Ok on FreeBSD.

#define JEMALLOC_PURGE_MADVISE_FREE
#else
#define JEMALLOC_PURGE_MADVISE_DONTNEED
#define JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS
Member

@azat azat Jun 3, 2020


FWIW, JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS should be disabled for non-Linux (regardless of this PR)

Member

FWIW, JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS should be disabled for non-Linux (regardless of this PR)

And there are some other issues with the shipped header config; I guess it is better to fix this separately (Linux is OK; at least FreeBSD has some issues)

Member

But according to cmake, jemalloc is supported only on Linux, so never mind

@akuzm
Contributor Author

akuzm commented Jun 4, 2020

Some comments from Telegram (translated from Russian):

Alexander Kuzmenkov, [03.06.20 22:37]
Oh, I found the issue that describes why they need two-stage page release. We need to somehow disable the second stage, or something like that...

jemalloc/jemalloc#521

Alexey Milovidov, [03.06.20 22:56]
[In reply to Alexander Kuzmenkov]
Our max_server_memory_usage is computed from RSS and then limits new queries, so we can't be too lazy about giving memory back to the system.

Azat Khuzhin, [03.06.20 23:53]
[In reply to Alexander Kuzmenkov]
So #11401
disables MADV_DONTNEED and leaves only MADV_FREE

This leads to the following:

  • pages_purge_lazy (dirty->muzzy) works as before, since MADV_FREE is preferred there over MADV_DONTNEED (and in fact MADV_DONTNEED was never used there anyway, because JEMALLOC_PURGE_MADVISE_DONTNEED_ZEROS is also set), so the dirty->muzzy purging behavior hasn't changed at all, and the fallback to DONTNEED doesn't work, same as before

  • pages_purge_forced (muzzy->clean) uses a plain mmap() (overlap mapping) instead of MADV_DONTNEED

It's a bit convoluted there (we should also look at tcmalloc, which at first glance seemed cleaner), but at first glance it looks like it's exactly the muzzy->clean via mmap() that had the effect.

We need some way to make it use MADV_FREE if available, and MADV_DONTNEED otherwise.

Not sure that's what we need; maybe just try increasing muzzy_decay_ms? (it seems to control when DONTNEED is used), or disable it entirely (set it to -1)

MADV_FREE is allowed, but I don't understand what will happen if it is not supported.

Either dirty->muzzy will be a no-op, or muzzy will go away, i.e. it will become dirty->clean
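The decay settings discussed in this thread can be tried at runtime through jemalloc's documented MALLOC_CONF environment variable, without rebuilding. The values below are only a sketch (10 s dirty decay, muzzy purging disabled), not the PR's final configuration:

```shell
# jemalloc reads MALLOC_CONF at process startup (a documented jemalloc feature).
# dirty_decay_ms: delay before the lazy purge (madvise(MADV_FREE) where
#                 available) turns dirty pages muzzy; 10000 ms = 10 s.
# muzzy_decay_ms: delay before the forced purge turns muzzy pages clean;
#                 -1 disables this stage entirely.
export MALLOC_CONF="dirty_decay_ms:10000,muzzy_decay_ms:-1"
```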

@alexey-milovidov
Member

So, what behaviour to expect in production on servers with old Linux kernel?
Maybe it's ready for merge?

@akuzm
Contributor Author

akuzm commented Jun 5, 2020

So, what behaviour to expect in production on servers with old Linux kernel?
Maybe it's ready for merge?

I want to try another approach -- disable DONTNEED_ZEROS so that MADV_FREE is used + increase muzzy decay time.
By the way, this PR didn't significantly affect the variability of perf test results -- I'm seeing only a 5% decrease in query duration spread, on average for unstable queries.

@azat
Member

azat commented Jun 5, 2020

So, what behaviour to expect in production on servers with old Linux kernel?

AFAIR (from when I looked into it), this PR does not change behavior for older Linux, since:

  • dirty->muzzy always uses MADV_FREE if defined
  • muzzy->clean uses mmap instead of MADV_DONTNEED after this PR

And actually even the jemalloc upgrade (#11163) does not change behavior much: if MADV_FREE is not supported, dirty->muzzy is simply replaced by dirty->clean, and that's it. So everything should be OK even if MADV_FREE is not supported.

disable DONTNEED_ZEROS so that MADV_FREE is used

Interesting, can you point at the code? (I can't find such bits there)

@alexey-milovidov
Member

muzzy->clean uses mmap instead of MADV_DONTNEED after this PR

Sorry, I'm out of context... do you mean that it's using munmap to return memory?
(it should be significantly slower than MADV_DONTNEED and MADV_FREE)

@akuzm
Contributor Author

akuzm commented Jun 5, 2020

disable DONTNEED_ZEROS so that MADV_FREE is used

Interesting, can you point at the code? (I can't find such bits there)

This was my misinterpretation of your above comment. Looks like DONTNEED without DONTNEED_ZEROS is a no-op, and decommit is used instead if one of them is not defined.

https://github.com/jemalloc/jemalloc/blob/8da0896b7913470250a0220504822028e2aa8f2a/include/jemalloc/internal/pages.h#L41

@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Jun 5, 2020
# muzzy_decay_ms -- use MADV_FREE when available on newer Linuxes, to avoid
# spurious latencies and additional work associated with MADV_DONTNEED.
# See https://github.com/ClickHouse/ClickHouse/issues/11121 for motivation.
set (JEMALLOC_CONFIG_MALLOC_CONF "percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:10000")
Member

AFAICS this is the default value, which means it should now be the same as upstream

Contributor Author

Where can I find it to check? I was looking here: https://github.com/jemalloc/jemalloc/blob/ea6b3e973b477b8061e0076bb257dbd7f3faa756/include/jemalloc/internal/arena_types.h#L12 and also saw a comment in the changelog about disabling muzzy decay by default.

Member

Where can I find it to check?

You are right, my bad, I got this from "documentation" (I knew that I should look into sources instead, sigh)

So, jemalloc/jemalloc@8e9a613 "disables" it by default, but according to documentation:

A decay time of 0 causes all unused muzzy pages to be purged immediately upon creation. A decay time of -1 disables purging.

So with default settings it should purge (free) muzzy pages immediately, and after the first version of this PR it uses mmap for this; maybe that slows down freeing but speeds up allocation, and hence latency goes down?

muzzy_decay_ms -- use MADV_FREE when available on newer Linuxes

From the documentation:

Approximate time in milliseconds from the creation of a set of unused muzzy pages until an equivalent set of unused muzzy pages is purged (i.e. converted to clean) and/or reused. Muzzy pages are defined as previously having been unused dirty pages that were subsequently purged in a manner that left them subject to the reclamation whims of the operating system (e.g. madvise(...MADV_FREE)), and therefore in an indeterminate state.

So AFAICS muzzy_decay_ms controls muzzy->clean, which uses madvise(MADV_DONTNEED) or mmap, while dirty_decay_ms controls the stage that uses MADV_FREE

@azat
Member

azat commented Jun 5, 2020

Sorry, I'm out of context... do you mean that it's using munmap to return memory?

No, muzzy->clean makes pages "clean", i.e. they are zero-filled and can be used again after this operation (if I understand correctly), hence it uses mmap with MAP_FIXED to clean the arena.

From http://jemalloc.net/jemalloc.3.html

Cleanup attempts cascade from deallocation to decommit to forced purging to lazy purging
...
A lazy extent purge function (e.g. implemented via madvise(...MADV_FREE)) can delay purging indefinitely and leave the pages within the purged virtual memory range in an indeterminate state, whereas a forced extent purge function immediately purges, and the pages within the virtual memory range will be zero-filled the next time they are accessed

(it should be significantly slower than MADV_DONTNEED and MADV_FREE)

Indeed, and the fact that it became faster after 1b97274 is interesting; I don't understand it for now.

Actually, there are some metrics that may help with understanding this:

  • stats.arenas.<i>.pmuzzy
  • stats.arenas.<i>.pdirty
  • and others

This was my misinterpretation of your above comment. Looks like DONTNEED without DONTNEED_ZEROS is a no-op, and decommit is used instead if one of them is not defined.

It is a little bit trickier than that: there is also JEMALLOC_MAPS_COALESCE, so it will not be a no-op but will use mmap instead.

@akuzm
Contributor Author

akuzm commented Jun 8, 2020

The new test results just don't make sense.
I'll run it again, with more iterations, less frequent profiler, etc.

@akuzm
Contributor Author

akuzm commented Jun 8, 2020

The new test results just don't make sense.
I'll run it again, with more iterations, less frequent profiler, etc.

The jemalloc code paths have changed and are somewhat more frequent in profiles; SoftPageFaults metric variability is the same for at least some queries.

By the way, the most unstable queries (cryptographic_hashes) are due to the query profiler -- half of the traces are in StackTrace.

@alexey-milovidov alexey-milovidov force-pushed the master branch 2 times, most recently from 6c77191 to 09b9a30 Compare June 9, 2020 02:03
@akuzm
Contributor Author

akuzm commented Jun 9, 2020

Let's see what we have with muzzy decay enabled. Perf test result

No real performance or stability changes.

Call graph for madvise:
master
muzzy decay enabled

MADV_DONTNEED is still called a lot, and the total number of madvise calls is higher.

@akuzm
Contributor Author

akuzm commented Jun 9, 2020

Experiment #3: muzzy decay 10 s, MADV_FREE, no MADV_DONTNEED
Result

Some queries are faster, but some are slower. Interestingly, the profiler registered no mmap calls from jemalloc -- not sure how that can be; maybe it uses some unusual wrapper.

Ran website.xml locally and watched top -- both VIRT and RES decrease within seconds after a query finishes, so I guess there is no problem with freeing memory in this configuration.

@azat
Member

azat commented Jun 9, 2020

Experiment #3: muzzy decay 10 s, MADV_FREE, no MADV_DONTNEED

@akuzm AFAIK flamegraphs are built automatically, and they are in the archive -- is it the one you posted in #11401 (comment)? (Tried playing with the SHA in the link -- it did not work.)

@akuzm
Contributor Author

akuzm commented Jun 9, 2020

Experiment #3: muzzy decay 10 s, MADV_FREE, no MADV_DONTNEED

@akuzm AFAIK flamegraphs are built automatically, and they are in the archive -- is it the one you posted in #11401 (comment)? (Tried playing with the SHA in the link -- it did not work.)

Nope, it's a custom thing -- all call stacks with madvise in them. I'm using a script like this to build it from perf test output archive: https://gist.github.com/akuzm/3f06cb15c7093d823554a14f1a9a1fcd
(you can repeat it if you download output.7z).

@azat
Member

azat commented Jun 10, 2020

@akuzm correct me if I'm wrong:

  • master - no lazy purging - ok

  • muzzy decay enabled - half lazy, half forced - ok

    MADV_DONTNEED is still called a lot

    and the total number of madvise calls is higher.

    Indeed, since now it has lazy purging, which is madvise(MADV_FREE)

  • muzzy decay 10 ms, MADV_FREE, no MADV_DONTNEED [1] - only lazy purging - ok

    Interestingly, no mmap calls from jemaloc registered by the profiler -- not even sure how this can be

    Indeed interesting -- and this is because forced purging uses mmap, which is effectively a no-op when overcommit is enabled (and I guess you have the defaults, i.e. heuristic overcommit)

[1]: docker run --rm -it --name ch -v /src/ch/ch-env/scripts:/scripts:ro -e PATH=/usr/bin:/usr/sbin:/bin:/sbin:/scripts -v $PWD:/wrk -w /wrk yandex/clickhouse-server:20.4 bash -c 'apt update; apt install -y perl; analyze-traces.sh'

@akuzm
Contributor Author

akuzm commented Jun 22, 2020

In experiment 2.3, many arithmetic queries are 10-20% slower, but why?
Locally I managed to calculate the same statistics for all metrics from PerformanceEvents in the query log. Here is a list of the metrics that changed significantly for many arithmetic queries, and the amounts of change:

akuzm@akuzm-nix:~/w37$ ~/ch4/build-gcc9-rel/programs/clickhouse-local --query "select count(*) num_queries, median(left) med_old_value, (arrayMap(x->floor(x, 3), quantiles(0., 0.5, 0.9)(diff)) as q)[1] min_diff, q[2] med_diff, q[3] max_diff, metric_name from (select left, right, diff, stat_threshold, metric_name, test, query_index from file('report/query-metric-stats.tsv', TSVWithNamesAndTypes, 'metric_name String, left Float64, right Float64, diff Float64, stat_threshold Float64, test text, query_index int, query_display_name text') where abs(diff) > stat_threshold order by diff desc, metric_name, test, query_index) group by metric_name having abs(med_diff) > 0.05 and num_queries > 10 order by q[2] desc format PrettyCompact"

┌─num_queries─┬─med_old_value─┬─min_diff─┬─med_diff─┬─max_diff─┬─metric_name────────────────────┐
│          13 │           857 │    0.218 │    0.358 │    0.519 │ NetworkSendElapsedMicroseconds │
│          59 │      18280000 │   -0.115 │    0.284 │    0.378 │ UserTimeMicroseconds           │
│          59 │     453138800 │    -0.11 │    0.283 │     0.37 │ PerfBusCycles                  │
│          59 │   51077214000 │   -0.115 │     0.28 │     0.37 │ PerfCpuCycles                  │
│          59 │   38124417000 │    -0.11 │     0.28 │    0.369 │ PerfRefCpuCycles               │
│          57 │   18861773000 │   -0.116 │    0.259 │     0.34 │ PerfTaskClock                  │
│          57 │   18861756000 │   -0.116 │    0.259 │     0.34 │ PerfCpuClock                   │
│          58 │      18900857 │   -0.117 │    0.257 │    0.339 │ OSCPUVirtualTimeMicroseconds   │
│          54 │      276717.5 │   -0.147 │     0.18 │    0.241 │ OSWriteChars                   │
│          51 │      24091740 │   -0.203 │    0.171 │    0.247 │ RealTimeMicroseconds           │
│          51 │         0.587 │   -0.219 │    0.158 │    0.228 │ server_time                    │
│          51 │        0.5882 │   -0.216 │    0.156 │    0.223 │ client_time                    │
│          57 │        544000 │    -0.78 │   -0.621 │   -0.509 │ SystemTimeMicroseconds         │
│          59 │         95995 │       -1 │    -0.96 │   -0.855 │ SoftPageFaults                 │
└─────────────┴───────────────┴──────────┴──────────┴──────────┴────────────────────────────────┘

It just doesn't make any sense, as usual. We see about ten synonyms for query time, measured in terms of various clocks, that are all increasing.
The soft page faults decreased significantly, but this doesn't help.

@akuzm
Contributor Author

akuzm commented Jun 24, 2020

In experiment 2.3, many arithmetic queries are 10-20% slower, but why?

OSWriteChars was the only suspicious metric, so I turned off the query profiler, and now there are almost no changes in performance ¯\_(ツ)_/¯

Results here: https://clickhouse-test-reports.s3.yandex.net/11401/ae1bc3cb27225afe6d20e9beb38cad1e4643049c/performance_comparison/report.html#fail1

But still, a couple of arithmetic queries are slower, with the same nonsensical metrics:

akuzm@akuzm-nix:~/w40$ ~/ch4/build-gcc9-rel/programs/clickhouse-local --query "select left, right, diff, stat_threshold, metric_name, test, query_index from file('report/query-metric-stats.tsv', TSVWithNamesAndTypes, 'metric_name String, left Float64, right Float64, diff Float64, stat_threshold Float64, test text, query_index int, query_display_name text') where not isNaN(stat_threshold) and abs(diff) > stat_threshold and abs(diff) < 1. and test = 'arithmetic' and query_index in (2) order by diff desc, metric_name, test, query_index format PrettyCompact"
┌────────left─┬───────right─┬───diff─┬─stat_threshold─┬─metric_name──────────────────┬─test───────┬─query_index─┐
│    23186466 │    26417636 │  0.139 │          0.121 │ RealTimeMicroseconds         │ arithmetic │           2 │
│       0.563 │       0.637 │  0.131 │          0.108 │ server_time                  │ arithmetic │           2 │
│      0.5647 │      0.6382 │   0.13 │          0.107 │ client_time                  │ arithmetic │           2 │
│    18828000 │    20692000 │  0.099 │          0.095 │ UserTimeMicroseconds         │ arithmetic │           2 │
│ 52697604000 │ 57759285000 │  0.096 │          0.092 │ PerfCpuCycles                │ arithmetic │           2 │
│    19298918 │    20837040 │  0.079 │          0.078 │ OSCPUVirtualTimeMicroseconds │ arithmetic │           2 │
│   736472600 │   743253100 │  0.009 │          0.007 │ PerfCacheReferences          │ arithmetic │           2 │
│       88521 │       88431 │ -0.002 │              0 │ OSReadChars                  │ arithmetic │           2 │
│      404000 │       84000 │ -0.793 │          0.663 │ SystemTimeMicroseconds       │ arithmetic │           2 │
│       74968 │       10121 │ -0.865 │          0.814 │ SoftPageFaults               │ arithmetic │           2 │
└─────────────┴─────────────┴────────┴────────────────┴──────────────────────────────┴────────────┴─────────────┘

@akuzm
Contributor Author

akuzm commented Jun 25, 2020

OSWriteChars was the only suspicious metric, so I turned off the query profiler, and now there are almost no changes in performance ¯\_(ツ)_/¯

Results here: https://clickhouse-test-reports.s3.yandex.net/11401/ae1bc3cb27225afe6d20e9beb38cad1e4643049c/performance_comparison/report.html#fail1

Due to a bug in CI, a second performance test task just finished, compared the same commits, and now 15 arithmetic queries are faster! It's hilarious.

@akuzm
Contributor Author

akuzm commented Jun 26, 2020

OSWriteChars was the only suspicious metric, so I turned off the query profiler, and now there are almost no changes in performance ¯\_(ツ)_/¯
Results here: https://clickhouse-test-reports.s3.yandex.net/11401/ae1bc3cb27225afe6d20e9beb38cad1e4643049c/performance_comparison/report.html#fail1

Due to a bug in CI, a second performance test task just finished, compared the same commits, and now 15 arithmetic queries are faster! It's hilarious.

akuzm@akuzm-nix:~/w41$ ~/ch4/build-gcc9-rel/programs/clickhouse-local --query "select count(*) num_queries, metric_name, (arrayMap(x->floor(x, 3), quantiles(0., 0.5, 1.)(diff)) as q)[1] min, q[2] med, q[3] max from file('report/query-metric-stats.tsv', TSVWithNamesAndTypes, 'metric_name text, left float, right float, diff float, stat_threshold float, test text, query_index int, query_display_name text') where abs(diff) > stat_threshold and abs(diff) > 0.05 group by metric_name having isFinite(q[1]) and abs(q[1]) > 0 and abs(q[1]) < 1 and num_queries > 3 order by q[2] desc format PrettyCompact"

### Run 1
┌─num_queries─┬─metric_name──────────────────┬────min─┬────med─┬────max─┐
│          27 │ UserTimeMicroseconds         │ -0.166 │ -0.068 │  0.152 │
│          29 │ PerfCpuCycles                │ -0.171 │ -0.076 │  0.143 │
│          24 │ OSCPUVirtualTimeMicroseconds │ -0.166 │ -0.101 │  0.115 │
│          23 │ RealTimeMicroseconds         │ -0.173 │ -0.119 │  0.107 │
│          22 │ server_time                  │ -0.173 │ -0.121 │  0.086 │
│          22 │ client_time                  │ -0.174 │ -0.121 │  0.083 │
│          58 │ SystemTimeMicroseconds       │ -0.871 │ -0.742 │ -0.561 │
│          60 │ SoftPageFaults               │ -0.991 │   -0.9 │  -0.77 │
└─────────────┴──────────────────────────────┴────────┴────────┴────────┘

### Run 2
┌─num_queries─┬─metric_name──────────────────┬────min─┬────med─┬────max─┐
│          15 │ PerfCpuCycles                │  0.054 │  0.085 │  0.115 │
│          16 │ UserTimeMicroseconds         │   0.05 │  0.079 │  0.119 │
│           7 │ OSCPUVirtualTimeMicroseconds │  0.061 │  0.078 │  0.104 │
│           4 │ RealTimeMicroseconds         │ -0.102 │  0.039 │  0.138 │
│           4 │ server_time                  │ -0.095 │   0.03 │   0.13 │
│           4 │ client_time                  │ -0.095 │  0.029 │  0.129 │
│          54 │ SystemTimeMicroseconds       │ -0.874 │ -0.799 │ -0.573 │
│          57 │ SoftPageFaults               │ -0.973 │ -0.908 │ -0.718 │
└─────────────┴──────────────────────────────┴────────┴────────┴────────┘

Statistically significant changes in query metrics for these two runs of the same commit. Baffling.

@azat
Member

azat commented Jun 29, 2020

Due to a bug in CI, a second performance test task just finished, compared the same commits, and now 15 arithmetic queries are faster! It's hilarious.

@akuzm which machines are used for performance tests? Virtualization? Maybe there is steal? (BTW, it may be worth adding it to AsynchronousMetrics::update too?)

I doubt this is the cause, but maybe it's worth checking with turbo boost disabled? (/sys/devices/system/cpu/intel_pstate/no_turbo or through MSR)

@akuzm
Contributor Author

akuzm commented Jun 29, 2020

Due to a bug in CI, a second performance test task just finished, compared the same commits, and now 15 arithmetic queries are faster! It's hilarious.

@akuzm which machines are used for performance tests? Virtualization?

No virtualization.

Maybe there is steal? (BTW, it may be worth adding it to AsynchronousMetrics::update too?)

Sure, what exactly to add?

I doubt this is the cause, but maybe it's worth checking with turbo boost disabled? (/sys/devices/system/cpu/intel_pstate/no_turbo or through MSR)

I'll try to check this one. By the way, we now have graphs for CPU frequency in the perf test output archive ('metrics/CpuFrequenceMHz_0.png' etc.); they look very noisy, but I didn't study them in detail.

My current hypothesis is that the changes in arithmetic that are consistent within one test run but change between runs are related to a different memory layout of the Memory table used for these tests. Not sure which metrics could be useful; I'm adding dTLB misses now. Last-level cache misses (PerfCacheMisses) don't seem correlated with it.

@azat
Member

azat commented Jun 29, 2020

Maybe there is steal? (BTW, it may be worth adding it to AsynchronousMetrics::update too?)

Sure, what exactly to add?

steal from /proc/stat; per proc(5) it is the 8th column on the cpu line, available on Linux 2.6.11+
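For reference, here is a minimal sketch of reading that column (the function name is ours; the real code would live in AsynchronousMetrics::update):

```c
#include <stdio.h>

/* Parse the aggregate "cpu" line of /proc/stat and return its 8th
 * value, "steal" (jiffies spent involuntarily waiting while the
 * hypervisor ran someone else), available since Linux 2.6.11 per
 * proc(5). Returns -1 if the line can't be parsed. */
long long read_steal_jiffies(void)
{
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    long long user, nice, sys, idle, iowait, irq, softirq, steal;
    int n = fscanf(f, "cpu %lld %lld %lld %lld %lld %lld %lld %lld",
                   &user, &nice, &sys, &idle, &iowait, &irq, &softirq,
                   &steal);
    fclose(f);
    return n == 8 ? steal : -1;
}
```

On bare metal (as used by these perf tests) the value should simply stay at 0, so exporting it would mostly rule steal out rather than explain the variance.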

By the way, we now have graphs for CPU frequency in perf test output archive

Yeah, I saw this code in AsynchronousMetrics.cpp, but it is a gauge metric, and it will show something only if it differs over a long period of time (long being >10 seconds or so)

but change between runs, are related to different memory layout for Memory table that is used for these test

Even trickier...

@akuzm
Contributor Author

akuzm commented Jul 1, 2020

Merged muzzy_decay_ms = 10s.

@akuzm akuzm closed this Jul 1, 2020
@akuzm akuzm deleted the aku/tweak-jemalloc branch April 1, 2021 17:09
@akuzm akuzm mentioned this pull request Apr 16, 2021