
Conversation

@alexey-milovidov (Member) commented Sep 10, 2020

Changelog category (leave one):

  • Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Attempt to make performance tests more reliable. This is done by remapping the executable memory of the process on the fly with madvise to use transparent huge pages, which can lower the number of iTLB misses, the main source of instability in performance tests.

Detailed description:
The "worst" code I've ever written.

Presentation in Russian: https://presentations.clickhouse.tech/database_saturday_2020/
Video: https://youtu.be/chY9LnhLjco?t=8573
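
For context, the core primitive here is just madvise(MADV_HUGEPAGE) on an anonymous mapping; the hard part of the PR is applying it to the process's own executable memory while it is running (roughly: copy the machine code to a scratch area, jump there, remap the original range as anonymous memory with MADV_HUGEPAGE, copy the code back, and jump back). A minimal standalone sketch of just the primitive (hypothetical, not the PR's code):

#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main()
{
    size_t size = 1 << 26;  /// 64 MiB, room for many 2 MiB huge pages.

    /// Anonymous mapping: not backed by a file, so MADV_HUGEPAGE applies.
    void * mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return perror("mmap"), 1;

    /// Ask the kernel to back this region with transparent huge pages.
    if (0 != madvise(mem, size, MADV_HUGEPAGE))
        return perror("madvise"), 1;

    memset(mem, 1, size);  /// Touch the pages so they are actually allocated.
    puts("check AnonHugePages in /proc/self/smaps while this is running");
    getchar();
}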

@robot-clickhouse added the pr-build (Pull request with build/testing/packaging improvement) label Sep 10, 2020
@alexey-milovidov marked this pull request as draft September 10, 2020 09:15
@amosbird (Collaborator) commented Sep 10, 2020

Will it improve performance in any way?

> use transparent huge pages

IIUC, aren't they just normal huge pages instead?

@alexey-milovidov (Member Author)

> Will it improve performance in any way?

Yes, possibly, but I expect only small gains.

> IIUC, aren't they just normal huge pages instead?

Normal huge pages have very poor usability (they are difficult to set up and to keep configured sanely), and we cannot use them in our CI environment for performance tests.

There are anecdotal stories of people setting up static huge pages on their bare-metal servers, then forgetting about them and leaving the company, while the remaining developers wonder why everything is slow and the memory is not used at all.

@akuzm (Contributor) commented Sep 11, 2020

A dramatic effect on iTLB misses, with no effect on query time (these are CDF-like graphs: X is the relative change of the metric, and Y is the number of queries for which the change is smaller).

[graphs: relative change in iTLB misses and in query run time]

No effect on query run time instability (still 400 unstable queries). Maybe it will be better once the huge-page version is on both sides of the comparison.
https://clickhouse-test-reports.s3.yandex.net/14685/5675efbd47fde50524463a14758c672091264897/performance_comparison/report.html#test-performance-changes

Overall I like this feature: it's very bad indeed, but I would enable it in perf tests, just to have one less thing to worry about.

@alexey-milovidov (Member Author) commented Sep 11, 2020

OK. I see the following levels of integration:

  1. Add it as an option in the config, disabled by default; enable it for integration tests.
  2. Add it as an option in the config, enabled by default.

Let's do 1, and if everything goes well, after a while we can enable it by default.

PS. There are also the following side effects:

  • if the server has no swap, the memory cannot be swapped out, because the mapping is anonymous (not backed by a file);
  • in either case, the mapping is preloaded into memory;
  • the effect is the same as mlock, but no CAP_IPC_LOCK capability is required;
  • automatic protection from editing the binary file on disk (changes won't be reflected in memory).

I will keep the mlock option in the config and add a new option next to it.


__attribute__((__noinline__)) void remapToHugeStep3(void * scratch, size_t size, size_t offset)
{
/// The function should not use the stack; otherwise various optimizations, including "omit-frame-pointer", may break the code.

@alexey-milovidov (Member Author)

Sorry, I did not completely get the question... how to clean up after the jump?

@alexey-milovidov (Member Author) Sep 14, 2020

Probably we also rely on the fact that the function has no prologue.
(It also does not permit a debug build.)
But we could also compile this function without optimizations (using some attribute).
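
As a hypothetical illustration of "some attribute": GCC has a per-function optimize attribute that would pin this one function to -O0 (clang does not support it, so this is only a sketch of the idea):

/// Hypothetical sketch, not the PR's code: under GCC, disable optimizations
/// for a single function, so frame-pointer omission cannot break it.
__attribute__((optimize("O0"), noinline))
static void remap_step_stub()
{
    /// The real function must additionally avoid any stack usage at all.
}

int main()
{
    remap_step_stub();
}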


/// As the memory region is anonymous, we can do madvise with MADV_HUGEPAGE.

syscall_func(SYS_madvise, begin, size, MADV_HUGEPAGE);
@azat (Member) commented Sep 13, 2020

Interesting.

I thought about making it simpler using memfd_create + fexecve; here is a simple PoC:
https://github.com/azat/hugeelfexec#hugeelfexec

Proof of concept:

$ ./hugeelfexec clickhouse local -q 'select sleepEachRow(1) from numbers(10)' &
# wait until the process starts
$ fgrep -i ShmemHugePages /proc/meminfo 
ShmemHugePages:  2607104 kB

But what bothers me is that numa_maps does not reflect this:

$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse local')/numa_maps | grep -c -v 'kernelpagesize_kB=4$'
0

But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).

Compared to your binary:

$ grep AnonHugePages /proc/meminfo
AnonHugePages:    184320 kB
$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse server')/numa_maps | grep -c -v 'kernelpagesize_kB=4$'
0
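
For reference, a minimal sketch of the memfd_create + fexecve idea (hypothetical; the actual PoC is in the repository linked above). It copies the ELF into an anonymous shmem-backed file and executes it from there; with shmem transparent huge pages enabled, the mapped code can then be backed by huge pages, which is why ShmemHugePages grows:

#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

extern char ** environ;

int main(int argc, char * argv[])
{
    if (argc < 2)
        return fprintf(stderr, "usage: %s binary [args...]\n", argv[0]), 1;

    int elf = open(argv[1], O_RDONLY);
    struct stat st {};
    if (elf < 0 || 0 != fstat(elf, &st))
        return perror("open/fstat"), 1;

    /// Anonymous shmem-backed file (this is what shows up as ShmemHugePages).
    int memfd = memfd_create("elf", MFD_CLOEXEC);
    if (memfd < 0)
        return perror("memfd_create"), 1;

    /// Copy the whole ELF into the in-memory file.
    for (off_t offset = 0; offset < st.st_size;)
        if (sendfile(memfd, elf, &offset, st.st_size - offset) < 0)
            return perror("sendfile"), 1;

    /// Execute directly from the file descriptor.
    fexecve(memfd, argv + 1, environ);
    return perror("fexecve"), 1;
}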

@azat (Member)

Also the difference is that mine uses hugepages for the whole binary:

  • this PR: AnonHugePages: 184320 kB
  • mine: ShmemHugePages: 2607104 kB

@azat (Member)

> But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).

Looks like this assumption is correct:

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
    if (vma->vm_ops && vma->vm_ops->pagesize)
        return vma->vm_ops->pagesize(vma);
    return PAGE_SIZE;
}
  • .pagesize is set by hugetlb_vm_ops
  • hugetlb_vm_ops is used only in hugetlbfs

@alexey-milovidov (Member Author)

Very interesting! I did not know that was possible; I had never heard of fexecve before.

@alexey-milovidov (Member Author)

> Also the difference is that mine uses hugepages for the whole binary

But we don't need to load the debuginfo and symtab into memory. That would waste 3 GiB...

@azat (Member)

> But we don't need to load the debuginfo and symtab into memory. That would waste 3 GiB...

Indeed, kind of a problem, but production binaries store debug symbols separately, so it should not be a problem there.

@alexey-milovidov (Member Author)

There are also some usages of /proc/self/exe across the code.

@alexey-milovidov marked this pull request as ready for review September 14, 2020 18:09
@alexey-milovidov merged commit 018f596 into master Sep 15, 2020
@alexey-milovidov deleted the remap-executable branch September 15, 2020 12:09
@alexey-milovidov (Member Author)

Now we have only 176 unstable queries (was 418), so maybe it has helped
(or other recent attempts to lower binary size have helped).

@akuzm (Contributor) commented Sep 21, 2020

> Now we have only 176 unstable queries (was 418), so maybe it has helped
> (or other recent attempts to lower binary size have helped).

I think it was this one: #14883
(I increased the instability thresholds for 30 tests, based on historical data.)

azat added a commit to azat/ClickHouse that referenced this pull request Nov 19, 2021
Some CPUs are affected by the iTLB multihit bug [1], and the cost of mitigating
it in software is page faults.

Since according to [1]:

    In order to mitigate the vulnerability, KVM initially marks all huge
    pages as non-executable. If the guest attempts to execute in one of
    those pages, the page is broken down into 4K pages, which are then
    marked executable

  [1]: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/multihit.html

And in the case of failures of prewarm queries, I see lots of SoftPageFaults [2]:

    $ clickhouse-local --input-format TSVWithNamesAndTypes --file left-query-log.tsv --structure "$(cat left-query-log.tsv.columns | sed "s/\\\'/'/g")" -q "select query_id, ProfileEvents['SoftPageFaults'] from table where query_duration_ms >= 15e3 and query_id not like '%-%-%' /* uuid */"  | column -t
    trim_numbers.query5.prewarm0        486
    avg_weighted.query7.prewarm0        986
    great_circle_dist.query1.prewarm0   1292
    hashed_dictionary.query10.prewarm0  654664
    random_string.query1.prewarm0       10091
    array_fill.query4.prewarm0          341801
    window_functions.query5.prewarm0    46230

  [2]: https://clickhouse-test-reports.s3.yandex.net/30882/54c89e0f0e9b7b18ab40e755805c115a462a6669/performance_comparison/report.html#fail1

And yes, Intel Xeon Gold 6230R [3] is vulnerable to iTLB multihit [4].

  [3]: https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html
  [4]: https://openbenchmarking.org/s/Intel%20Xeon%20Gold%206230R

NOTE: you should not look at openbenchmarking.org for "Intel Xeon E5-2660 v4" [5],
      since apparently lscpu was old there, so the bugs were not reported and hence not parsed

  [5]: https://ark.intel.com/content/www/us/en/ark/products/91772/intel-xeon-processor-e52660-v4-35m-cache-2-00-ghz.html

Refs: ClickHouse#14685
Cc: @alexey-milovidov
