Attempt to make performance test more reliable #14685
Conversation
Will it improve performance in any way?
IIUC, aren't they just normal huge pages instead?
Yes, it is possible, but I expect only small gains.
Normal huge pages have very poor usability (they are very difficult to set up and to keep configured sanely), and we cannot use them in our CI environment for performance tests. There are anecdotal stories of people setting up static huge pages on their baremetal servers, then forgetting about it and leaving the company, and other developers wondering why everything is slow and the memory is not used at all.
A dramatic effect on iTLB misses, with no effect on query time (these are CDF-like graphs: X is the relative change of the metric, and Y is the number of queries in which the change is less than X). No effect on query run time instability (still 400 unstable queries). Maybe it will be better when the huge page version is on both sides. Overall I like this feature -- it's very bad indeed, but I would enable it in perf tests, just to have one less thing to worry about.
Ok. I see the following levels of integration:
Let's do 1, and if everything goes well, after a while we can enable it by default. PS. There are also the following side effects:
I will keep the mlock option in config and add a new option near it.
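For illustration only, the server config might then look something like this (the option name remap_executable is my guess at the new option next to the existing mlock one; it is not confirmed by this excerpt):

```xml
<!-- Sketch of server config.xml; option names are assumptions. -->
<mlock_executable>true</mlock_executable>
<remap_executable>true</remap_executable>
```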
__attribute__((__noinline__)) void remapToHugeStep3(void * scratch, size_t size, size_t offset)
{
    /// The function should not use the stack, otherwise various optimizations, including "omit-frame-pointer" may break the code.
Just curious: is that why we can't clean up after the jump? It's like https://github.com/ClickHouse/ClickHouse/pull/14685/files#diff-aec223e38b51ea8306ee7a57c508315dR188.
Sorry, I did not completely get the question... what do you mean by "clean up after jump"?
Probably we also rely on the fact that the function does not have a prologue.
(It also does not permit debug builds.)
But we can also compile this function without optimizations (using some attribute).
/// As the memory region is anonymous, we can do madvise with MADV_HUGEPAGE.
syscall_func(SYS_madvise, begin, size, MADV_HUGEPAGE);
Interesting.
I thought about making it simpler using memfd_create + fexecve, and here is a simple PoC:
https://github.com/azat/hugeelfexec#hugeelfexec
Proof of concept:
$ ./hugeelfexec clickhouse local -q 'select sleepEachRow(1) from numbers(10)' &
# wait until the process starts
$ fgrep -i ShmemHugePages /proc/meminfo
ShmemHugePages: 2607104 kB
But what bothers me is that numa_maps does not reflect this:
$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse local')/numa_maps | grep -v kernelpagesize_kB=4$ -c
0
But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).
Compared to your binary:
$ grep AnonHugePages /proc/meminfo
AnonHugePages: 184320 kB
$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse server')/numa_maps | grep -v kernelpagesize_kB=4$ -c
0
Also the difference is that mine uses hugepages for the whole binary:
- this PR: AnonHugePages: 184320 kB
- mine: ShmemHugePages: 2607104 kB
But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).
Looks like this assumption is correct:
- vma_kernel_pagesize is used for printing info into numa_maps:
unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->pagesize)
		return vma->vm_ops->pagesize(vma);
	return PAGE_SIZE;
}
- hugetlb_vm_ops is used only in hugetlbfs
Very interesting! I did not know that this is possible; I had never heard about fexecve before.
Also the difference is that mine uses hugepages for the whole binary
But we don't need to load the debuginfo and symtab into memory. It would waste 3 GiB...
But we don't need to load debuginfo, symtab into memory. It will waste 3 GiB...
Indeed, kind of a problem, but production binaries store debug symbols separately, so it should not be a problem there.
There are also some usages of /proc/self/exe across the code.
Now we have only 176 unstable queries (was 418), so maybe it has helped.
I think it was this one: #14883
Some CPUs are affected by the iTLB multihit bug [1], and the cost of mitigating
it in software is page faults.
According to [1]:
In order to mitigate the vulnerability, KVM initially marks all huge
pages as non-executable. If the guest attempts to execute in one of
those pages, the page is broken down into 4K pages, which are then
marked executable
[1]: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/multihit.html
And in the case of failed prewarm queries I see lots of SoftPageFaults [2]:
$ clickhouse-local --input-format TSVWithNamesAndTypes --file left-query-log.tsv --structure "$(cat left-query-log.tsv.columns | sed "s/\\\'/'/g")" -q "select query_id, ProfileEvents['SoftPageFaults'] from table where query_duration_ms >= 15e3 and query_id not like '%-%-%' /* uuid */" | column -t
trim_numbers.query5.prewarm0 486
avg_weighted.query7.prewarm0 986
great_circle_dist.query1.prewarm0 1292
hashed_dictionary.query10.prewarm0 654664
random_string.query1.prewarm0 10091
array_fill.query4.prewarm0 341801
window_functions.query5.prewarm0 46230
[2]: https://clickhouse-test-reports.s3.yandex.net/30882/54c89e0f0e9b7b18ab40e755805c115a462a6669/performance_comparison/report.html#fail1
And yes, Intel Xeon Gold 6230R [3] is vulnerable to iTLB multihit [4].
[3]: https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html
[4]: https://openbenchmarking.org/s/Intel%20Xeon%20Gold%206230R
NOTE: you should not look at openbenchmarking.org for "Intel Xeon E5-2660 v4" [5],
since apparently lscpu was old there, and the bugs were not reported and hence not parsed.
[5]: https://ark.intel.com/content/www/us/en/ark/products/91772/intel-xeon-processor-e52660-v4-35m-cache-2-00-ghz.html
Refs: ClickHouse#14685
Cc: @alexey-milovidov


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Attempt to make performance test more reliable. It is done by remapping the executable memory of the process on the fly with madvise to use transparent huge pages - it can lower the number of iTLB misses, which is the main source of instability in performance tests.
Detailed description:
The "worst" code I've ever written.
Presentation in Russian: https://presentations.clickhouse.tech/database_saturday_2020/
Video: https://youtu.be/chY9LnhLjco?t=8573