
Conversation

@alexey-milovidov (Member) commented Sep 10, 2020

Changelog category (leave one):

  • Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Attempt to make performance tests more reliable. This is done by remapping the executable memory of the process on the fly with madvise to use transparent huge pages, which can lower the number of iTLB misses, the main source of instability in performance tests.

Detailed description:
The "worst" code I've ever written.

Presentation in Russian: https://presentations.clickhouse.tech/database_saturday_2020/
Video: https://youtu.be/chY9LnhLjco?t=8573
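
For context, the core primitive here is just madvise(MADV_HUGEPAGE) on an anonymous mapping; the hard part of the PR is applying it to the process's own executable memory while it is running (roughly: copy the machine code to a scratch area, jump there, remap the original range as anonymous memory with MADV_HUGEPAGE, copy the code back, and jump back). A minimal standalone sketch of just the primitive (hypothetical, not the PR's code):

#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main()
{
    size_t size = 1 << 26;  /// 64 MiB, room for many 2 MiB huge pages.

    /// Anonymous mapping: not backed by a file, so MADV_HUGEPAGE applies.
    void * mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return perror("mmap"), 1;

    /// Ask the kernel to back this region with transparent huge pages.
    if (0 != madvise(mem, size, MADV_HUGEPAGE))
        return perror("madvise"), 1;

    memset(mem, 1, size);  /// Touch the pages so they are actually allocated.
    puts("check AnonHugePages in /proc/self/smaps while this is running");
    getchar();
}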

@robot-clickhouse added the pr-build (Pull request with build/testing/packaging improvement) label Sep 10, 2020
@alexey-milovidov marked this pull request as draft September 10, 2020 09:15
@amosbird (Collaborator) commented Sep 10, 2020

Will it improve performance in any way?

> use transparent huge pages

IIUC, aren't they just normal huge pages instead?

@alexey-milovidov (Member Author)

> Will it improve performance in any way?

Yes, possibly, but I expect only small gains.

> IIUC, aren't they just normal huge pages instead?

Normal huge pages have very poor usability (they are difficult to set up and to keep configured sanely), and we cannot use them in our CI environment for performance tests.

There are anecdotal stories of people setting up static huge pages on their bare-metal servers, then forgetting about them and leaving the company, while the remaining developers wonder why everything is slow and the memory is not used at all.

@akuzm (Contributor) commented Sep 11, 2020

A dramatic effect on iTLB misses, with no effect on query time (these are CDF-like graphs: X is the relative change of the metric, and Y is the number of queries for which the change is smaller).

[graphs: relative change in iTLB misses and in query run time]

No effect on query run time instability (still 400 unstable queries). Maybe it will be better once the huge-page version is on both sides of the comparison.
https://clickhouse-test-reports.s3.yandex.net/14685/5675efbd47fde50524463a14758c672091264897/performance_comparison/report.html#test-performance-changes

Overall I like this feature: it's very bad indeed, but I would enable it in perf tests, just to have one less thing to worry about.

@alexey-milovidov (Member Author) commented Sep 11, 2020

OK. I see the following levels of integration:

  1. Add it as an option in the config, disabled by default; enable it for integration tests.
  2. Add it as an option in the config, enabled by default.

Let's do 1, and if everything goes well, after a while we can enable it by default.

PS. There are also the following side effects:

  • if the server has no swap, the memory cannot be swapped out, because the mapping is anonymous (not backed by a file);
  • in either case, the mapping is preloaded into memory;
  • the effect is the same as mlock, but no CAP_IPC_LOCK capability is required;
  • automatic protection from editing the binary file on disk (changes won't be reflected in memory).

I will keep the mlock option in the config and add a new option next to it.


__attribute__((__noinline__)) void remapToHugeStep3(void * scratch, size_t size, size_t offset)
{
/// The function should not use the stack; otherwise various optimizations, including "omit-frame-pointer", may break the code.

@alexey-milovidov (Member Author)

Sorry, I did not completely get the question... how to clean up after the jump?

@alexey-milovidov (Member Author) Sep 14, 2020

Probably we also rely on the fact that the function has no prologue.
(It also does not permit a debug build.)
But we could also compile this function without optimizations (using some attribute).
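
As a hypothetical illustration of "some attribute": GCC has a per-function optimize attribute that would pin this one function to -O0 (clang does not support it, so this is only a sketch of the idea):

/// Hypothetical sketch, not the PR's code: under GCC, disable optimizations
/// for a single function, so frame-pointer omission cannot break it.
__attribute__((optimize("O0"), noinline))
static void remap_step_stub()
{
    /// The real function must additionally avoid any stack usage at all.
}

int main()
{
    remap_step_stub();
}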


/// As the memory region is anonymous, we can do madvise with MADV_HUGEPAGE.

syscall_func(SYS_madvise, begin, size, MADV_HUGEPAGE);
@azat (Member) commented Sep 13, 2020

Interesting.

I thought about making it simpler using memfd_create + fexecve; here is a simple PoC:
https://github.com/azat/hugeelfexec#hugeelfexec

Proof of concept:

$ ./hugeelfexec clickhouse local -q 'select sleepEachRow(1) from numbers(10)' &
# wait until the process starts
$ fgrep -i ShmemHugePages /proc/meminfo 
ShmemHugePages:  2607104 kB

But what bothers me is that numa_maps does not reflect this:

$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse local')/numa_maps | grep -c -v 'kernelpagesize_kB=4$'
0

But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).

Compared to your binary:

$ grep AnonHugePages /proc/meminfo
AnonHugePages:    184320 kB
$ grep kernelpagesize_kB /proc/$(pgrep -f 'clickhouse server')/numa_maps | grep -c -v 'kernelpagesize_kB=4$'
0
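
For reference, a minimal sketch of the memfd_create + fexecve idea (hypothetical; the actual PoC is in the repository linked above). It copies the ELF into an anonymous shmem-backed file and executes it from there; with shmem transparent huge pages enabled, the mapped code can then be backed by huge pages, which is why ShmemHugePages grows:

#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

extern char ** environ;

int main(int argc, char * argv[])
{
    if (argc < 2)
        return fprintf(stderr, "usage: %s binary [args...]\n", argv[0]), 1;

    int elf = open(argv[1], O_RDONLY);
    struct stat st {};
    if (elf < 0 || 0 != fstat(elf, &st))
        return perror("open/fstat"), 1;

    /// Anonymous shmem-backed file (this is what shows up as ShmemHugePages).
    int memfd = memfd_create("elf", MFD_CLOEXEC);
    if (memfd < 0)
        return perror("memfd_create"), 1;

    /// Copy the whole ELF into the in-memory file.
    for (off_t offset = 0; offset < st.st_size;)
        if (sendfile(memfd, elf, &offset, st.st_size - offset) < 0)
            return perror("sendfile"), 1;

    /// Execute directly from the file descriptor.
    fexecve(memfd, argv + 1, environ);
    return perror("fexecve"), 1;
}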

@azat (Member)

Also the difference is that mine uses hugepages for the whole binary:

  • this PR: AnonHugePages: 184320 kB
  • mine: ShmemHugePages: 2607104 kB

@azat (Member)

> But I guess that hugepages from shmem (and even anon) are not shown there, only the ones from hugetlbfs (need to verify this though).

Looks like this assumption is correct:

unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
    if (vma->vm_ops && vma->vm_ops->pagesize)
        return vma->vm_ops->pagesize(vma);
    return PAGE_SIZE;
}
  • .pagesize is set by hugetlb_vm_ops
  • hugetlb_vm_ops is used only in hugetlbfs

@alexey-milovidov (Member Author)

Very interesting! I did not know that was possible; I had never heard of fexecve before.

@alexey-milovidov (Member Author)

> Also the difference is that mine uses hugepages for the whole binary

But we don't need to load the debuginfo and symtab into memory. That would waste 3 GiB...

@azat (Member)

> But we don't need to load the debuginfo and symtab into memory. That would waste 3 GiB...

Indeed, kind of a problem, but production binaries store debug symbols separately, so it should not be a problem there.

@alexey-milovidov (Member Author)

There are also some usages of /proc/self/exe across the code.

@alexey-milovidov marked this pull request as ready for review September 14, 2020 18:09
@alexey-milovidov merged commit 018f596 into master Sep 15, 2020
@alexey-milovidov deleted the remap-executable branch September 15, 2020 12:09
@alexey-milovidov (Member Author)

Now we have only 176 unstable queries (was 418), so maybe it has helped
(or other recent attempts to lower binary size have helped).

@akuzm (Contributor) commented Sep 21, 2020

> Now we have only 176 unstable queries (was 418), so maybe it has helped
> (or other recent attempts to lower binary size have helped).

I think it was this one: #14883
(I increased the instability thresholds for 30 tests, based on historical data.)

azat added a commit to azat/ClickHouse that referenced this pull request Nov 19, 2021
Some CPUs are affected by the iTLB multihit bug [1], and the cost of mitigating
it in software is page faults.

Since according to [1]:

    In order to mitigate the vulnerability, KVM initially marks all huge
    pages as non-executable. If the guest attempts to execute in one of
    those pages, the page is broken down into 4K pages, which are then
    marked executable

  [1]: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/multihit.html

And in the case of failures of prewarm queries, I see lots of SoftPageFaults [2]:

    $ clickhouse-local --input-format TSVWithNamesAndTypes --file left-query-log.tsv --structure "$(cat left-query-log.tsv.columns | sed "s/\\\'/'/g")" -q "select query_id, ProfileEvents['SoftPageFaults'] from table where query_duration_ms >= 15e3 and query_id not like '%-%-%' /* uuid */"  | column -t
    trim_numbers.query5.prewarm0        486
    avg_weighted.query7.prewarm0        986
    great_circle_dist.query1.prewarm0   1292
    hashed_dictionary.query10.prewarm0  654664
    random_string.query1.prewarm0       10091
    array_fill.query4.prewarm0          341801
    window_functions.query5.prewarm0    46230

  [2]: https://clickhouse-test-reports.s3.yandex.net/30882/54c89e0f0e9b7b18ab40e755805c115a462a6669/performance_comparison/report.html#fail1

And yes, Intel Xeon Gold 6230R [3] is vulnerable to iTLB multihit [4].

  [3]: https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html
  [4]: https://openbenchmarking.org/s/Intel%20Xeon%20Gold%206230R

NOTE: you should not look at openbenchmarking.org for "Intel Xeon E5-2660 v4" [5],
      since apparently lscpu was old there, so the bugs were not reported and hence not parsed

  [5]: https://ark.intel.com/content/www/us/en/ark/products/91772/intel-xeon-processor-e52660-v4-35m-cache-2-00-ghz.html

Refs: ClickHouse#14685
Cc: @alexey-milovidov
