azat commented Nov 19, 2021

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Some CPUs are affected by the iTLB multihit bug [1], and the cost of mitigating
it in software is a page fault.

According to [1]:

    In order to mitigate the vulnerability, KVM initially marks all huge
    pages as non-executable. If the guest attempts to execute in one of
    those pages, the page is broken down into 4K pages, which are then
    marked executable

  [1]: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/multihit.html

And for the prewarm queries that failed, I see lots of SoftPageFaults [2]:

    $ clickhouse-local --input-format TSVWithNamesAndTypes --file left-query-log.tsv --structure "$(cat left-query-log.tsv.columns | sed "s/\\\'/'/g")" -q "select query_id, ProfileEvents['SoftPageFaults'] from table where query_duration_ms >= 15e3 and query_id not like '%-%-%' /* uuid */"  | column -t
    trim_numbers.query5.prewarm0        486
    avg_weighted.query7.prewarm0        986
    great_circle_dist.query1.prewarm0   1292
    hashed_dictionary.query10.prewarm0  654664
    random_string.query1.prewarm0       10091
    array_fill.query4.prewarm0          341801
    window_functions.query5.prewarm0    46230

  [2]: https://clickhouse-test-reports.s3.yandex.net/30882/54c89e0f0e9b7b18ab40e755805c115a462a6669/performance_comparison/report.html#fail1

And yes, Intel Xeon Gold 6230R [3] is vulnerable to iTLB multihit [4].

  [3]: https://ark.intel.com/content/www/us/en/ark/products/192437/intel-xeon-gold-6230-processor-27-5m-cache-2-10-ghz.html
  [4]: https://openbenchmarking.org/s/Intel%20Xeon%20Gold%206230R

NOTE: do not look at openbenchmarking.org for "Intel Xeon E5-2660 v4" [5],
      since apparently the lscpu used there was old, so the bugs were not
      reported and hence not parsed.

  [5]: https://ark.intel.com/content/www/us/en/ark/products/91772/intel-xeon-processor-e52660-v4-35m-cache-2-00-ghz.html

Refs: #14685
Refs: #31063 (comment)
Cc: @alexey-milovidov
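
For anyone who wants to check a particular host, here is a minimal sketch that reads the kernel's own report for this bug. It assumes a Linux kernel recent enough to expose the `itlb_multihit` file under `/sys/devices/system/cpu/vulnerabilities`; the helper names `itlb_multihit_status` and `is_affected` are made up for illustration and are not part of ClickHouse or the perf test tooling.

```python
from pathlib import Path
from typing import Optional

# Assumption: the running kernel exposes per-vulnerability status files here.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def itlb_multihit_status(vuln_dir: Path = VULN_DIR) -> Optional[str]:
    """Return the kernel's iTLB multihit status line, or None if unavailable."""
    try:
        return (vuln_dir / "itlb_multihit").read_text().strip()
    except OSError:
        # Older kernel, non-Linux system, or restricted sysfs.
        return None


def is_affected(status: Optional[str]) -> bool:
    """Treat any known status other than "Not affected" as affected."""
    return status is not None and status != "Not affected"


if __name__ == "__main__":
    status = itlb_multihit_status()
    print("status:", status)
    print("affected:", is_affected(status))
```

A `None` status only means the kernel did not report anything, so it should not be read as "not affected".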
robot-clickhouse added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Nov 19, 2021
@alexey-milovidov

If it helps, we can remove "remap executable" altogether.

azat commented Nov 19, 2021

> If it helps, we can remove "remap executable" altogether.

Agreed, but this is just an attempt.
Also, to be sure about this change, the perf tests need to be run on an Intel Xeon Gold CPU.

azat commented Nov 19, 2021

@mergify update (an attempt to run perf tests on Intel Xeon Gold CPU)

mergify bot commented Nov 19, 2021

update (an attempt to run perf tests on Intel Xeon Gold CPU)

☑️ Nothing to do

Details
  • -closed [:pushpin: update requirement]
  • #commits-behind>0 [:pushpin: update requirement]

Hey, I reacted but my real name is @Mergifyio

azat commented Nov 20, 2021

Fast test — Cannot fetch submodules

2021-11-20 10:22:49 fatal: unable to access 'https://github.com/facebook/zstd.git/': The requested URL returned error: 503
2021-11-20 10:22:49 Fetched in submodule path 'contrib/zstd', but it did not contain a488ba114ec17ea1054b9057c26a046fc122b3b6. Direct fetching of that commit failed.

Marker check — Fast Test has failed

Can someone add a force-test label please?

azat commented Nov 20, 2021

@mergify update (an attempt to run perf tests on Intel Xeon Gold CPU)

mergify bot commented Nov 20, 2021

update (an attempt to run perf tests on Intel Xeon Gold CPU)

✅ Branch has been successfully updated


mergify bot commented Nov 20, 2021

update (an attempt to run perf tests on Intel Xeon Gold CPU)

❌ Base branch update has failed

Details

expected head sha didn’t match current head ref.
err-code: D4300


azat commented Nov 21, 2021

Lots of failures in the performance tests are due to an issue that has been fixed in #31565

azat commented Nov 21, 2021

@mergify update (an attempt to run perf tests on Intel Xeon Gold CPU)

mergify bot commented Nov 21, 2021

update (an attempt to run perf tests on Intel Xeon Gold CPU)

✅ Branch has been successfully updated


azat commented Nov 21, 2021

@Mergifyio update (an attempt to run perf tests on Intel Xeon Gold CPU)

mergify bot commented Nov 21, 2021

update (an attempt to run perf tests on Intel Xeon Gold CPU)

✅ Branch has been successfully updated

azat commented Nov 22, 2021

Performance — 3 errors, 1 faster, 4 unstable

Still an issue: parallel_final.query5.prewarm0 fails, taking more than 15 seconds (note that the setting was applied to both servers, since it was done via XML).

| cpu | query_id | time (ms) |
|---|---|---|
| Gold (this PR) | parallel_final.query5.prewarm0 | 15251 |
| E5 (some upstream) | parallel_final.query5.prewarm0 | 2172 |
| Gold (30882) | parallel_final.query5.prewarm0 | 3313 |
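
To put these numbers side by side, here is a small sketch that computes how much slower the run from this PR is than the other two, and checks each against the 15 s limit used in the query-log filter (`query_duration_ms >= 15e3`). The timings are copied from the rows above and assumed to be milliseconds; the dictionary keys and the `slowdown` helper are illustrative names only.

```python
# Timings for parallel_final.query5.prewarm0, copied from the rows above.
# Assumption: values are in milliseconds (15251 matches the ">15sec" failure).
TIMINGS_MS = {
    "gold_this_pr": 15251,
    "e5_upstream": 2172,
    "gold_30882": 3313,
}

# Same threshold as the `query_duration_ms >= 15e3` filter in the description.
LIMIT_MS = 15_000


def slowdown(run: str, baseline: str, timings=TIMINGS_MS) -> float:
    """Return how many times slower `run` is compared to `baseline`."""
    return timings[run] / timings[baseline]


if __name__ == "__main__":
    for name, ms in TIMINGS_MS.items():
        print(f"{name}: {ms} ms, over limit: {ms >= LIMIT_MS}")
    print(f"vs E5 upstream: {slowdown('gold_this_pr', 'e5_upstream'):.2f}x")
    print(f"vs Gold 30882:  {slowdown('gold_this_pr', 'gold_30882'):.2f}x")
```

So the Gold run with this PR is roughly 7x slower than the E5 run and about 4.6x slower than the earlier Gold run, and it is the only one of the three over the limit.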

azat closed this Nov 22, 2021
azat deleted the perf-spikes-v21.12 branch November 19, 2022 11:54