ColumnVector: optimize filter with AVX512VBMI2 compress store by guowangy · Pull Request #39633 · ClickHouse/ClickHouse

guowangy · 2022-07-27T05:57:35Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

ColumnVector: optimize filter with AVX512VBMI2 compress store

The patch set gives about a 6% performance gain on SSB (SF=100) queries 3.1, 3.2, and 3.3
(tested on Ice Lake: 2-socket Xeon 8380).
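
For readers unfamiliar with compress store, here is a minimal scalar sketch of the idea, not the code from this PR. A masked compress store writes only the lanes whose mask bit is set, packed contiguously, so a filter can process one block of rows per mask without a per-row branch on the output side. The names `compressStoreScalar`, `filterBlocks`, and `BLOCK` are hypothetical, the block size of 64 is an assumption, and the sketch assumes `filter` has the same length as `data`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a masked compress store: copy the lanes of `src` whose
// mask bit is set into `dst`, packed contiguously. Returns the number written.
template <typename T>
size_t compressStoreScalar(T * dst, uint64_t mask, const T * src, size_t lanes)
{
    size_t written = 0;
    for (size_t lane = 0; lane < lanes; ++lane)
        if (mask & (uint64_t{1} << lane))
            dst[written++] = src[lane];
    return written;
}

// Keep the rows of `data` whose filter byte is non-zero, one block at a time.
// Assumes filter.size() == data.size().
template <typename T>
std::vector<T> filterBlocks(const std::vector<T> & data, const std::vector<uint8_t> & filter)
{
    constexpr size_t BLOCK = 64;        // hypothetical: one mask bit per row
    std::vector<T> result(data.size()); // worst case: every row is kept
    size_t offset = 0;

    for (size_t pos = 0; pos < data.size(); pos += BLOCK)
    {
        const size_t lanes = std::min(BLOCK, data.size() - pos);

        // Build the bit mask for this block from the byte filter.
        uint64_t mask = 0;
        for (size_t lane = 0; lane < lanes; ++lane)
            mask |= static_cast<uint64_t>(filter[pos + lane] != 0) << lane;

        offset += compressStoreScalar(result.data() + offset, mask, data.data() + pos, lanes);
    }

    result.resize(offset); // drop the unused tail
    return result;
}
```

In a vectorized version the inner loop of `compressStoreScalar` would be replaced by a single masked store instruction; the surrounding bookkeeping (mask construction, output offset) stays the same.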

src/Columns/ColumnVector.cpp

src/Common/TargetSpecific.h

rschu1ze · 2022-08-01T07:49:35Z

Thanks for the adjustments.

Dockerhub infrastructure is shaky today ... I restarted the CI builds.

We had a problem a few weeks ago that AVX512 code could not be tested in CI because we lacked machines with AVX512 capabilities. That should be fixed by now.
@Felixoid Could you briefly confirm?

Felixoid · 2022-08-01T11:21:06Z

we have r5, c5 and m5 instances in the stress-testers group. Are we speaking about performance tests RN?

rschu1ze · 2022-08-01T11:30:15Z

Stress tests on nodes with AVX-512 are what I had in mind.

Performance tests on nodes with AVX-512 are a different story. These are nice-to-have IMHO but I guess we want to run performance tests on a pool of machines with exactly the same specs. Otherwise, finding performance regressions/improvements in PRs like this becomes playing lottery.

Felixoid · 2022-08-01T11:39:42Z

> These are nice-to-have IMHO but I guess we want to run performance tests on a pool of machines with exactly the same specs.

A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)

rschu1ze · 2022-08-01T11:49:16Z

> A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)

I know. What I was saying is that if there is a regression in AVX-512 code, then it can only be reliably detected if all machines in the performance pool are AVX-512-enabled.

Felixoid · 2022-08-01T13:05:27Z

> > A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)
>
> I know. What I was saying is that if there is a regression in AVX-512 code, then it can only be reliably detected if all machines in the performance pool are AVX-512-enabled.

They do, the following lines are from the pages above

C5 instances provide support for the new Intel Advanced Vector Extensions 512 (AVX-512)
M5 instances provide support for the Intel Advanced Vector Extensions 512 (AVX-512)
R5 instances provide support for the Intel Advanced Vector Extensions 512 (AVX-512)

alexey-milovidov · 2022-08-08T02:04:57Z

@rschu1ze please resubmit #39895 (comment)

guowangy · 2022-08-08T09:05:20Z

src/Columns/ColumnVector.cpp

+        /// to avoid calling resize too frequently, resize to reserve buffer.
+        if (reserve_size - current_offset < SIMD_BYTES)
+        {
+            reserve_size += alloc_size;
+            res_data.resize(reserve_size);
+            alloc_size *= 2;
+        }
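
As a rough illustration of the growth logic in the diff above, here is a standalone sketch using `std::vector` in place of ClickHouse's container. The output buffer is extended only when fewer than one block of free slots remains, and each extension is twice as large as the previous one, so the number of `resize` calls grows logarithmically with the output size. The function name `filterWithReserve` and the initial step of `SIMD_BYTES * 2` are assumptions for the example; the inner loop is scalar, and `filter` is assumed to have the same length as `data`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keep the rows of `data` whose filter byte is non-zero, growing the output
// in geometrically increasing steps rather than once per block.
template <typename T>
std::vector<T> filterWithReserve(const std::vector<T> & data, const std::vector<uint8_t> & filter)
{
    constexpr size_t SIMD_BYTES = 64;   // rows handled per block, as in the diff
    size_t alloc_size = SIMD_BYTES * 2; // hypothetical initial growth step
    size_t reserve_size = 0;
    size_t current_offset = 0;
    std::vector<T> res_data;

    for (size_t pos = 0; pos < data.size(); pos += SIMD_BYTES)
    {
        // Make sure a full block fits before writing it.
        if (reserve_size - current_offset < SIMD_BYTES)
        {
            reserve_size += alloc_size;
            res_data.resize(reserve_size);
            alloc_size *= 2;
        }

        const size_t end = std::min(pos + SIMD_BYTES, data.size());
        for (size_t i = pos; i < end; ++i)
            if (filter[i])
                res_data[current_offset++] = data[i];
    }

    res_data.resize(current_offset); // drop the unused tail
    return res_data;
}
```

Because every growth step is at least `SIMD_BYTES` slots, a single check before each block is enough to guarantee that the block's writes stay inside the buffer.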


@alexey-milovidov @rschu1ze
I am able to reproduce #39895 when building and running the unit tests with MSan.

The problem happens here: we reserve a buffer in advance and then write data through the pointer with AVX-512 instructions. MSan complains about this.
If I replace Line 550 with an explicit fill to touch the memory, the problem is gone:

res_data.resize_fill(reserve_size, static_cast<T>(0));

Should I disable it in MEMORY_SANITIZER mode like LZ4_decompress_faster.cpp#L279?

rschu1ze · 2022-08-08

For completeness, I'll paste my minimal repro, but you were obviously quicker 😄

Explicit initialization of allocated memory via resize_fill adds unnecessary performance overhead. So yes, I would favor !defined(MEMORY_SANITIZER) like in the file you mentioned. What about something like:

    #if defined(MEMORY_SANITIZER)
        res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
    #else
        res_data.resize(reserve_size);
    #endif

    TEST(ColumnSparse, Filter)
    {
        const size_t rows = 1000;

        auto col_src = ColumnVector<UInt64>::create();
        for (size_t i = 0; i < rows; ++i)
            col_src->getData().push_back(1);

        PaddedPODArray<UInt8> filter(rows);
        for (size_t i = 0; i < rows; ++i)
            filter[i] = i % 2 == 0;

        auto col_dst = col_src->filter(filter, -1);

        if (col_dst->compareAt(0, 0, *col_dst, 0) != 0)
        {
            throw Exception(error_code, "Columns are unequal");
        }
    }

guowangy · 2022-08-08

That sounds good, since we can then still test the other parts under MSan.

> #if defined(MEMORY_SANITIZER)
>     res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
> #else
>     res_data.resize(reserve_size);
> #endif

But the compiler will raise a -Wembedded-directive warning for the #if defined(...) because the code sits inside macro arguments. We may need to move it into an inline function to avoid the warning (like ColumnVector.cpp#L480 in this PR).

rschu1ze · 2022-08-08

Sounds good. Let's make that one-liner an inlineable function (like blsr).
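
The resolution agreed on in this exchange could look roughly like the sketch below. It is an illustration under assumptions, not the code that was merged: `resizeForFilter`, `DemoArray`, `DEMO_WRAP`, and `demoResize` are hypothetical names, and `DemoArray` only mimics the `resize` / `resize_fill` pair mentioned in the thread. The point is that the `#if` lives inside a function body, so a call site placed inside macro arguments contains no preprocessor directive and cannot trigger `-Wembedded-directive`.

```cpp
#include <cstddef>
#include <vector>

// Minimal stand-in for a container offering both resize() and resize_fill().
// Hypothetical; it only mimics the two calls used in the thread.
template <typename T>
struct DemoArray
{
    std::vector<T> storage;
    void resize(size_t n) { storage.resize(n); }
    void resize_fill(size_t n, T value) { storage.resize(n, value); }
    size_t size() const { return storage.size(); }
};

// The preprocessor branch is confined to this helper, so callers stay directive-free.
template <typename T, typename Container>
inline void resizeForFilter(Container & res_data, size_t reserve_size)
{
#if defined(MEMORY_SANITIZER)
    // MSan does not see writes done by the AVX-512 intrinsics, so touch the memory explicitly.
    res_data.resize_fill(reserve_size, static_cast<T>(0));
#else
    res_data.resize(reserve_size);
#endif
}

// A macro that simply forwards its arguments, standing in for a code-generating macro.
#define DEMO_WRAP(...) __VA_ARGS__

// The helper call sits inside macro arguments without any embedded directive.
inline size_t demoResize(size_t n)
{
    DemoArray<int> arr;
    DEMO_WRAP(resizeForFilter<int>(arr, n);)
    return arr.size();
}
```

In this standalone sketch `MEMORY_SANITIZER` is undefined unless passed on the command line (for example with `-DMEMORY_SANITIZER`), so the plain `resize` branch is compiled by default.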

guowangy added 6 commits July 27, 2022 13:30

CpuId: add AVX512VBMI2 detection

50fdbcd

TargetSpecific: add AVX512VBMI2 support

e6752d6

ColumnVector: optimize filter with AVX512VBMI2 compress store

7820e82

ColumnVector: bug fix for unit test failure

d781ed5

ColumnVector: add unit test for filter

6d7bfc3

ColumnVector: avoid calling resize too frequently

b772147

robot-clickhouse added the pr-performance label (Pull request with some performance improvements) Jul 27, 2022

rschu1ze self-assigned this Jul 27, 2022

alexey-milovidov added the can be tested label (Allows running workflows for external contributors) Jul 30, 2022

rschu1ze reviewed Jul 31, 2022

guowangy added 2 commits August 1, 2022 10:16

ColumnVector: naming style fix

b05be56

ColumnVector: refactory to use TargetSpecific::Default::doFilterAligned

6a67147

guowangy force-pushed the filter-vbmi2 branch from 82bf07e to 6a67147 August 1, 2022 05:41

Merge master and resolve conflict

6a72132

rschu1ze merged commit 00a7c87 into ClickHouse:master Aug 3, 2022

CurtizJ mentioned this pull request Aug 4, 2022

Use of uninitialized value in ColumnSparse.Filter (unit test) #39895

Closed

alexey-milovidov mentioned this pull request Aug 8, 2022

Revert "ColumnVector: optimize filter with AVX512VBMI2 compress store" #39963

Merged

guowangy commented Aug 8, 2022

rschu1ze mentioned this pull request Aug 9, 2022

Revert the revert of "ColumnVector: optimize filter with AVX512 VBMI2 compress store" #40033

Merged
