ColumnVector: optimize filter with AVX512VBMI2 compress store#39633
ColumnVector: optimize filter with AVX512VBMI2 compress store#39633rschu1ze merged 9 commits intoClickHouse:masterfrom
Conversation
|
Thanks for the adjustments. Dockerhub infrastructure is shaky today ... I restarted the CI builds. We had a problem a few weeks ago that AVX512 code could not be tested in CI because we lacked machines with AVX512 capabilities. That should be fixed by now. |
|
Stress tests on nodes with AVX-512 are what I had in mind. Performance tests on nodes with AVX-512 are a different story. These are nice-to-have IMHO but I guess we want to run performance tests on a pool of machines with exactly the same specs. Otherwise, finding performance regressions/improvements in PRs like this becomes playing lottery. |
A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option) |
I know. What I was saying is that if there is a regression in AVX-512 code, then it can only be reliably detected if all machines in the performance pool are AVX-512-enabled. |
They do, the following lines are from the pages above
|
|
@rschu1ze please resubmit #39895 (comment) |
| /// to avoid calling resize too frequently, resize to reserve buffer. | ||
| if (reserve_size - current_offset < SIMD_BYTES) | ||
| { | ||
| reserve_size += alloc_size; | ||
| res_data.resize(reserve_size); | ||
| alloc_size *= 2; | ||
| } |
There was a problem hiding this comment.
@alexey-milovidov @rschu1ze
I am able to reproduce #39895 if built and unittest with msan.
The problem is happen here: we reserve a buffer in advance and then use the pointer to write data with AVX512 instruction. msan complains for such behaviour.
If I replace Line 550 with explicitly filling to touch memory, the problem gone:
res_data.resize_fill(reserve_size, static_cast<T>(0));Should I disable it in MEMORY_SANITIZER mode like LZ4_decompress_faster.cpp#L279?
There was a problem hiding this comment.
For completeness, I'll paste my minimal repro, but you were obviously quicker 😄
Explicitly initialization of allocated memory via resize_fill adds unnecessary performance overhead. So yes, I would favor !defined(MEMORY_SANITIZER) like in the file you mentioned. What about something like
#if defined(MEMORY_SANITIZER)
res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
#else
res_data.resize(reserve_size);
#endif
TEST(ColumnSparse, Filter)
{
const size_t rows = 1000;
auto col_src = ColumnVector<UInt64>::create();
for (size_t i = 0; i < rows; ++i)
col_src->getData().push_back(1);
PaddedPODArray<UInt8> filter(rows);
for (size_t i = 0; i < rows; ++i)
filter[i] = i % 2 == 0;
auto col_dst = col_src->filter(filter, -1);
if (col_dst->compareAt(0, 0, *col_dst, 0) != 0)
{
throw Exception(error_code, "Columns are unequal");
}
}
There was a problem hiding this comment.
It sounds good so we can still test on other parts.
#if defined(MEMORY_SANITIZER)
res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
#else
res_data.resize(reserve_size);
#endifBut the compiler will raise an warning -Wembedded-directive to #if defined(...) since we are declaring the function within marco arguments. We may need to make it as inline function to avoid warning (like ColumnVector.cpp#L480 in this PR).
There was a problem hiding this comment.
Sounds good. Let's make that one-liner an inlineable function (like blsr).
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
ColumnVector: optimize filter with AVX512VBMI2 compress store
The patch set can have about 6% performance gain in SSB (SF=100) query
3.1,3.2,3.3(tested on Icelake: Xeon 8380 * 2 socket).