Skip to content

ColumnVector: optimize filter with AVX512VBMI2 compress store#39633

Merged
rschu1ze merged 9 commits intoClickHouse:masterfrom
guowangy:filter-vbmi2
Aug 3, 2022
Merged

ColumnVector: optimize filter with AVX512VBMI2 compress store#39633
rschu1ze merged 9 commits intoClickHouse:masterfrom
guowangy:filter-vbmi2

Conversation

@guowangy
Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

ColumnVector: optimize filter with AVX512VBMI2 compress store

The patch set can have about 6% performance gain in SSB (SF=100) query 3.1, 3.2, 3.3
(tested on Icelake: Xeon 8380 * 2 socket).

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Jul 27, 2022
@rschu1ze rschu1ze self-assigned this Jul 27, 2022
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jul 30, 2022
@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Aug 1, 2022

Thanks for the adjustments.

Dockerhub infrastructure is shaky today ... I restarted the CI builds.

We had a problem a few weeks ago that AVX512 code could not be tested in CI because we lacked machines with AVX512 capabilities. That should be fixed by now.
@Felixoid Could you briefly confirm?

@Felixoid
Copy link
Copy Markdown
Member

Felixoid commented Aug 1, 2022

we have r5, c5 and m5 instances in the stress-testers group. Are we speaking about performance tests RN?

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Aug 1, 2022

Stress tests on nodes with AVX-512 are what I had in mind.

Performance tests on nodes with AVX-512 are a different story. These are nice-to-have IMHO but I guess we want to run performance tests on a pool of machines with exactly the same specs. Otherwise, finding performance regressions/improvements in PRs like this becomes playing lottery.

@Felixoid
Copy link
Copy Markdown
Member

Felixoid commented Aug 1, 2022

These are nice-to-have IMHO but I guess we want to run performance tests on a pool of machines with exactly the same specs.

A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)

@rschu1ze
Copy link
Copy Markdown
Member

rschu1ze commented Aug 1, 2022

A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)

I know. What I was saying is that if there is a regression in AVX-512 code, then it can only be reliably detected if all machines in the performance pool are AVX-512-enabled.

@Felixoid
Copy link
Copy Markdown
Member

Felixoid commented Aug 1, 2022

A performance test is running the same set of queries twice on the same host. One for the PR artifact, and one for the most recent release (not 100% sure, but it's the most probable option)

I know. What I was saying is that if there is a regression in AVX-512 code, then it can only be reliably detected if all machines in the performance pool are AVX-512-enabled.

They do, the following lines are from the pages above

C5 instances provide support for the new Intel Advanced Vector Extensions 512 (AVX-512)
M5 instances provide support for the Intel Advanced Vector Extensions 512 (AVX-512)
R5 instances provide support for the Intel Advanced Vector Extensions 512 (AVX-512)

@alexey-milovidov
Copy link
Copy Markdown
Member

@rschu1ze please resubmit #39895 (comment)

Comment on lines +546 to +552
/// to avoid calling resize too frequently, resize to reserve buffer.
if (reserve_size - current_offset < SIMD_BYTES)
{
reserve_size += alloc_size;
res_data.resize(reserve_size);
alloc_size *= 2;
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexey-milovidov @rschu1ze
I am able to reproduce #39895 if built and unittest with msan.

The problem is happen here: we reserve a buffer in advance and then use the pointer to write data with AVX512 instruction. msan complains for such behaviour.
If I replace Line 550 with explicitly filling to touch memory, the problem gone:

res_data.resize_fill(reserve_size, static_cast<T>(0));

Should I disable it in MEMORY_SANITIZER mode like LZ4_decompress_faster.cpp#L279?

Copy link
Copy Markdown
Member

@rschu1ze rschu1ze Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, I'll paste my minimal repro, but you were obviously quicker 😄

Explicitly initialization of allocated memory via resize_fill adds unnecessary performance overhead. So yes, I would favor !defined(MEMORY_SANITIZER) like in the file you mentioned. What about something like

#if defined(MEMORY_SANITIZER)
    res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
#else
    res_data.resize(reserve_size);
#endif

TEST(ColumnSparse, Filter)
{
    const size_t rows = 1000;

    auto col_src = ColumnVector<UInt64>::create();
    for (size_t i = 0; i < rows; ++i)
        col_src->getData().push_back(1);

    PaddedPODArray<UInt8> filter(rows);
    for (size_t i = 0; i < rows; ++i)
     filter[i] = i % 2 == 0;

    auto col_dst = col_src->filter(filter, -1);

    if (col_dst->compareAt(0, 0, *col_dst, 0) != 0)
    {
        throw Exception(error_code, "Columns are unequal");
    }
}

Copy link
Copy Markdown
Contributor Author

@guowangy guowangy Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds good so we can still test on other parts.

#if defined(MEMORY_SANITIZER)
    res_data.resize_fill(reserve_size, static_cast<T>(0)); // MSan doesn't recognize that all allocated memory is written by AVX-512 intrinsics.
#else
    res_data.resize(reserve_size);
#endif

But the compiler will raise an warning -Wembedded-directive to #if defined(...) since we are declaring the function within marco arguments. We may need to make it as inline function to avoid warning (like ColumnVector.cpp#L480 in this PR).

Copy link
Copy Markdown
Member

@rschu1ze rschu1ze Aug 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let's make that one-liner an inlineable function (like blsr).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants