Skip to content

ColumnVector::replicate() is even more readily vectorized now#9442

Merged
alexey-milovidov merged 2 commits intomasterfrom
akz/column_vector_replicate_faster
Mar 3, 2020
Merged

ColumnVector::replicate() is even more readily vectorized now#9442
alexey-milovidov merged 2 commits intomasterfrom
akz/column_vector_replicate_faster

Conversation

@Akazz
Copy link
Copy Markdown
Contributor

@Akazz Akazz commented Feb 28, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Another minor performance improvement to ColumnVector::replicate() - an even further improvement to #9293

This is most useful for generating synthetic data in (performance) tests.

Notice the overhead decrease:

"Before":

  25.18%  QueryPipelineEx  clickhouse  [.] DB::ColumnVector<unsigned short>::replicate                                                                                                                                                              
  23.18%  QueryPipelineEx  clickhouse  [.] DB::ColumnVector<unsigned int>::replicate                                                                                                                                                                
  19.81%  QueryPipelineEx  clickhouse  [.] DB::(anonymous namespace)::TypedExecutorInvoker<DB::FunctionsLogicalDetail::AndImpl, ... >::apply
  11.98%  QueryPipelineEx  clickhouse  [.] DB::(anonymous namespace)::NumbersSource::generate
  ...

"After":

  33.29%  QueryPipelineEx  clickhouse  [.] DB::(anonymous namespace)::TypedExecutorInvoker<DB::FunctionsLogicalDetail::AndImpl, ... >::apply
  22.21%  QueryPipelineEx  clickhouse  [.] DB::(anonymous namespace)::NumbersSource::generate
   9.35%  QueryPipelineEx  clickhouse  [.] DB::ColumnVector<unsigned int>::replicate
   4.43%  QueryPipelineEx  clickhouse  [.] DB::ColumnVector<unsigned short>::replicate
  ...

@Akazz Akazz added pr-performance Pull request with some performance improvements no-docs-needed labels Feb 28, 2020
@amosbird
Copy link
Copy Markdown
Collaborator

It would be ideal to see the changed assembly and with a comment about how this helps vectorization. BTW is it compiler related?

@alexey-milovidov
Copy link
Copy Markdown
Member

@amosbird Yes, it is absolutely unclear how does it help vectorization. Akazz invited us to guess :)

@alexey-milovidov
Copy link
Copy Markdown
Member

Unit tests — fail: 1, passed: 5205

It's my mistake that was in master.
We can merge this PR after adding a comment...

@alexey-milovidov
Copy link
Copy Markdown
Member

BTW, I don't see the relevant changes in performance test. We can merge anyway.

@Akazz
Copy link
Copy Markdown
Contributor Author

Akazz commented Mar 2, 2020

First off, here's the request that I am testing against:

SELECT count() FROM (SELECT materialize(toUInt16(1)) AS x1, materialize(toUInt32(1)) AS x2 FROM numbers(20000000000)) WHERE NOT ignore(and(x1,x2))

I found no significant difference in performance depending on compiler (GCC 9.2.1 vs Clang 9.0.0), although the generated code can be quite different.

Here is a snippet of disassembly as reported by perf (compiled with GCC 9.2.1):

       │         auto it = res->getData().begin();
       │       mov    %r8,%rax
       │       shl    $0x2,%rbx
       │       nop
       │           boost::iterators::detail::iterator_adaptor_assert_traversal<my_traversal, cat>();
       │
       │           void advance(typename super_t::difference_type n)
       │           {
       │               BOOST_ITERATOR_ADAPTOR_ASSERT_TRAVERSAL(random_access_traversal_tag)
       │               m_iterator += n;
       │140:┌─→mov    (%r9,%rdi,2),%rdx
       │    │  lea    (%r8,%rdx,4),%rcx
       │    │          BOOST_ITERATOR_ADAPTOR_ASSERT_TRAVERSAL(random_access_traversal_tag)
       │    │          // Maybe readd with same_distance
       │    │          //           BOOST_STATIC_ASSERT(
       │    │          //               (detail::same_category_and_difference<Derived,OtherDerived>::value)
       │    │          //               );
       │    │          return y.base() - m_iterator;
       │    │  mov    %rcx,%rdx
       │    │  sub    %rax,%rdx
       │    │    for (size_t i = 0; i < size; ++i)
       │    │    {
       │    │        const auto span_end = res->getData().begin() + offsets[i];
       │    │        for (; it < span_end; ++it)
       │    │  test   %rdx,%rdx
       │    │↓ jle    9031114 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 174
       │    │  mov    0x10(%rbp),%rsi
       │    │  add    %rdi,%rsi
       │    │  nop
       │    │            *it = data[i];
  0.03 │160:│  mov    (%rsi),%edx
       │    │      void increment() { ++m_iterator; }
 33.05 │    │  add    $0x4,%rax
       │    │  mov    %edx,-0x4(%rax)
       │    │          return y.base() - m_iterator;
 34.47 │    │  mov    %rcx,%rdx
       │    │  sub    %rax,%rdx
       │    │        for (; it < span_end; ++it)
 32.24 │    │  test   %rdx,%rdx
       │    │↑ jg     9031100 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 160
       │    │    for (size_t i = 0; i < size; ++i)
       │174:│  add    $0x4,%rdi
       │    ├──cmp    %rbx,%rdi
       │    └──jne    90310e0 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 140

Unfortunately, compiler is unable to vectorize the inner loop above.

The proposed patch makes it right.
GCC 9.2.1:

       │       nop
       │                 *curr++ = data[i];
 23.74 │1a0:   movups %xmm0,(%rax)
       │             while (curr < span_end)
 66.17 │       add    $0x10,%rax
  8.05 │       cmp    %rsi,%rax
       │     ↑ jne    902cf00 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1a0
       │       mov    %rdi,%rsi
       │       and    $0xfffffffffffffffc,%rsi
       │       lea    (%rdx,%rsi,4),%rax
       │       cmp    %rsi,%rdi
       │     ↓ je     902cf3c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1dc
       │                 *curr++ = data[i];
       │       mov    (%r9),%edx
       │       lea    0x4(%rax),%rsi
       │       mov    %edx,(%rax)
       │             while (curr < span_end)
       │       cmp    %rsi,%r8
       │     ↓ jbe    902cf3c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1dc
       │                 *curr++ = data[i];
       │       mov    %edx,0x4(%rax)
       │       lea    0x8(%rax),%rdx
       │             while (curr < span_end)
       │       cmp    %rdx,%r8
       │     ↓ jbe    902cf3c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1dc
       │                 *curr++ = data[i];
       │       mov    (%r9),%edx
       │       mov    %edx,0x8(%rax)
       │1dc:   mov    %r10,%rdx
       │         for (size_t i = 0; i < size; ++i)
  0.55 │1df:   cmp    %rcx,%rbx
       │     ↑ jne    902ce90 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 130
       │     #else

Clang 9.0.0:

       │     ↓ je     8a6cd52 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1a2
       │       movd   (%rcx),%xmm0
       │       pshufd $0x0,%xmm0,%xmm0
       │       mov    %r15,%r11
       │       sub    %rax,%r11
       │       xor    %eax,%eax
       │       nop
       │       nop
  1.69 │140:┌─→movdqu %xmm0,(%rdi,%rax,4)
 37.23 │    │  movdqu %xmm0,0x10(%rdi,%rax,4)
 26.62 │    │  movdqu %xmm0,0x20(%rdi,%rax,4)
 13.69 │    │  movdqu %xmm0,0x30(%rdi,%rax,4)
 19.85 │    │  add    $0x10,%rax
       │    ├──add    $0x2,%r11
       │    └──jne    8a6ccf0 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 140
       │       test   %r15,%r15
       │     ↓ je     8a6cd2a <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 17a
       │                 *curr++ = data[i];
       │166:   movd   (%rcx),%xmm0
       │       pshufd $0x0,%xmm0,%xmm0
       │       movdqu %xmm0,(%rdi,%rax,4)
       │       movdqu %xmm0,0x10(%rdi,%rax,4)
       │17a:   lea    (%rdi,%r13,4),%rdi
       │             while (curr < span_end)
       │       cmp    %r13,%r12
       │       mov    -0x70(%rbp),%r15
       │     ↑ je     8a6cc50 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, a0

@Akazz
Copy link
Copy Markdown
Contributor Author

Akazz commented Mar 2, 2020

I've looked into it a little bit deeper and isolated the problem to be with ... < - the comparison operator! Clearly, our existing implementation of PODArray::iterator::operator<() prevents compiler from employing further optimizations. See the new patch!
GCC 9.2.1:

       │                 *it = data[i];
 18.65 │1b0:┌─→movups %xmm0,(%rax)
       │    │        for (; it != span_end; ++it)
 70.93 │    │  add    $0x10,%rax
  8.78 │    ├──cmp    %rsi,%rax
       │    └──jne    90317e0 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1b0
  0.18 │       mov    %r9,%rsi
       │       and    $0xfffffffffffffffc,%rsi
  0.18 │       lea    (%rcx,%rsi,4),%rax
       │       cmp    %rsi,%r9
       │     ↓ je     903181c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1ec
       │                 *it = data[i];
       │       mov    (%r8),%ecx
       │           }
       │
       │           void increment() { ++m_iterator; }
       │       lea    0x4(%rax),%rsi
       │       mov    %ecx,(%rax)
       │             for (; it != span_end; ++it)
       │       cmp    %rsi,%rdi
       │     ↓ je     903181c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1ec
       │                 *it = data[i];
       │       mov    %ecx,0x4(%rax)
       │       lea    0x8(%rax),%rcx
       │             for (; it != span_end; ++it)
       │       cmp    %rcx,%rdi
       │     ↓ je     903181c <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1ec
       │                 *it = data[i];
       │       mov    (%r8),%ecx
       │       mov    %ecx,0x8(%rax)
  0.18 │1ec:   mov    %r10,%rcx
       │         for (size_t i = 0; i < size; ++i)
  0.18 │1ef:   cmp    %rdx,%rbx
       │     ↑ jne    9031760 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 130

Clang 9.0.0

       │       mov    %r12,%rax
       │       sub    %r15,%rax
       │       xor    %r15d,%r15d
  0.92 │140:┌─→movdqu %xmm0,(%rdi,%r15,4)
 30.55 │    │  movdqu %xmm0,0x10(%rdi,%r15,4)
 27.69 │    │  movdqu %xmm0,0x20(%rdi,%r15,4)
 20.71 │    │  movdqu %xmm0,0x30(%rdi,%r15,4)
 18.65 │    │  add    $0x10,%r15
       │    ├──add    $0x2,%rax
       │    └──jne    8a6cda0 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 140
int>::replicate(DB::PODArray<unsigned long, 4096ul, 140
       │       test   %r12,%r12
       │     ↓ je     8a6cde0 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 180
       │                 *it = data[i];
       │16a:   movd   (%rcx),%xmm0
       │       pshufd $0x0,%xmm0,%xmm0
       │       movdqu %xmm0,(%rdi,%r15,4)
       │       movdqu %xmm0,0x10(%rdi,%r15,4)
       │             for (; it != span_end; ++it)
  0.11 │180:   cmp    %rbx,%r11
       │     ↓ jne    8a6ce01 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1a1
       │       mov    %rdx,%rdi
       │       mov    %r8,%r15
       │       mov    -0x48(%rbp),%r8
       │         for (size_t i = 0; i < size; ++i)
       │       add    $0x1,%rsi
       │       cmp    %r14,%rsi
       │     ↑ jne    8a6cd10 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, b0
       │     ↓ jmpq   8a6ce96 <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 236
       │1a1:   lea    (%rdi,%rbx,4),%rdi
       │       mov    %r8,%r15
       │       mov    -0x48(%rbp),%r8
       │       nop
       │             for (; it != span_end; ++it)
       │1b0:   sub    %rdi,%r13
       │       mov    %r13d,%eax
       │       shr    $0x2,%eax
       │       add    $0x1,%eax
       │       and    $0x7,%rax
       │     ↓ je     8a6ce3d <DB::ColumnVector<unsigned int>::replicate(DB::PODArray<unsigned long, 4096ul, 1dd
       │       neg    %rax
       │       nop
       │       nop
       │                 *it = data[i];
       │1d0:   mov    (%rcx),%ebx
       │       mov    %ebx,(%rdi)
       │           }

@Akazz Akazz marked this pull request as ready for review March 2, 2020 19:05
@danlark1
Copy link
Copy Markdown
Contributor

danlark1 commented Mar 2, 2020

The problem is in boost::iterator https://gcc.godbolt.org/z/qgwFCF

<source>:44:9: remark: loop not vectorized: could not determine number of loop iterations [-Rpass-analysis=loop-vectorize]

        for (; it < span_end; ++it)

        ^

@alexey-milovidov
Copy link
Copy Markdown
Member

It's clearly better.

SELECT count() FROM (SELECT materialize(1) AS x1, materialize(1) AS x2, materialize(1) AS x3, materialize(1) AS x4, materialize(1) AS x5, materialize(1) AS x6, materialize(1) AS x7, materialize(1) AS x8, materialize(1) AS x9, materialize(1) AS x10 FROM numbers(500000000)) WHERE NOT ignore(xor(x1,x2,x3,x4,x5,x6,x7,x8,x9,x10))
  • 39% faster!

@alexey-milovidov alexey-milovidov merged commit f6a19ba into master Mar 3, 2020
@alexey-milovidov alexey-milovidov deleted the akz/column_vector_replicate_faster branch March 3, 2020 19:06
@alexey-milovidov
Copy link
Copy Markdown
Member

@danlark1 Thank you! I will try to get rid of boost::iterator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

no-docs-needed pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants