
Counting utf8 (and utf16) code words. #28

Merged: lemire merged 10 commits into master from dlemire/counting on Mar 17, 2021
@lemire lemire commented Mar 9, 2021

As you would expect, you can count UTF-8 code points at high speed:

AMD Rome (GNU GCC 10):

kernel     speed
fallback    6.506 GB/s
SSE        34.045 GB/s
AVX        60.680 GB/s

ARM M1 (Apple):

kernel     speed
fallback    3.619 GB/s
NEON       52.934 GB/s

This provides tests and benchmarks for UTF-8 counting, but not yet for UTF-16.
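For readers unfamiliar with the technique, here is a minimal scalar sketch of how code points can be counted in UTF-8 that is already known to be valid: every code point contributes exactly one byte that is not a continuation byte (`10xxxxxx`), so counting those bytes gives the code point count. The function name `count_utf8_code_points` is hypothetical and this is not necessarily the code in this PR; it only illustrates the idea that the SIMD kernels apply to many bytes at once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not taken from the PR: counts code points in
// *valid* UTF-8 by counting every byte that is not a continuation byte.
inline std::size_t count_utf8_code_points(const char *input, std::size_t length) {
  assert(input != nullptr || length == 0);
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const std::uint8_t byte = static_cast<std::uint8_t>(input[i]);
    // Continuation bytes look like 10xxxxxx; ASCII and leading bytes do not.
    count += (byte & 0xC0) != 0x80;
  }
  return count;
}
```

Under these assumptions, the three-byte sequence `"\xE2\x82\xAC"` (the euro sign) counts as one code point, since only its first byte is a leading byte. The result is only meaningful for valid input; the sketch does no validation.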

Fixes https://github.com/lemire/simdutf/issues/8

Fixes https://github.com/lemire/simdutf/issues/27

@lemire lemire requested a review from WojciechMula March 9, 2021 23:46

lemire commented Mar 10, 2021

@WojciechMula Added benchmarks and tests for UTF-16 character counting.

lemire commented Mar 10, 2021

@WojciechMula As a sanity check, I tried the following NEON function to count UTF-16 code points:

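size_t neon_count_16(const char16_t *input, size_t length) {
  size_t count{0};
  size_t pos{0};
  // Low surrogates (0xDC00..0xDFFF) are the second half of a pair, so they are not counted.
  const uint16x8_t low  = vmovq_n_u16(0xDC00);
  const uint16x8_t high = vmovq_n_u16(0xDFFF);

  while(pos + 8 < length) {
    // Each 16-bit lane grows by at most 1 per iteration, so a block is capped at
    // 0xFFFF iterations (0xFFFF * 8 code units) to keep the lanes from overflowing.
    size_t remaining = length - pos;
    size_t next_stop = pos + (remaining > 0xFFFF * 8 ? 0xFFFF * 8 : remaining);
    uint16x8_t counter = vdupq_n_u16(0);
    for(;pos + 8 < next_stop; pos += 8) {
      uint16x8_t in = vld1q_u16(reinterpret_cast<const uint16_t*>(input + pos));
      // A matching lane is 0xFFFF (that is, -1), so subtracting the mask adds 1.
      counter = vsubq_u16(counter, vorrq_u16(vcgtq_u16(in,high), vcltq_u16(in,low)));
    }
    // Widen and sum the eight 16-bit lanes into a single 64-bit total.
    count += vpaddd_u64(vpaddlq_u32(vpaddlq_u16(counter)));
  }
  // The scalar routine handles whatever is left after the last full block.
  return count + scalar::utf16::count_code_points(input + pos, length - pos);
}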

It was slower.

In any case, it was enough to convince me that my code is not absolutely terrible from a performance point of view.
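As a point of comparison, the same counting rule can be written as a plain scalar loop: in valid UTF-16, every code point contributes exactly one code unit that is not a low surrogate (0xDC00 to 0xDFFF). The sketch below uses a hypothetical name, `count_utf16_code_points`, and is not claimed to match `scalar::utf16::count_code_points` from this PR; it only shows the predicate that the NEON comparison evaluates eight lanes at a time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not taken from the PR: counts code points in
// *valid* UTF-16 by skipping low surrogates (0xDC00..0xDFFF).
inline std::size_t count_utf16_code_points(const char16_t *input, std::size_t length) {
  assert(input != nullptr || length == 0);
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const char16_t unit = input[i];
    // Same test as the NEON version: unit < 0xDC00 or unit > 0xDFFF.
    count += (unit < 0xDC00) || (unit > 0xDFFF);
  }
  return count;
}
```

A surrogate pair therefore counts once (through its high surrogate), while any code unit outside the surrogate range counts as a code point on its own.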

@WojciechMula WojciechMula left a comment

Good job! And a great amount of work.


lemire commented Mar 10, 2021

@WojciechMula My expectation is that these functions can be made much faster but that's ok. This is just the foundation.

@WojciechMula

@lemire I'm of course for merging this PR. You did a great job. Please do not wait for my approval in the future; just merge when you are happy with the shape of the code. At this pre-alpha stage, I don't think the review process needs to be very strict. It's easier to have everything in master.

@lemire lemire merged commit ac225b2 into master Mar 17, 2021
@lemire lemire deleted the dlemire/counting branch July 7, 2021 19:34
Development

Successfully merging this pull request may close these issues.

add fast utf16 code-point counting functions
add fast code-point counter functions
