Counting utf8 (and utf16) code words. by lemire · Pull Request #28 · simdutf/simdutf

lemire · 2021-03-09T23:46:03Z

As you would expect, you can count UTF8 code points at high speed:

AMD Rome (GNU GCC 10):

kernel	speed
fallback	6.506 GB/s
SSE	34.045 GB/s
AVX	60.680 GB/s

ARM M1 (Apple)

kernel	speed
fallback	3.619 GB/s
NEON	52.934 GB/s

This provides tests and benchmarks for UTF-8 counting, but not yet for UTF-16.
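For readers unfamiliar with the technique, here is a minimal scalar sketch of what counting UTF-8 code points amounts to: every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point. The function name `count_utf8_code_points_sketch` is hypothetical and is not the simdutf API, and the sketch assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper for illustration only (not part of simdutf).
// Counts UTF-8 code points by counting the bytes that are NOT
// continuation bytes. Assumes the input is valid UTF-8.
inline std::size_t count_utf8_code_points_sketch(const char *input,
                                                 std::size_t length) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const std::uint8_t byte = static_cast<std::uint8_t>(input[i]);
    // Continuation bytes have their top two bits equal to 10.
    count += (byte & 0xC0) != 0x80;
  }
  return count;
}
```

The SIMD kernels benchmarked above presumably apply the same per-byte test to many bytes at once; this loop only illustrates the rule, not how those kernels are written.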

Fixes https://github.com/lemire/simdutf/issues/8

Fixes https://github.com/lemire/simdutf/issues/27

lemire · 2021-03-10T20:11:32Z

@WojciechMula Added benchmarks and tests for UTF-16 character counting.

lemire · 2021-03-10T20:41:48Z

@WojciechMula As a sanity check, I tried the following NEON function to count UTF16 words...

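#include <arm_neon.h>
#include <cstddef>

// Counts UTF-16 code points: every 16-bit word except low surrogates (0xDC00..0xDFFF).
size_t neon_count_16(const char16_t *input, size_t length) {
  size_t count{0};
  size_t pos{0};
  const uint16x8_t low  = vmovq_n_u16(0xDC00);
  const uint16x8_t high = vmovq_n_u16(0xDFFF);
  // Each 16-bit lane grows by at most 1 per iteration, so the accumulator
  // must be flushed after at most 0xFFFF iterations (0xFFFF * 8 words).
  const size_t max_block = size_t(0xFFFF) * 8;
  while (pos + 8 <= length) {
    size_t next_stop = pos + (length - pos > max_block ? max_block : length - pos);
    uint16x8_t counter = vdupq_n_u16(0);
    for (; pos + 8 <= next_stop; pos += 8) {
      uint16x8_t in = vld1q_u16(reinterpret_cast<const uint16_t*>(input + pos));
      // The mask is 0xFFFF (that is, -1) for each word outside the
      // low-surrogate range, so subtracting it adds 1 to that lane.
      counter = vsubq_u16(counter, vorrq_u16(vcgtq_u16(in, high), vcltq_u16(in, low)));
    }
    // Widen and sum the eight 16-bit lanes into a single 64-bit total.
    count += vpaddd_u64(vpaddlq_u32(vpaddlq_u16(counter)));
  }
  // Leftover words (fewer than 8) go through the scalar routine.
  return count + scalar::utf16::count_code_points(input + pos, length - pos);
}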

It was slower.

In any case, it was enough to convince me that my code is not absolutely terrible from a performance point of view.
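The NEON function above counts every 16-bit word that falls outside the low-surrogate range 0xDC00..0xDFFF, since each code point contributes exactly one such word. A portable scalar sketch of the same rule may make that easier to see. The name `count_utf16_code_points_sketch` is hypothetical (it is not the library's `scalar::utf16::count_code_points`), and the sketch assumes valid UTF-16 in native byte order.

```cpp
#include <cstddef>

// Hypothetical helper for illustration only (not part of simdutf).
// Counts UTF-16 code points by counting the 16-bit words that are NOT
// low surrogates. Assumes valid UTF-16 in native byte order.
inline std::size_t count_utf16_code_points_sketch(const char16_t *input,
                                                  std::size_t length) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < length; i++) {
    const char16_t word = input[i];
    // Low surrogates occupy 0xDC00..0xDFFF: the top six bits are 110111.
    count += (word & 0xFC00) != 0xDC00;
  }
  return count;
}
```

A surrogate pair therefore counts once (for its high surrogate), and every word in the Basic Multilingual Plane counts once.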

WojciechMula

Good job! And a great amount of work.

lemire · 2021-03-10T21:35:11Z

@WojciechMula My expectation is that these functions can be made much faster but that's ok. This is just the foundation.

WojciechMula · 2021-03-12T19:07:18Z

@lemire I'm of course for merging this PR. You did a great job. Please do not wait for me for any approvals in the future, just merge when you are happy about the code shape. I think at this stage of pre-alpha the review process shouldn't be very strict. It's easier to have everything in master.

Counting utf8 (and utf16) code words.

3bac09b

lemire requested a review from WojciechMula March 9, 2021 23:46

lemire commented Mar 9, 2021

Comment thread src/scalar/utf8.h

lemire added 4 commits March 10, 2021 11:50

Adding tests for count_utf16

ba0f76a

I am unhappy with this count16 but before I produce something better,…

cfa69d6

… I want to get it working everywhere.

Now works on x64 (but slow).

df13c40

It is slow, but it works.

e8f3774

More cleaning.

138f53d

WojciechMula reviewed Mar 10, 2021

Comment thread include/simdutf/arm64/simd.h

Comment thread include/simdutf/haswell/simd.h

Comment thread include/simdutf/implementation.h

Comment thread src/scalar/utf16.h Outdated

Comment thread src/scalar/utf8.h

Comment thread tests/count_utf16.cpp

lemire added 3 commits March 10, 2021 16:16

Removing spurious lines.

1294ce0

Marking as const.

70102fe

Let us go with the simpler UTF16 validation routine.

9e5fd7c

Merge branch 'master' into dlemire/counting

8db01d8

lemire merged commit ac225b2 into master Mar 17, 2021

lemire deleted the dlemire/counting branch July 7, 2021 19:34

lemire merged 10 commits into master from dlemire/counting
