ARROW-9131: [C++] Faster ascii_lower and ascii_upper.#7434
ARROW-9131: [C++] Faster ascii_lower and ascii_upper.#7434maartenbreddels wants to merge 5 commits intoapache:masterfrom
Conversation
Following up on apache#7418 I tried and benchmarked a different way for * ascii_lower * ascii_upper Before (lower is similar): ``` -------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------- AsciiUpper_median 4922843 ns 4918961 ns 10 bytes_per_second=3.1457G/s items_per_second=213.17M/s ``` After: ``` -------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------- AsciiUpper_median 1391272 ns 1390014 ns 10 bytes_per_second=11.132G/s items_per_second=754.363M/s ``` This is a 3.7x speedup (on a AMD machine). Using http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU I found 4.6x speedup for clang 9, 6.4x for GCC 9.2. Also, the test is expanded a bit to include a non-ascii codepoint, to make explicit it is fine to upper or lower case a utf8 string. The non-overlap encoding of utf8 make this ok (see section 2.5 of Unicode Standard Core Specification v13.0).
pitrou
left a comment
There was a problem hiding this comment.
This is neat. It appears that both gcc and clang auto-vectorize the conditional expression.
| const uint8_t utf8_code_unit = *input++; | ||
| // Code units in the range [a-z] can only be an encoding of an ascii | ||
| // character/codepoint, not the 2nd, 3rd or 4th code unit (byte) of an different | ||
| // codepoint. This guaranteed by non-overal design of the unicode standard. (see |
|
I also thought that we could do a bit check instead of the range check, e.g. The generated code looks vectorized indeed. I didn't look into the details of the generated code by clang and GCC, it seems their performance is a bit different, so we might be able to squeeze out a bit more if we want. Happy to look into that later (create a new issue), but I rather spend my time on other functions now. |
|
It's ok, there's no need to further optimize those functions. |
|
Travis-CI failure is unrelated. |
|
Thank you @maartenbreddels ! |
|
Thanks, let me know if my workflow is ok, or if I can make things go smoother. PS: I am looking for a document describing the kernel design. I see these two cases ( PS2: Are there alternative channels for quick/small questions, or is this fine? |
|
I think the current approach (implement kernels one-by-one) is reasonable and manageable for us (and for you as well I hope). I don't think there's much documentation for now around the kernel design. Basically, kernels should usually be able to process two kinds of inputs (represented as For quick development questions, we have a public chat instance at https://ursalabs.zulipchat.com/ - just register though and you can chat with the team. The main channel there is the "dev" channel, and Zulip allows you to create subtopics - don't hesitate to use those! |
No, this is fine.
👍 |
Can we use [email protected] so everyone can see the discussion and it's searchable via Google later?
Keep in mind that the new kernels framework is only a few weeks old (7ad49ee), so developer documentation is a bit behind, so don't hesitate to ask questions. |
I'm ok with that if not considered too noisy (like small cmake questions etc). |
|
It shouldn't be a problem. |
Following up on #7418 I tried and benchmarked a different way for
Before (lower is similar):
After:
This is a 3.7x speedup (on a AMD machine).
Using http://quick-bench.com/JaDErmVCY23Z1tu6YZns_KBt0qU I found 4.6x speedup for clang 9, 6.4x for GCC 9.2.
Also, the test is expanded a bit to include a non-ascii codepoint, to make explicit it is fine to upper
or lower case a utf8 string. The non-overlap encoding of utf8 make this ok (see section 2.5 of Unicode
Standard Core Specification v13.0).