-
Notifications
You must be signed in to change notification settings - Fork 1.2k
[utf8-validator] eliminate unnecessary comparison from must_be_2_3_continuation #2113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[utf8-validator] eliminate unnecessary comparison from must_be_2_3_continuation #2113
Conversation
|
@Validark I think your assumption is correct. I am running test, and I expect to merge your PR. |
| simd8<bool> is_third_byte = prev2 >= uint8_t(0xe0u); | ||
| simd8<bool> is_fourth_byte = prev3 >= uint8_t(0xf0u); | ||
| return is_third_byte ^ is_fourth_byte; | ||
| simdjson_inline simd8<uint8_t> must_be_2_3_continuation(const simd8<uint8_t> prev2, const simd8<uint8_t> prev3) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that the ARM change buys you something. Does it?
My expectation is that it is not going to affect the performance. Do you disagree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my use of analogous techniques in Zig, I would think the arm emit would have an equal number of instructions either way. I made the change to the arm implementation solely because I was thinking it has to have the same return type as the other implementations. I suppose I could have achieved that in a different way, but I'm not sure it makes a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good answer. I just wanted to make sure I understood the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I may need to retract my earlier statement. After getting my hands on the Apple Silicon CPU optimization guide, it appears that UQSUB has a latency of 3, while CMHI has a latency of 2 on performance cores. On efficiency cores, CMHI has a latency of 3 and it makes no difference. Not sure if other arches have a differing opinion of which instruction they prefer but it couldn't hurt as a micro-optimization to switch to CMHI.
|
@Validark I am pushing your proposal to another library (simdutf), and I am getting a nice boost: |
|
I am merging this. This is very nice. |
enable ternary logic optimization. Kudos to @Validark (see simdjson/simdjson#2113)
enable ternary logic optimization. Kudos to @Validark (see simdjson/simdjson#2113)
enable ternary logic optimization. Kudos to @Validark (see simdjson/simdjson#2113)
Hello, I just thought of this optimization while refactoring the code in simdjzon.
For Zig, which uses LLVM, this optimization does not occur automatically. For that, I have submitted an issue to LLVM.
However, this project compiles with multiple compilers which may or may not do this optimization automatically, so I have edited your code to reflect the optimization idea. It is very basic, and I did not test the modified C++ code. But it looks right to me. (I did test my Zig implementation)
Hope it works for you.
‒ Validark
P.S. I assumed
simd8<uint8_t> must23_80 = must23 & uint8_t(0x80);splats the0x80. Is that correct?