Skip to content

Conversation

@Validark
Copy link
Contributor

@Validark Validark commented Jan 27, 2024

Hello, I just thought of this optimization while refactoring the code in simdjzon.

For Zig, which uses LLVM, this optimization does not occur automatically. For that, I have submitted an issue to LLVM.

However, this project compiles with multiple compilers which may or may not do this optimization automatically, so I have edited your code to reflect the optimization idea. It is very basic, and I did not test the modified C++ code. But it looks right to me. (I did test my Zig implementation)

Hope it works for you.

‒ Validark

P.S. I assumed simd8<uint8_t> must23_80 = must23 & uint8_t(0x80); splats the 0x80. Is that correct?

@lemire
Copy link
Member

lemire commented Jan 27, 2024

@Validark I think your assumption is correct.

I am running test, and I expect to merge your PR.

simd8<bool> is_third_byte = prev2 >= uint8_t(0xe0u);
simd8<bool> is_fourth_byte = prev3 >= uint8_t(0xf0u);
return is_third_byte ^ is_fourth_byte;
simdjson_inline simd8<uint8_t> must_be_2_3_continuation(const simd8<uint8_t> prev2, const simd8<uint8_t> prev3) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that the ARM change buys you something. Does it?

My expectation is that it is not going to affect the performance. Do you disagree?

Copy link
Contributor Author

@Validark Validark Jan 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on my use of analogous techniques in Zig, I would think the arm emit would have an equal number of instructions either way. I made the change to the arm implementation solely because I was thinking it has to have the same return type as the other implementations. I suppose I could have achieved that in a different way, but I'm not sure it makes a difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good answer. I just wanted to make sure I understood the change.

Copy link
Contributor Author

@Validark Validark Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may need to retract my earlier statement. After getting my hands on the Apple Silicon CPU optimization guide, it appears that UQSUB has a latency of 3, while CMHI has a latency of 2 on performance cores. On efficiency cores, CMHI has a latency of 3 and it makes no difference. Not sure if other arches have a differing opinion of which instruction they prefer but it couldn't hurt as a micro-optimization to switch to CMHI.

@lemire
Copy link
Member

lemire commented Jan 28, 2024

@Validark I am pushing your proposal to another library (simdutf), and I am getting a nice boost:

simdutf/simdutf#365

@lemire
Copy link
Member

lemire commented Jan 28, 2024

I am merging this. This is very nice.

@lemire lemire merged commit 9b0435d into simdjson:master Jan 28, 2024
hkratz added a commit to rusticstuff/simdutf8 that referenced this pull request Nov 21, 2025
hkratz added a commit to rusticstuff/simdutf8 that referenced this pull request Nov 22, 2025
hkratz added a commit to rusticstuff/simdutf8 that referenced this pull request Dec 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants