Conversation
I'll review it in a couple of hours. Looks really good.
My main question would be: what are the performance characteristics of the following code on incorrect and on correct UTF-16 inputs?

```cpp
if (simdutf::validate_utf16())
  return simdutf::utf8_length_from_utf16();
return simdutf::utf8_length_from_utf16_with_replacement();
```

vs.

```cpp
return simdutf::utf8_length_from_utf16_with_replacement();
```
@anonrig I think that's the optimization I applied: we use the no-surrogate case as a happy path. :-)
Co-authored-by: Paul Dreik <[email protected]>
```cpp
// For remaining_next_mask, we start from 0x7FFFFFFFULL instead of
// 0xFFFFFFFFULL. We could also just shift remaining_mask right by one.
__mmask32 remaining_next_mask = 0x7FFFFFFFULL >> (32 - (size - pos));
__m512i input = _mm512_maskz_loadu_epi16(remaining_mask, in + pos);
```
I don't understand how this load doesn't segfault on strings near page edges. Perhaps I'm overlooking something, but I think you should step back by the number of positions the mask was shifted by and shift the mask the other way. Since we are in the large-input function we can't segfault at the beginning by doing this.
@erikcorry Masked loads in AVX-512 do not segfault.
Daniel Lemire, "Modern vector programming with masked loads and stores," in Daniel Lemire's blog, November 8, 2022, https://lemire.me/blog/2022/11/08/modern-vector-programming-with-masked-loads-and-stores/.
See also:
```cpp
}
__m512i input_next =
    _mm512_maskz_loadu_epi16(remaining_next_mask, in + pos + 1);
if (!match_system(big_endian)) {
```
Same problem here, but more?
Masked loads do not segfault if the mask only loads valid bytes.
That's amazing! (Claude lied to me on this!)
An implementation of my idea, and I think it also fixes the out-of-bounds read:
```cpp
using vector_u16 = simd16<uint16_t>;
constexpr size_t N = vector_u16::ELEMENTS; // 32 on AVX-512
if (N + 1 > size) {
```
An alternative way to handle this would be to check whether the first and last characters were in the same 32-character (64 byte) cache line:
```cpp
size_t aligned_start = (size_t)in & 0xffff'ffff'ffff'ffc0;
if (aligned_start == ((size_t)(in + size - 1) & 0xffff'ffff'ffff'ffc0)) {
  // Calculate mask for actual string contents.
  __m512i input1 = _mm512_maskz_loadu_epi16(
      contents_mask, (const char16_t *)aligned_start);
  // No more loads here.
  return length;
}
```
This would probably be faster than the current fallback for cases with a large number of small strings. (Can probably just reuse the last-few-characters code at the end of the current function, just with a different load address and mask.)
If the check fails, you can still fall through to the regular implementation below, knowing that the final load that fetches the last character can't fault: the string, however small, spans two 64-byte cache lines, and neither of those lines can fault.
Disadvantage is you have to mark it so that sanitizers don't get annoyed with the OOB read, but I think you already have this issue.
Rumour has it that Ice Lake is slow if you use AVX-512 only occasionally, because it has to wake up that section of the chip, but newer Intel CPUs and AMD's Zen 4 don't have that issue. So this might not pay off everywhere.
You are correct that for small strings we could do the full input in AVX-512, and given that we have established that masked loads are safe, it is not difficult... But I think you will agree that this would require performance testing.
I think we can make such changes in future PRs.
Thanks. I still want to merge this PR this week, but given Eric's work, I will delay the merge a bit.
* Alternative strategy for UTF-8 length from malformed UTF-16
* Don't expect any surrogates, skip work in this case
The implementations can be further improved, but I think that this is good enough as an initial implementation. Merging. Release soon after.
This adds functions of the type `utf8_length_from_utf16_with_replacement`: they compute how many bytes of UTF-8 data you would need to convert the (potentially invalid) UTF-16 input, where unpaired surrogates are turned into the replacement character. Observe that we do not yet support the conversion operation itself.

The functions `utf8_length_from_utf16_with_replacement` are fast, but slower than `utf8_length_from_utf16` because we need to effectively validate the content, so we have a double burden: validation and counting. It should still be quite fast.

We only support ARM + x64. I will open additional issues later.

I have also added an experimental (still buggy) script to help us add new functions to simdutf so that future work is faster. I have also tuned a couple of `utf8_length_from_utf16` implementations: AVX-512 and ARM now have faster functions. There are a few other minor changes.

Another change is that we make `match_system` a constexpr function.
fixes #849