Hi, I've been working on RVV-native Unicode conversion routines and have working, optimized implementations of validating utf8->utf32, utf8->utf16, and partial utf16->utf8 (see the last part for benchmarks).
I'd like to upstream this to simdutf in a custom backend, similar to how the icelake one works.
For testing, I generate random valid UTF-32 input, convert it to the input format, perform n random bit flips on it, and validate the output against the simdutf scalar implementation. Ideally I'd also like to use coverage-guided fuzzing, but I haven't been able to get fuzzing working on RISC-V yet.
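The bit-flip step can be sketched as follows (a minimal sketch with names of my choosing, not the actual harness); the corrupted buffer is then run through both the RVV routine and the scalar reference, and return values and outputs are compared:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stddef.h>

/* Flip n randomly chosen bits in buf. Starting from valid input, this
 * yields near-valid input, which exercises the validation edge cases. */
static void flip_random_bits(uint8_t *buf, size_t len, int n) {
    for (int i = 0; i < n; i++) {
        size_t bit = (size_t)rand() % (len * 8);
        buf[bit / 8] ^= (uint8_t)(1u << (bit % 8));
    }
}
```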
The code will be published to my RVV benchmark soon (it still needs some cleanup), hopefully with an associated article/blog post.
Edit: Here is the code: utf8_to_utf32/utf8_to_utf16, utf16_to_utf8
There are some open questions, though.
- What does one need to do, which files does one need to touch, and what tools are there for adding a new architecture to simdutf?
- Should we use the explicit intrinsics or the overloaded intrinsics?
The overloaded intrinsics are IMO more readable and easier to refactor.
From what I can tell, both are "mandated" by the RVV intrinsics spec, but while clang has supported them ever since it gained RVV intrinsics support (clang 16 and above), gcc currently doesn't, although upstream is working on it. I expect gcc to have support by the time RVV 1.0 hardware becomes more widely available. There is currently only one such board, the Kendryte K230, which is slowly being rolled out in batches.
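To make the difference concrete, here is a minimal illustration (a sketch, not code from my kernels; it needs an RVV 1.0 toolchain, so treat it as illustrative):

```c
#include <riscv_vector.h>

// Explicit form: element type and LMUL are part of the intrinsic name.
vuint8m1_t add_explicit(vuint8m1_t a, vuint8m1_t b, size_t vl) {
    return __riscv_vadd_vv_u8m1(a, b, vl);
}

// Overloaded form: one name for all types/LMULs, resolved from the
// argument types (works in clang >= 16; gcc support is still in progress).
vuint8m1_t add_overloaded(vuint8m1_t a, vuint8m1_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}
```

Refactoring from, say, m1 to m2 only requires changing the types with the overloaded form, while every call site has to change with the explicit one.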
- Which extensions should we target?
I think we should orient ourselves by the RVA profiles and only support the standard V extension, i.e. 8- to 64-bit-wide elements with a VLEN >= 128 bits, and not subsets like Zve64x.
Supporting Zvbb is also quite useful, as it has an endianness swap instruction, but I think we should make this optional and detect support from compiler settings.
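Detection from compiler settings could be as simple as checking the extension test macro (a sketch; I'm assuming the standard `__riscv_zvbb` macro that toolchains define when Zvbb is enabled on the command line):

```c
#if defined(__riscv_zvbb)
    /* Zvbb available: vrev8.v byte-reverses each element,
       giving the UTF-16 endianness swap in one instruction. */
#else
    /* Fallback: synthesize the 16-bit byte swap from shifts,
       e.g. vsll and vsrl by 8 combined with vor. */
#endif
```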
- Can/should we assume fast vrgather and vcompress?
RVV has two permutation instructions that currently vary widely in performance between processors:
vcompress.vm:

|         | VLEN | e8m1 | e8m2 | e8m4 | e8m8  |
|---------|------|------|------|------|-------|
| c906    | 128  | 4    | 10   | 32   | 136   |
| c908    | 128  | 4    | 10   | 32   | 139.4 |
| c920    | 128  | 0.5  | 2.4  | 5.4  | 20.0  |
| bobcat* | 256  | 32   | 64   | 132  | 260   |
| x280*   | 512  | 65   | 129  | 257  | 513   |
vrgather.vv:

|         | VLEN | e8m1 | e8m2 | e8m4 | e8m8  |
|---------|------|------|------|------|-------|
| c906    | 128  | 4    | 16   | 64   | 256   |
| c908    | 128  | 4    | 16   | 64.9 | 261.1 |
| c920    | 128  | 0.5  | 2.4  | 8.0  | 32.0  |
| bobcat* | 256  | 68   | 132  | 260  | 516   |
| x280*   | 512  | 65   | 129  | 257  | 513   |
*bobcat: note that this is an open-source proof-of-concept core, and the authors explicitly stated that they didn't optimize the permutation instructions.
*x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather fast path for vl <= 256. I think they didn't have much incentive to make this fast, as the x280 mostly targets AI workloads.
My code currently uses e8m1 vrgather and e8m2 vcompress, which works great on the C9xx cores, but not so great on the others. I suspect, however, that we'll see future desktop cores implement fast vcompress and at least fast LMUL=1 vrgather.
For one, vcompress implementations can be scaled up almost linearly with vector length, which doesn't seem to be true for vrgather without exploding the gate count (although admittedly I don't know much about hardware design). Secondly, vrgather used for 4-bit LUTs and in-lane shuffles will be among the most common operations, so vendors will need to optimize for those.
For now, I wouldn't add gather-free implementations and performance measurements, but that might become necessary in the future if I'm wrong about this.
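As a scalar illustration of why the 4-bit LUT pattern matters (generic UTF-8 byte classification, not my actual kernel): a 16-entry table indexed by the high nibble is exactly what a single vrgather.vv computes per lane.

```c
#include <stdint.h>

/* Classify a UTF-8 byte by its high nibble via a 16-entry table.
 * In RVV, vrgather.vv performs this lookup for a whole vector at once,
 * using each lane's nibble as the gather index.
 * Classes (labels are mine): 0 = ASCII, 1 = continuation byte,
 * 2 = 2-byte lead, 3 = 3-byte lead, 4 = 4-byte lead. */
static const uint8_t kNibbleClass[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, /* 0x00-0x7F: ASCII */
    1, 1, 1, 1,             /* 0x80-0xBF: continuation */
    2, 2,                   /* 0xC0-0xDF: 2-byte lead */
    3,                      /* 0xE0-0xEF: 3-byte lead */
    4                       /* 0xF0-0xFF: 4-byte lead */
};

static uint8_t utf8_byte_class(uint8_t b) {
    return kNibbleClass[b >> 4];
}
```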
Benchmarks
Processors:
- C908: in-order, at 1.6 GHz, supports RVV 1.0 with VLEN=128
- C920: out-of-order, dual-issue, at 2 GHz, supports RVV 0.7.1 with VLEN=128
I needed to manually convert the assembly to RVV 0.7.1, which increased the code size by about 20 instructions. I've yet to do the conversion for the utf16_to_utf8 code, so there aren't any c920 results for that below.
Implementations:
- utf8_to_utf32/utf8_to_utf16: fast paths for 1-byte, 1/2-byte, 1/2/3-byte, average > 2 bytes, and the general case
Emoji-Lipsum could probably be artificially sped up by an all-4-byte case, but I don't think that's a realistic case to optimize for, so I left it out.
- utf16_to_utf8: fast path for 1-byte output; the 1/2-byte-output path consumes everything until a 3/4-byte output is encountered, which is converted with scalar code until a 1/2-byte output is reached.
I plan on adding a vectorized 1/2/3 path, and maybe a 1/2/3/4 one, if I can figure it out.
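The scalar stretches described above boil down to standard per-code-point conversion. A self-contained sketch (assumes valid, complete input; the function name and signature are mine, not the actual code) makes the 1/2/3/4-byte output classes explicit:

```c
#include <stdint.h>
#include <stddef.h>

/* Convert one UTF-16 scalar value (1 or 2 units, assumed valid) to UTF-8.
 * Returns the number of UTF-8 bytes written (1-4); *consumed receives the
 * number of UTF-16 units read (2 for a surrogate pair). */
static size_t utf16_to_utf8_one(const uint16_t *in, size_t *consumed,
                                uint8_t *out) {
    uint32_t cp = in[0];
    *consumed = 1;
    if (cp >= 0xD800 && cp <= 0xDBFF) { /* high surrogate: combine pair */
        cp = 0x10000 + ((cp - 0xD800) << 10) + (in[1] - 0xDC00);
        *consumed = 2;
    }
    if (cp < 0x80) { /* 1-byte output */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) { /* 2-byte output */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) { /* 3-byte output */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 4-byte output (code points reached via surrogate pairs) */
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```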
Metric:
b/c is "input bytes processed"/cycle.
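The speedup column is just the ratio of the two throughputs; a trivial helper (the name is mine) reproduces e.g. the first Latin-Lipsum row below:

```c
/* Speedup of the vector routine over the scalar one,
 * given both throughputs in input bytes per cycle (b/c). */
static double speedup(double rvv_bpc, double scalar_bpc) {
    return rvv_bpc / scalar_bpc;
}
```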
c908 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1292010 b/c rvv 0.7918574 b/c speedup: 6.1288759x
wm/english.utf8.txt scalar: 0.1107906 b/c rvv 0.6070963 b/c speedup: 5.4796718x
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0328398 b/c rvv 0.1568164 b/c speedup: 4.7751939x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0333284 b/c rvv 0.1573165 b/c speedup: 4.7201892x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0332853 b/c rvv 0.1568517 b/c speedup: 4.7123306x
wm/arabic.utf8.txt scalar: 0.0481720 b/c rvv 0.2215483 b/c speedup: 4.5991074x
wm/russian.utf8.txt scalar: 0.0455210 b/c rvv 0.2010354 b/c speedup: 4.4163192x
wm/greek.utf8.txt scalar: 0.0483209 b/c rvv 0.2132376 b/c speedup: 4.4129416x
wm/hebrew.utf8.txt scalar: 0.0436073 b/c rvv 0.1914899 b/c speedup: 4.3912301x
wm/turkish.utf8.txt scalar: 0.0549728 b/c rvv 0.2392147 b/c speedup: 4.3515041x
wm/czech.utf8.txt scalar: 0.0496503 b/c rvv 0.2124387 b/c speedup: 4.2786960x
wm/persan.utf8.txt scalar: 0.0480962 b/c rvv 0.1994732 b/c speedup: 4.1473762x
wm/vietnamese.utf8.txt scalar: 0.0425005 b/c rvv 0.1761435 b/c speedup: 4.1445023x
wm/french.utf8.txt scalar: 0.0676433 b/c rvv 0.2610237 b/c speedup: 3.8588245x
wm/german.utf8.txt scalar: 0.0817605 b/c rvv 0.3151995 b/c speedup: 3.8551556x
wm/esperanto.utf8.txt scalar: 0.0805715 b/c rvv 0.3071995 b/c speedup: 3.8127556x
wm/portuguese.utf8.txt scalar: 0.0722839 b/c rvv 0.2748710 b/c speedup: 3.8026557x
wm/korean.utf8.txt scalar: 0.0496705 b/c rvv 0.1627952 b/c speedup: 3.2774992x
wm/hindi.utf8.txt scalar: 0.0539742 b/c rvv 0.1739320 b/c speedup: 3.2225014x
wm/chinese.utf8.txt scalar: 0.0539208 b/c rvv 0.1691128 b/c speedup: 3.1363154x
wm/japanese.utf8.txt scalar: 0.0536440 b/c rvv 0.1681587 b/c speedup: 3.1347166x
wm/thai.utf8.txt scalar: 0.0576478 b/c rvv 0.1801362 b/c speedup: 3.1247700x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0394845 b/c rvv 0.1222531 b/c speedup: 3.0962288x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0421478 b/c rvv 0.1226263 b/c speedup: 2.9094319x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0448010 b/c rvv 0.1226832 b/c speedup: 2.7383986x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0456330 b/c rvv 0.1225884 b/c speedup: 2.6863942x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0558446 b/c rvv 0.1189647 b/c speedup: 2.1302799x
c908 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1462973 b/c rvv 1.0275230 b/c speedup: 7.0235252x
wm/english.utf8.txt scalar: 0.1275831 b/c rvv 0.7338758 b/c speedup: 5.7521362x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0330693 b/c rvv 0.1675394 b/c speedup: 5.0663088x
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0331370 b/c rvv 0.1676699 b/c speedup: 5.0598918x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0331387 b/c rvv 0.1674591 b/c speedup: 5.0532761x
wm/arabic.utf8.txt scalar: 0.0497569 b/c rvv 0.2353216 b/c speedup: 4.7294242x
wm/greek.utf8.txt scalar: 0.0497033 b/c rvv 0.2285679 b/c speedup: 4.5986446x
wm/russian.utf8.txt scalar: 0.0466324 b/c rvv 0.2121076 b/c speedup: 4.5484982x
wm/hebrew.utf8.txt scalar: 0.0448840 b/c rvv 0.2028331 b/c speedup: 4.5190476x
wm/turkish.utf8.txt scalar: 0.0587339 b/c rvv 0.2584671 b/c speedup: 4.4006435x
wm/czech.utf8.txt scalar: 0.0528302 b/c rvv 0.2278120 b/c speedup: 4.3121470x
wm/persan.utf8.txt scalar: 0.0496008 b/c rvv 0.2126346 b/c speedup: 4.2869173x
wm/vietnamese.utf8.txt scalar: 0.0447605 b/c rvv 0.1853099 b/c speedup: 4.1400298x
wm/esperanto.utf8.txt scalar: 0.0881123 b/c rvv 0.3412668 b/c speedup: 3.8730881x
wm/german.utf8.txt scalar: 0.0905761 b/c rvv 0.3502627 b/c speedup: 3.8670545x
wm/french.utf8.txt scalar: 0.0737802 b/c rvv 0.2843437 b/c speedup: 3.8539292x
wm/portuguese.utf8.txt scalar: 0.0791921 b/c rvv 0.3004463 b/c speedup: 3.7938890x
wm/korean.utf8.txt scalar: 0.0522578 b/c rvv 0.1727464 b/c speedup: 3.3056579x
wm/hindi.utf8.txt scalar: 0.0563662 b/c rvv 0.1848405 b/c speedup: 3.2792800x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0399847 b/c rvv 0.1300402 b/c speedup: 3.2522438x
wm/thai.utf8.txt scalar: 0.0600823 b/c rvv 0.1928356 b/c speedup: 3.2095216x
wm/japanese.utf8.txt scalar: 0.0560357 b/c rvv 0.1775926 b/c speedup: 3.1692714x
wm/chinese.utf8.txt scalar: 0.0565430 b/c rvv 0.1788198 b/c speedup: 3.1625422x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0424079 b/c rvv 0.1302720 b/c speedup: 3.0718763x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0448987 b/c rvv 0.1302059 b/c speedup: 2.8999905x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0457254 b/c rvv 0.1301323 b/c speedup: 2.8459495x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0522199 b/c rvv 0.0831130 b/c speedup: 1.5915968x
c908 utf16_to_utf8
lipsum/Russian-Lipsum.utf16.txt scalar: 0.0445853 b/c rvv: 0.2163190 b/c speedup: 4.8517938x
lipsum/Arabic-Lipsum.utf16.txt scalar: 0.0449275 b/c rvv: 0.2153480 b/c speedup: 4.7932246x
lipsum/Hebrew-Lipsum.utf16.txt scalar: 0.0448793 b/c rvv: 0.2136721 b/c speedup: 4.7610340x
lipsum/Latin-Lipsum.utf16.txt scalar: 0.1028043 b/c rvv: 0.4050746 b/c speedup: 3.9402459x
wm/greek.utf16.txt scalar: 0.0718830 b/c rvv: 0.2716538 b/c speedup: 3.7791077x
wm/russian.utf16.txt scalar: 0.0688957 b/c rvv: 0.2488844 b/c speedup: 3.6124804x
wm/arabic.utf16.txt scalar: 0.0721544 b/c rvv: 0.2600413 b/c speedup: 3.6039526x
wm/hebrew.utf16.txt scalar: 0.0682180 b/c rvv: 0.2447910 b/c speedup: 3.5883632x
wm/esperanto.utf16.txt scalar: 0.0963212 b/c rvv: 0.3292119 b/c speedup: 3.4178546x
wm/persan.utf16.txt scalar: 0.0726062 b/c rvv: 0.2366135 b/c speedup: 3.2588582x
wm/english.utf16.txt scalar: 0.1015669 b/c rvv: 0.3270337 b/c speedup: 3.2198835x
wm/german.utf16.txt scalar: 0.0975311 b/c rvv: 0.3023158 b/c speedup: 3.0996865x
wm/portuguese.utf16.txt scalar: 0.0962536 b/c rvv: 0.2863991 b/c speedup: 2.9754628x
wm/french.utf16.txt scalar: 0.0952526 b/c rvv: 0.2773457 b/c speedup: 2.9116858x
wm/czech.utf16.txt scalar: 0.0872352 b/c rvv: 0.2453764 b/c speedup: 2.8128122x
wm/turkish.utf16.txt scalar: 0.0894998 b/c rvv: 0.2483814 b/c speedup: 2.7752177x
wm/thai.utf16.txt scalar: 0.0742528 b/c rvv: 0.1800184 b/c speedup: 2.4243965x
wm/japanese.utf16.txt scalar: 0.0750324 b/c rvv: 0.1785757 b/c speedup: 2.3799792x
lipsum/Chinese-Lipsum.utf16.txt scalar: 0.0422231 b/c rvv: 0.0993063 b/c speedup: 2.3519384x
wm/vietnamese.utf16.txt scalar: 0.0796325 b/c rvv: 0.1822895 b/c speedup: 2.2891332x
wm/chinese.utf16.txt scalar: 0.0781047 b/c rvv: 0.1772665 b/c speedup: 2.2695999x
lipsum/Japanese-Lipsum.utf16.txt scalar: 0.0424322 b/c rvv: 0.0920647 b/c speedup: 2.1696894x
wm/hindi.utf16.txt scalar: 0.0716071 b/c rvv: 0.1415199 b/c speedup: 1.9763381x
wm/korean.utf16.txt scalar: 0.0742212 b/c rvv: 0.1447335 b/c speedup: 1.9500284x
lipsum/Emoji-Lipsum.utf16.txt scalar: 0.0560671 b/c rvv: 0.1017256 b/c speedup: 1.8143532x
lipsum/Hindi-Lipsum.utf16.txt scalar: 0.0423512 b/c rvv: 0.0653709 b/c speedup: 1.5435430x
lipsum/Korean-Lipsum.utf16.txt scalar: 0.0431370 b/c rvv: 0.0527462 b/c speedup: 1.2227593x
c920 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt scalar: 0.1983016 b/c rvv 1.6172459 b/c speedup: 8.1554844x
wm/english.utf8.txt scalar: 0.1787050 b/c rvv 0.9249580 b/c speedup: 5.1758932x
wm/greek.utf8.txt scalar: 0.0720639 b/c rvv 0.3620777 b/c speedup: 5.0243981x
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0489671 b/c rvv 0.2433533 b/c speedup: 4.9697240x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0484946 b/c rvv 0.2363269 b/c speedup: 4.8732567x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0501499 b/c rvv 0.2380047 b/c speedup: 4.7458662x
wm/czech.utf8.txt scalar: 0.0720776 b/c rvv 0.3390725 b/c speedup: 4.7042668x
wm/hebrew.utf8.txt scalar: 0.0636149 b/c rvv 0.2747983 b/c speedup: 4.3197121x
wm/turkish.utf8.txt scalar: 0.0806716 b/c rvv 0.3274432 b/c speedup: 4.0589630x
wm/esperanto.utf8.txt scalar: 0.1170809 b/c rvv 0.4577151 b/c speedup: 3.9093893x
wm/arabic.utf8.txt scalar: 0.0714500 b/c rvv 0.2772353 b/c speedup: 3.8801297x
wm/persan.utf8.txt scalar: 0.0704970 b/c rvv 0.2690839 b/c speedup: 3.8169557x
wm/russian.utf8.txt scalar: 0.0683159 b/c rvv 0.2570850 b/c speedup: 3.7631801x
wm/german.utf8.txt scalar: 0.1248275 b/c rvv 0.4611884 b/c speedup: 3.6946062x
wm/vietnamese.utf8.txt scalar: 0.0612176 b/c rvv 0.2055558 b/c speedup: 3.3577844x
wm/korean.utf8.txt scalar: 0.0727457 b/c rvv 0.2360132 b/c speedup: 3.2443591x
wm/portuguese.utf8.txt scalar: 0.1077901 b/c rvv 0.3433450 b/c speedup: 3.1853110x
wm/japanese.utf8.txt scalar: 0.0822007 b/c rvv 0.2396279 b/c speedup: 2.9151562x
wm/french.utf8.txt scalar: 0.0996538 b/c rvv 0.2892530 b/c speedup: 2.9025785x
wm/hindi.utf8.txt scalar: 0.0828941 b/c rvv 0.2307050 b/c speedup: 2.7831279x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0581741 b/c rvv 0.1554148 b/c speedup: 2.6715442x
wm/chinese.utf8.txt scalar: 0.0817867 b/c rvv 0.2103001 b/c speedup: 2.5713221x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0674511 b/c rvv 0.1558572 b/c speedup: 2.3106693x
wm/thai.utf8.txt scalar: 0.0933146 b/c rvv 0.2127180 b/c speedup: 2.2795790x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0739905 b/c rvv 0.1558918 b/c speedup: 2.1069166x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0762008 b/c rvv 0.1563651 b/c speedup: 2.0520142x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0956014 b/c rvv 0.1901396 b/c speedup: 1.9888773x
c920 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt scalar: 0.2109710 b/c rvv 2.2189945 b/c speedup: 10.518002x
wm/english.utf8.txt scalar: 0.1827197 b/c rvv 1.4220564 b/c speedup: 7.7827185x
wm/greek.utf8.txt scalar: 0.0755349 b/c rvv 0.3727045 b/c speedup: 4.9341973x
wm/czech.utf8.txt scalar: 0.0755292 b/c rvv 0.3633922 b/c speedup: 4.8112804x
lipsum/Hebrew-Lipsum.utf8.txt scalar: 0.0494750 b/c rvv 0.2242518 b/c speedup: 4.5326243x
lipsum/Russian-Lipsum.utf8.txt scalar: 0.0509461 b/c rvv 0.2299110 b/c speedup: 4.5128233x
lipsum/Arabic-Lipsum.utf8.txt scalar: 0.0497147 b/c rvv 0.2216386 b/c speedup: 4.4582093x
wm/arabic.utf8.txt scalar: 0.0744813 b/c rvv 0.3212577 b/c speedup: 4.3132627x
wm/hebrew.utf8.txt scalar: 0.0661844 b/c rvv 0.2759717 b/c speedup: 4.1697346x
wm/esperanto.utf8.txt scalar: 0.1263762 b/c rvv 0.5216554 b/c speedup: 4.1277957x
wm/german.utf8.txt scalar: 0.1296940 b/c rvv 0.5333178 b/c speedup: 4.1121239x
wm/turkish.utf8.txt scalar: 0.0847053 b/c rvv 0.3365346 b/c speedup: 3.9730052x
wm/russian.utf8.txt scalar: 0.0708612 b/c rvv 0.2807201 b/c speedup: 3.9615460x
wm/portuguese.utf8.txt scalar: 0.1171257 b/c rvv 0.4517186 b/c speedup: 3.8566995x
wm/persan.utf8.txt scalar: 0.0742834 b/c rvv 0.2688770 b/c speedup: 3.6196109x
wm/vietnamese.utf8.txt scalar: 0.0642606 b/c rvv 0.2283996 b/c speedup: 3.5542678x
wm/french.utf8.txt scalar: 0.1070867 b/c rvv 0.3670641 b/c speedup: 3.4277275x
wm/korean.utf8.txt scalar: 0.0765637 b/c rvv 0.2563889 b/c speedup: 3.3486993x
wm/hindi.utf8.txt scalar: 0.0857724 b/c rvv 0.2719399 b/c speedup: 3.1704799x
wm/japanese.utf8.txt scalar: 0.0866018 b/c rvv 0.2630760 b/c speedup: 3.0377647x
lipsum/Korean-Lipsum.utf8.txt scalar: 0.0596959 b/c rvv 0.1592920 b/c speedup: 2.6683889x
wm/chinese.utf8.txt scalar: 0.0855446 b/c rvv 0.2223617 b/c speedup: 2.5993654x
wm/thai.utf8.txt scalar: 0.0963939 b/c rvv 0.2377943 b/c speedup: 2.4669006x
lipsum/Hindi-Lipsum.utf8.txt scalar: 0.0700269 b/c rvv 0.1601953 b/c speedup: 2.2876225x
lipsum/Japanese-Lipsum.utf8.txt scalar: 0.0772785 b/c rvv 0.1603533 b/c speedup: 2.0750034x
lipsum/Chinese-Lipsum.utf8.txt scalar: 0.0797070 b/c rvv 0.1608991 b/c speedup: 2.0186326x
lipsum/Emoji-Lipsum.utf8.txt scalar: 0.0923569 b/c rvv 0.1242158 b/c speedup: 1.3449541x