icelake like backend for RVV (RISC-V vector extension) #362

@camel-cdr

Hi, I've been working on RVV-native Unicode conversion routines, and now have optimized, validating utf8->utf32 and utf8->utf16 conversions and a partial utf16->utf8 conversion working (see the last part for benchmarks).
I'd like to upstream this to simdutf as a custom backend, similar to how the icelake one works.

For testing, I generate random valid utf32 input, convert it to the input format, apply n random bit flips to it, and validate the output against the simdutf scalar implementation. Ideally I'd also like to use coverage-guided fuzzing, but I haven't been able to get fuzzing working on RISC-V yet.
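The bit-flip testing described above can be sketched roughly as follows. This is not the actual test harness: the `decoder_fn` signature, the `SIZE_MAX` error convention, the xorshift PRNG and all function names are assumptions made for illustration, and the scalar decoder here merely stands in for the simdutf scalar implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Small xorshift32 PRNG so that runs are reproducible. */
static uint32_t rng_state = 0x12345678u;
static uint32_t next_random(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Encode one Unicode scalar value as UTF-8; returns the byte count. */
static size_t encode_utf8(uint32_t c, uint8_t *out) {
    if (c < 0x80) { out[0] = (uint8_t)c; return 1; }
    if (c < 0x800) {
        out[0] = (uint8_t)(0xC0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (c >> 12));
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (c >> 18));
    out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (c & 0x3F));
    return 4;
}

/* Hypothetical decoder interface: returns the number of code points
   written, or SIZE_MAX if the input is not valid UTF-8. */
typedef size_t (*decoder_fn)(const uint8_t *src, size_t len, uint32_t *dst);

/* Scalar validating reference decoder (stand-in for the simdutf one). */
static size_t utf8_to_utf32_scalar(const uint8_t *s, size_t n, uint32_t *out) {
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t c = s[i], min = 0;
        size_t extra = 0;
        if (c < 0x80) { /* ASCII */ }
        else if ((c & 0xE0) == 0xC0) { extra = 1; min = 0x80;    c &= 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   c &= 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; c &= 0x07; }
        else return SIZE_MAX;                     /* stray continuation or 0xF8+ */
        if (extra > n - i - 1) return SIZE_MAX;   /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return SIZE_MAX;
            c = (c << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlong forms, out-of-range values and surrogates */
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return SIZE_MAX;
        out[o++] = c;
        i += extra + 1;
    }
    return o;
}

/* Runs `rounds` random inputs with `flips` bit flips each and returns
   how often the two decoders disagreed. */
static size_t fuzz_compare(decoder_fn reference, decoder_fn candidate,
                           size_t rounds, size_t flips) {
    enum { N = 64 };                  /* code points per generated input */
    uint8_t in[4 * N];
    uint32_t a[4 * N], b[4 * N];
    size_t mismatches = 0;
    for (size_t r = 0; r < rounds; r++) {
        size_t len = 0;
        for (size_t i = 0; i < N; i++) {
            uint32_t c;
            do c = next_random() % 0x110000u; while (c >= 0xD800 && c <= 0xDFFF);
            len += encode_utf8(c, in + len);
        }
        for (size_t f = 0; f < flips; f++) {
            uint32_t bit = next_random() % (uint32_t)(len * 8);
            in[bit / 8] ^= (uint8_t)(1u << (bit % 8));
        }
        size_t na = reference(in, len, a), nb = candidate(in, len, b);
        if (na != nb || (na != SIZE_MAX && memcmp(a, b, na * sizeof *a) != 0))
            mismatches++;
    }
    return mismatches;
}
```

A vectorized routine with the same (assumed) signature would be passed as `candidate`; any non-zero return value means the two implementations disagreed on at least one mutated input.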

The code will be published to my RVV benchmark soon (it still needs some cleanup), hopefully with an associated article/blog post.

Edit: Here is the code: utf8_to_utf32/utf8_to_utf16, utf16_to_utf8

There are some open questions, though.

  1. What does one need to do to add a new architecture to simdutf: which files need to be touched, and what tools are there?

  2. Should we use the explicit intrinsics or the overloaded intrinsics?

    The overloaded intrinsics are IMO more readable and easier to refactor.
    From what I can tell, both are "mandated" by the RVV intrinsics spec. Clang has supported the overloaded ones for as long as it has supported RVV intrinsics (clang 16 and above); gcc currently doesn't support them, but it looks like upstream is working on it. I expect that gcc will have support by the time RVV 1.0 hardware becomes more widely available. There is currently only one such board, the Kendryte K230, which is slowly being rolled out in batches.

  3. Which extensions should we target?

    I think we should orient ourselves by the RVA profiles and only support the standard V extension, i.e. 8- to 64-bit wide elements with VLEN >= 128 bits, and not things like Zve64x.
    Supporting Zvbb would also be quite useful, as it has an endianness swap instruction, but I think we should make this optional and detect support from the compiler settings.

  4. Can we assume fast vrgather and vcompress?

    RVV has two permutation instructions that currently vary widely in performance between processors:

vcompress.vm:

core      VLEN   e8m1   e8m2   e8m4   e8m8
c906       128      4     10     32    136
c908       128      4     10     32    139.4
c920       128      0.5    2.4    5.4   20.0
bobcat*    256     32     64    132    260
x280*      512     65    129    257    513

vrgather.vv:

core      VLEN   e8m1   e8m2   e8m4   e8m8
c906       128      4     16     64    256
c908       128      4     16     64.9  261.1
c920       128      0.5    2.4    8.0   32.0
bobcat*    256     68    132    260    516
x280*      512     65    129    257    513

    *bobcat: note that this is an open source proof-of-concept core, and they explicitly stated that they didn't optimize the permutation instructions

    *x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a vrgather fast path for vl<=256. I think they didn't have much incentive to make this fast, as the x280 mostly targets AI.

    My code currently uses e8m1 vrgather and e8m2 vcompress, which works great on the C9xx cores, but not so great on the others. I suspect, however, that we'll see future desktop cores implement fast vcompress and at least fast LMUL=1 vrgather.

    For one, vcompress implementations can be scaled up almost linearly with vector length, which doesn't seem to be true for vrgather without exploding the gate count (although admittedly I don't know much about hardware design). Secondly, using vrgather for 4-bit LUTs and in-lane shuffles will be the most common use, so vendors will need to optimize for those.

    For now, I wouldn't add gather-free implementations and performance measurements, but that might become necessary in the future if I'm wrong about this.
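On question 3, detecting Zvbb from the compiler settings could look roughly like the sketch below. It assumes the toolchain defines the extension test macro `__riscv_zvbb` when Zvbb is enabled (for example through `-march`), which is my understanding of the RISC-V C API conventions. `RVV_HAS_ZVBB` and the function names are made up for illustration, and the Zvbb branch is only a placeholder so that the snippet stays runnable on any target.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the compiler defines __riscv_zvbb when Zvbb is enabled. */
#if defined(__riscv_zvbb)
#define RVV_HAS_ZVBB 1
#else
#define RVV_HAS_ZVBB 0
#endif

/* Portable byte swap of UTF-16 code units using shifts and ors. */
static void swap_bytes_utf16_fallback(const uint16_t *src, size_t n, uint16_t *dst) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint16_t)((src[i] << 8) | (src[i] >> 8));
}

/* Compile-time dispatch between the two variants. */
static void swap_bytes_utf16(const uint16_t *src, size_t n, uint16_t *dst) {
#if RVV_HAS_ZVBB
    /* Placeholder: a Zvbb build would use its vrev8.v-based routine here.
       This sketch calls the fallback so that it compiles everywhere. */
    swap_bytes_utf16_fallback(src, n, dst);
#else
    swap_bytes_utf16_fallback(src, n, dst);
#endif
}
```

Keeping the decision in a single macro means the rest of the backend never has to test `__riscv_zvbb` directly, and builds without Zvbb lose nothing but the faster swap.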

Benchmarks

Processors:

Implementations:

  • utf8_to_utf32/utf8_to_utf16: fast path for 1 byte, 1/2 byte, 1/2/3 byte, average > 2 bytes, general case

    Emoji-Lipsum could probably be artificially sped up by an all 4 byte case, but I don't think that is a realistic case to optimize for, so I left it out.

  • utf16_to_utf8: a fast path for 1 byte output, and a 1/2 byte output path that consumes everything up to the first 3/4 byte output, which is then converted with scalar code until a 1/2 byte output is reached again.

    I plan on adding a vectorized 1/2/3 path, and maybe a 1/2/3/4 one, if I can figure it out.
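To illustrate the path selection for utf16_to_utf8 described above, here is a scalar model of how a block of UTF-16 code units could be classified by its largest UTF-8 output size. It is not taken from the actual implementation; the enum, the function name and the OR-reduction approach are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    PATH_1_BYTE,    /* every unit < 0x80: pure ASCII output          */
    PATH_1_2_BYTE,  /* every unit < 0x800: 1 or 2 output bytes each  */
    PATH_SCALAR     /* some unit needs 3/4 bytes: scalar fallback    */
} utf16_path;

/* ORs all code units together. Because 0x80 and 0x800 are powers of
   two, the OR is below the threshold exactly when every unit is. */
static utf16_path classify_utf16_block(const uint16_t *src, size_t n) {
    uint16_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= src[i];
    if (acc < 0x80)  return PATH_1_BYTE;
    if (acc < 0x800) return PATH_1_2_BYTE;
    return PATH_SCALAR;
}
```

In vector code the same test would presumably be a single reduction over the loaded block, so the check adds little on top of the conversion itself.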

Metric:

  • b/c is "input bytes processed"/cycle.
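As a small worked example of the metric, the two helpers below compute b/c and the speedup column. The function names are made up for illustration; the speedup is simply the rvv throughput divided by the scalar throughput.

```c
#include <stdint.h>

/* Throughput: input bytes processed per cycle. */
static double bytes_per_cycle(uint64_t input_bytes, uint64_t cycles) {
    return (double)input_bytes / (double)cycles;
}

/* Speedup of the rvv routine over the scalar one, given both in b/c. */
static double speedup(double scalar_bpc, double rvv_bpc) {
    return rvv_bpc / scalar_bpc;
}
```

For instance, the c908 utf8_to_utf32 Latin-Lipsum row has 0.7918574 / 0.1292010, which is about 6.13x.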
c908 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1292010 b/c  rvv 0.7918574 b/c  speedup: 6.1288759x
wm/english.utf8.txt              scalar: 0.1107906 b/c  rvv 0.6070963 b/c  speedup: 5.4796718x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0328398 b/c  rvv 0.1568164 b/c  speedup: 4.7751939x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0333284 b/c  rvv 0.1573165 b/c  speedup: 4.7201892x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0332853 b/c  rvv 0.1568517 b/c  speedup: 4.7123306x
wm/arabic.utf8.txt               scalar: 0.0481720 b/c  rvv 0.2215483 b/c  speedup: 4.5991074x
wm/russian.utf8.txt              scalar: 0.0455210 b/c  rvv 0.2010354 b/c  speedup: 4.4163192x
wm/greek.utf8.txt                scalar: 0.0483209 b/c  rvv 0.2132376 b/c  speedup: 4.4129416x
wm/hebrew.utf8.txt               scalar: 0.0436073 b/c  rvv 0.1914899 b/c  speedup: 4.3912301x
wm/turkish.utf8.txt              scalar: 0.0549728 b/c  rvv 0.2392147 b/c  speedup: 4.3515041x
wm/czech.utf8.txt                scalar: 0.0496503 b/c  rvv 0.2124387 b/c  speedup: 4.2786960x
wm/persan.utf8.txt               scalar: 0.0480962 b/c  rvv 0.1994732 b/c  speedup: 4.1473762x
wm/vietnamese.utf8.txt           scalar: 0.0425005 b/c  rvv 0.1761435 b/c  speedup: 4.1445023x
wm/french.utf8.txt               scalar: 0.0676433 b/c  rvv 0.2610237 b/c  speedup: 3.8588245x
wm/german.utf8.txt               scalar: 0.0817605 b/c  rvv 0.3151995 b/c  speedup: 3.8551556x
wm/esperanto.utf8.txt            scalar: 0.0805715 b/c  rvv 0.3071995 b/c  speedup: 3.8127556x
wm/portuguese.utf8.txt           scalar: 0.0722839 b/c  rvv 0.2748710 b/c  speedup: 3.8026557x
wm/korean.utf8.txt               scalar: 0.0496705 b/c  rvv 0.1627952 b/c  speedup: 3.2774992x
wm/hindi.utf8.txt                scalar: 0.0539742 b/c  rvv 0.1739320 b/c  speedup: 3.2225014x
wm/chinese.utf8.txt              scalar: 0.0539208 b/c  rvv 0.1691128 b/c  speedup: 3.1363154x
wm/japanese.utf8.txt             scalar: 0.0536440 b/c  rvv 0.1681587 b/c  speedup: 3.1347166x
wm/thai.utf8.txt                 scalar: 0.0576478 b/c  rvv 0.1801362 b/c  speedup: 3.1247700x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0394845 b/c  rvv 0.1222531 b/c  speedup: 3.0962288x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0421478 b/c  rvv 0.1226263 b/c  speedup: 2.9094319x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0448010 b/c  rvv 0.1226832 b/c  speedup: 2.7383986x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0456330 b/c  rvv 0.1225884 b/c  speedup: 2.6863942x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0558446 b/c  rvv 0.1189647 b/c  speedup: 2.1302799x
c908 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1462973 b/c  rvv 1.0275230 b/c  speedup: 7.0235252x
wm/english.utf8.txt              scalar: 0.1275831 b/c  rvv 0.7338758 b/c  speedup: 5.7521362x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0330693 b/c  rvv 0.1675394 b/c  speedup: 5.0663088x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0331370 b/c  rvv 0.1676699 b/c  speedup: 5.0598918x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0331387 b/c  rvv 0.1674591 b/c  speedup: 5.0532761x
wm/arabic.utf8.txt               scalar: 0.0497569 b/c  rvv 0.2353216 b/c  speedup: 4.7294242x
wm/greek.utf8.txt                scalar: 0.0497033 b/c  rvv 0.2285679 b/c  speedup: 4.5986446x
wm/russian.utf8.txt              scalar: 0.0466324 b/c  rvv 0.2121076 b/c  speedup: 4.5484982x
wm/hebrew.utf8.txt               scalar: 0.0448840 b/c  rvv 0.2028331 b/c  speedup: 4.5190476x
wm/turkish.utf8.txt              scalar: 0.0587339 b/c  rvv 0.2584671 b/c  speedup: 4.4006435x
wm/czech.utf8.txt                scalar: 0.0528302 b/c  rvv 0.2278120 b/c  speedup: 4.3121470x
wm/persan.utf8.txt               scalar: 0.0496008 b/c  rvv 0.2126346 b/c  speedup: 4.2869173x
wm/vietnamese.utf8.txt           scalar: 0.0447605 b/c  rvv 0.1853099 b/c  speedup: 4.1400298x
wm/esperanto.utf8.txt            scalar: 0.0881123 b/c  rvv 0.3412668 b/c  speedup: 3.8730881x
wm/german.utf8.txt               scalar: 0.0905761 b/c  rvv 0.3502627 b/c  speedup: 3.8670545x
wm/french.utf8.txt               scalar: 0.0737802 b/c  rvv 0.2843437 b/c  speedup: 3.8539292x
wm/portuguese.utf8.txt           scalar: 0.0791921 b/c  rvv 0.3004463 b/c  speedup: 3.7938890x
wm/korean.utf8.txt               scalar: 0.0522578 b/c  rvv 0.1727464 b/c  speedup: 3.3056579x
wm/hindi.utf8.txt                scalar: 0.0563662 b/c  rvv 0.1848405 b/c  speedup: 3.2792800x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0399847 b/c  rvv 0.1300402 b/c  speedup: 3.2522438x
wm/thai.utf8.txt                 scalar: 0.0600823 b/c  rvv 0.1928356 b/c  speedup: 3.2095216x
wm/japanese.utf8.txt             scalar: 0.0560357 b/c  rvv 0.1775926 b/c  speedup: 3.1692714x
wm/chinese.utf8.txt              scalar: 0.0565430 b/c  rvv 0.1788198 b/c  speedup: 3.1625422x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0424079 b/c  rvv 0.1302720 b/c  speedup: 3.0718763x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0448987 b/c  rvv 0.1302059 b/c  speedup: 2.8999905x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0457254 b/c  rvv 0.1301323 b/c  speedup: 2.8459495x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0522199 b/c  rvv 0.0831130 b/c  speedup: 1.5915968x
c908 utf16_to_utf8
lipsum/Russian-Lipsum.utf16.txt  scalar: 0.0445853 b/c  rvv: 0.2163190 b/c  speedup: 4.8517938x
lipsum/Arabic-Lipsum.utf16.txt   scalar: 0.0449275 b/c  rvv: 0.2153480 b/c  speedup: 4.7932246x
lipsum/Hebrew-Lipsum.utf16.txt   scalar: 0.0448793 b/c  rvv: 0.2136721 b/c  speedup: 4.7610340x
lipsum/Latin-Lipsum.utf16.txt    scalar: 0.1028043 b/c  rvv: 0.4050746 b/c  speedup: 3.9402459x
wm/greek.utf16.txt               scalar: 0.0718830 b/c  rvv: 0.2716538 b/c  speedup: 3.7791077x
wm/russian.utf16.txt             scalar: 0.0688957 b/c  rvv: 0.2488844 b/c  speedup: 3.6124804x
wm/arabic.utf16.txt              scalar: 0.0721544 b/c  rvv: 0.2600413 b/c  speedup: 3.6039526x
wm/hebrew.utf16.txt              scalar: 0.0682180 b/c  rvv: 0.2447910 b/c  speedup: 3.5883632x
wm/esperanto.utf16.txt           scalar: 0.0963212 b/c  rvv: 0.3292119 b/c  speedup: 3.4178546x
wm/persan.utf16.txt              scalar: 0.0726062 b/c  rvv: 0.2366135 b/c  speedup: 3.2588582x
wm/english.utf16.txt             scalar: 0.1015669 b/c  rvv: 0.3270337 b/c  speedup: 3.2198835x
wm/german.utf16.txt              scalar: 0.0975311 b/c  rvv: 0.3023158 b/c  speedup: 3.0996865x
wm/portuguese.utf16.txt          scalar: 0.0962536 b/c  rvv: 0.2863991 b/c  speedup: 2.9754628x
wm/french.utf16.txt              scalar: 0.0952526 b/c  rvv: 0.2773457 b/c  speedup: 2.9116858x
wm/czech.utf16.txt               scalar: 0.0872352 b/c  rvv: 0.2453764 b/c  speedup: 2.8128122x
wm/turkish.utf16.txt             scalar: 0.0894998 b/c  rvv: 0.2483814 b/c  speedup: 2.7752177x
wm/thai.utf16.txt                scalar: 0.0742528 b/c  rvv: 0.1800184 b/c  speedup: 2.4243965x
wm/japanese.utf16.txt            scalar: 0.0750324 b/c  rvv: 0.1785757 b/c  speedup: 2.3799792x
lipsum/Chinese-Lipsum.utf16.txt  scalar: 0.0422231 b/c  rvv: 0.0993063 b/c  speedup: 2.3519384x
wm/vietnamese.utf16.txt          scalar: 0.0796325 b/c  rvv: 0.1822895 b/c  speedup: 2.2891332x
wm/chinese.utf16.txt             scalar: 0.0781047 b/c  rvv: 0.1772665 b/c  speedup: 2.2695999x
lipsum/Japanese-Lipsum.utf16.txt scalar: 0.0424322 b/c  rvv: 0.0920647 b/c  speedup: 2.1696894x
wm/hindi.utf16.txt               scalar: 0.0716071 b/c  rvv: 0.1415199 b/c  speedup: 1.9763381x
wm/korean.utf16.txt              scalar: 0.0742212 b/c  rvv: 0.1447335 b/c  speedup: 1.9500284x
lipsum/Emoji-Lipsum.utf16.txt    scalar: 0.0560671 b/c  rvv: 0.1017256 b/c  speedup: 1.8143532x
lipsum/Hindi-Lipsum.utf16.txt    scalar: 0.0423512 b/c  rvv: 0.0653709 b/c  speedup: 1.5435430x
lipsum/Korean-Lipsum.utf16.txt   scalar: 0.0431370 b/c  rvv: 0.0527462 b/c  speedup: 1.2227593x
c920 utf8_to_utf32
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.1983016 b/c  rvv 1.6172459 b/c  speedup: 8.1554844x
wm/english.utf8.txt              scalar: 0.1787050 b/c  rvv 0.9249580 b/c  speedup: 5.1758932x
wm/greek.utf8.txt                scalar: 0.0720639 b/c  rvv 0.3620777 b/c  speedup: 5.0243981x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0489671 b/c  rvv 0.2433533 b/c  speedup: 4.9697240x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0484946 b/c  rvv 0.2363269 b/c  speedup: 4.8732567x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0501499 b/c  rvv 0.2380047 b/c  speedup: 4.7458662x
wm/czech.utf8.txt                scalar: 0.0720776 b/c  rvv 0.3390725 b/c  speedup: 4.7042668x
wm/hebrew.utf8.txt               scalar: 0.0636149 b/c  rvv 0.2747983 b/c  speedup: 4.3197121x
wm/turkish.utf8.txt              scalar: 0.0806716 b/c  rvv 0.3274432 b/c  speedup: 4.0589630x
wm/esperanto.utf8.txt            scalar: 0.1170809 b/c  rvv 0.4577151 b/c  speedup: 3.9093893x
wm/arabic.utf8.txt               scalar: 0.0714500 b/c  rvv 0.2772353 b/c  speedup: 3.8801297x
wm/persan.utf8.txt               scalar: 0.0704970 b/c  rvv 0.2690839 b/c  speedup: 3.8169557x
wm/russian.utf8.txt              scalar: 0.0683159 b/c  rvv 0.2570850 b/c  speedup: 3.7631801x
wm/german.utf8.txt               scalar: 0.1248275 b/c  rvv 0.4611884 b/c  speedup: 3.6946062x
wm/vietnamese.utf8.txt           scalar: 0.0612176 b/c  rvv 0.2055558 b/c  speedup: 3.3577844x
wm/korean.utf8.txt               scalar: 0.0727457 b/c  rvv 0.2360132 b/c  speedup: 3.2443591x
wm/portuguese.utf8.txt           scalar: 0.1077901 b/c  rvv 0.3433450 b/c  speedup: 3.1853110x
wm/japanese.utf8.txt             scalar: 0.0822007 b/c  rvv 0.2396279 b/c  speedup: 2.9151562x
wm/french.utf8.txt               scalar: 0.0996538 b/c  rvv 0.2892530 b/c  speedup: 2.9025785x
wm/hindi.utf8.txt                scalar: 0.0828941 b/c  rvv 0.2307050 b/c  speedup: 2.7831279x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0581741 b/c  rvv 0.1554148 b/c  speedup: 2.6715442x
wm/chinese.utf8.txt              scalar: 0.0817867 b/c  rvv 0.2103001 b/c  speedup: 2.5713221x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0674511 b/c  rvv 0.1558572 b/c  speedup: 2.3106693x
wm/thai.utf8.txt                 scalar: 0.0933146 b/c  rvv 0.2127180 b/c  speedup: 2.2795790x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0739905 b/c  rvv 0.1558918 b/c  speedup: 2.1069166x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0762008 b/c  rvv 0.1563651 b/c  speedup: 2.0520142x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0956014 b/c  rvv 0.1901396 b/c  speedup: 1.9888773x
c920 utf8_to_utf16
lipsum/Latin-Lipsum.utf8.txt     scalar: 0.2109710 b/c  rvv 2.2189945 b/c  speedup: 10.518002x
wm/english.utf8.txt              scalar: 0.1827197 b/c  rvv 1.4220564 b/c  speedup: 7.7827185x
wm/greek.utf8.txt                scalar: 0.0755349 b/c  rvv 0.3727045 b/c  speedup: 4.9341973x
wm/czech.utf8.txt                scalar: 0.0755292 b/c  rvv 0.3633922 b/c  speedup: 4.8112804x
lipsum/Hebrew-Lipsum.utf8.txt    scalar: 0.0494750 b/c  rvv 0.2242518 b/c  speedup: 4.5326243x
lipsum/Russian-Lipsum.utf8.txt   scalar: 0.0509461 b/c  rvv 0.2299110 b/c  speedup: 4.5128233x
lipsum/Arabic-Lipsum.utf8.txt    scalar: 0.0497147 b/c  rvv 0.2216386 b/c  speedup: 4.4582093x
wm/arabic.utf8.txt               scalar: 0.0744813 b/c  rvv 0.3212577 b/c  speedup: 4.3132627x
wm/hebrew.utf8.txt               scalar: 0.0661844 b/c  rvv 0.2759717 b/c  speedup: 4.1697346x
wm/esperanto.utf8.txt            scalar: 0.1263762 b/c  rvv 0.5216554 b/c  speedup: 4.1277957x
wm/german.utf8.txt               scalar: 0.1296940 b/c  rvv 0.5333178 b/c  speedup: 4.1121239x
wm/turkish.utf8.txt              scalar: 0.0847053 b/c  rvv 0.3365346 b/c  speedup: 3.9730052x
wm/russian.utf8.txt              scalar: 0.0708612 b/c  rvv 0.2807201 b/c  speedup: 3.9615460x
wm/portuguese.utf8.txt           scalar: 0.1171257 b/c  rvv 0.4517186 b/c  speedup: 3.8566995x
wm/persan.utf8.txt               scalar: 0.0742834 b/c  rvv 0.2688770 b/c  speedup: 3.6196109x
wm/vietnamese.utf8.txt           scalar: 0.0642606 b/c  rvv 0.2283996 b/c  speedup: 3.5542678x
wm/french.utf8.txt               scalar: 0.1070867 b/c  rvv 0.3670641 b/c  speedup: 3.4277275x
wm/korean.utf8.txt               scalar: 0.0765637 b/c  rvv 0.2563889 b/c  speedup: 3.3486993x
wm/hindi.utf8.txt                scalar: 0.0857724 b/c  rvv 0.2719399 b/c  speedup: 3.1704799x
wm/japanese.utf8.txt             scalar: 0.0866018 b/c  rvv 0.2630760 b/c  speedup: 3.0377647x
lipsum/Korean-Lipsum.utf8.txt    scalar: 0.0596959 b/c  rvv 0.1592920 b/c  speedup: 2.6683889x
wm/chinese.utf8.txt              scalar: 0.0855446 b/c  rvv 0.2223617 b/c  speedup: 2.5993654x
wm/thai.utf8.txt                 scalar: 0.0963939 b/c  rvv 0.2377943 b/c  speedup: 2.4669006x
lipsum/Hindi-Lipsum.utf8.txt     scalar: 0.0700269 b/c  rvv 0.1601953 b/c  speedup: 2.2876225x
lipsum/Japanese-Lipsum.utf8.txt  scalar: 0.0772785 b/c  rvv 0.1603533 b/c  speedup: 2.0750034x
lipsum/Chinese-Lipsum.utf8.txt   scalar: 0.0797070 b/c  rvv 0.1608991 b/c  speedup: 2.0186326x
lipsum/Emoji-Lipsum.utf8.txt     scalar: 0.0923569 b/c  rvv 0.1242158 b/c  speedup: 1.3449541x
