
add RVV (RISC-V Vector Extension) backend, resolves #362#373

Merged

lemire merged 1 commit into simdutf:master from camel-cdr:master
Mar 18, 2024

Conversation

@camel-cdr
Contributor

@camel-cdr camel-cdr commented Feb 29, 2024

Hi, the rvv port is finally ready to be reviewed.
All functions are vectorized, and all tests pass on qemu (with different vector length) and real hardware (with and without sanitizers enabled).

I decided to only support the v0.12 and above intrinsics, since there shouldn't be any breaking changes until v1.0.
v0.11 would also work with the current code, and clang 16 and gcc 13 support it, but gcc 13.2.0 currently has a codegen bug (ta,ma should be tu,ma), and this was the easiest way to get around that. v0.12 has support since clang 17 and gcc 14.
I changed this to minimum v0.11, but manually excluded gcc 13.2.0.

The minimum requirements are the standard V extension (so VLEN>=128 and ELEN>=64), but zvbb is used when available to accelerate endianness swaps (see simdutf_byteswap).
The code also assumes that LMUL=1 vrgather is decently fast, and that vcompress.vm at LMUL<=4 isn't horribly slow.
On the C908, LMUL=4 vcompress.vm is 4.5 times slower than other LMUL=4 vector operations, and we still get great speedups for code paths that use the instruction.

Clang 19 and above are currently the only compilers that support RISC-V target attributes, but those don't work with intrinsics yet.
So for now you need to build with -march=rv64gcv to enable the RVV backend; the code is in place and should start working once compilers support it.

I've been cross-compiling with CXX=riscv64-linux-gnu-g++-14 CXXFLAGS="-std=c++11 -Ofast -march=rv64gcv_zba_zbb_zbs" cmake -B build so far, but this requires manually commenting out line 27 in src/CMakeLists.txt, because that line checks the host machine ISA to add extra avx2 options; the same problem should also occur when cross-compiling to arm. I'm not sure how to fix this, or how simdutf handles cross-compilation in general. Native builds work as expected.

See my article and this issue for basic benchmarks on the C908 and C920.
The following are the results of running the simdutf benchmarks on the C908:

latin1.log
utf8.log
utf16.log
utf32.log

For simplicity's sake and to save time, I only ran the plain conversion functions and didn't include conversions from latin1.
Note that some plain conversion functions don't try to exit early, similar to how it's handled in icelake, although their with_errors counterparts do.
I could change this behavior, if that's preferred.

@lemire
Member

lemire commented Mar 1, 2024

This is required, because it checks for the host machine isa to add extra avx2 options, so this should also be a problem when cross-compiling to arm. I'm not sure how to fix this, and how simdutf handles cross-compilation in general. Native builds work as expected.

The CMAKE_SYSTEM_PROCESSOR is the target system, not the host system. If you are cross-compiling, you need to set CMAKE_SYSTEM_PROCESSOR accordingly. Quoting from the documentation:

"When cross-compiling, a CMAKE_TOOLCHAIN_FILE should set the CMAKE_SYSTEM_PROCESSOR variable to match target architecture that it specifies"

So you need to tell CMake that you are cross-compiling.
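For example, a minimal toolchain file along those lines might look like this (the file name is illustrative; the compiler and flags are the ones quoted earlier in this thread):

```cmake
# riscv64-toolchain.cmake (illustrative name)
# Passed via: cmake -B build -DCMAKE_TOOLCHAIN_FILE=riscv64-toolchain.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR riscv64)
set(CMAKE_C_COMPILER riscv64-linux-gnu-gcc-14)
set(CMAKE_CXX_COMPILER riscv64-linux-gnu-g++-14)
set(CMAKE_CXX_FLAGS_INIT "-march=rv64gcv_zba_zbb_zbs")
```

With CMAKE_SYSTEM_PROCESSOR set to riscv64, the ISA check in src/CMakeLists.txt sees the target architecture rather than the host, so no manual editing should be needed.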

@lemire
Member

lemire commented Mar 1, 2024

@camel-cdr Thanks. We are going to need some kind of testing.

Can you review camel-cdr#1 and possibly merge?

If we cannot actually run the code, then I would recommend we merge but in a risc_devel branch, waiting for the moment where we can run tests.

@camel-cdr camel-cdr force-pushed the master branch 4 times, most recently from 55712e6 to c226f5d Compare March 1, 2024 20:09
@camel-cdr
Contributor Author

Sorry for the GitHub Actions spam, I didn't realize force-pushing would also update them here. Sadly the actions I've added for testing haven't executed yet. That's probably because I need to use ubuntu 24.04 to get the gcc-14 package, and GitHub doesn't seem to like that.

Comment thread include/simdutf/internal/isadetection.h Outdated
@lemire
Member

lemire commented Mar 1, 2024

GCC 14 is unreleased (people run a pre-release).

But LLVM 17 should be available:

  - name: Install newer Clang
    run: |
      wget https://apt.llvm.org/llvm.sh
      chmod +x ./llvm.sh
      sudo ./llvm.sh 17

And then...

-DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17.

I have not tested it, but it should work.

@camel-cdr camel-cdr force-pushed the master branch 10 times, most recently from 112155e to 099f26c Compare March 3, 2024 19:28
@lemire
Member

lemire commented Mar 4, 2024

Thanks. I am going to try to find someone to review the code. I can vouch for the overall code organization, but I am not knowledgeable about RVV.

@richardstartin

@luhenry may be interested in this

@lemire
Member

lemire commented Mar 4, 2024

@luhenry Want to have a look?

We don't need a deep dive, just a second pair of eyes.

@DenisYaroshevskiy

@ita-sc - maybe you are interested, they are looking for RVV expertise here.

@camel-cdr
Contributor Author

camel-cdr commented Mar 4, 2024

I just looked over the benchmark results and realized that vectorized utf32->utf16 was sometimes slower than scalar, because I forgot to add a fast path; this has been fixed now.

@lemire I saw your tweet, but I don't have Twitter, so I'll respond here:

RVV 1.0 has been ratified since November 2021, and there is currently only one device available that supports it: the Kendryte K230 with a XuanTie C908 core, an in-order core with VLEN=128. I've got one, so I'm able to run benchmarks on it.

There is also the out-of-order C920, which supports the pre-ratification RVV 0.7.1, with dual issue and VLEN=128.
I've manually backported the assembly for utf8->utf16 in the benchmarks from my article. gcc-14 now has support for compiling RVV 1.0 intrinsics to the T-Head RVV 0.7.1 dialect, so once that is working, we'll also get results for that CPU.

This year, there are two new RVV 1.0 capable boards expected:

  • The Banana Pi F3, with SpacemiT X60 cores (in-order, dual-issue, VLEN=256), should be available in a few months; it should have roughly Cortex-A510 level performance.
  • The SG2380 in Q4, with SiFive P670 cores (about Cortex-A78 level performance, VLEN=128) and X280 cores (for AI and number-crunching, with VLEN=512 and slow permute instructions).

For reviewers:

The RVV spec can be found here, and the intrinsics spec here; I can also recommend this unofficial intrinsics viewer.
The code currently supports v0.11 of the intrinsics, which is forward compatible with v0.12, into which "no more incompatibility will be introduced" before the 1.0 release. Stay away from gcc-13.2 for now, as there is a codegen bug.

I'll send updated benchmark results once they've finished building/running.

@camel-cdr
Contributor Author

camel-cdr commented Mar 4, 2024

Alright, here are the new C908 benchmark results in cycles/byte:

from utf8                 fallback   rvv      speedup
convert_utf8_to_utf16     19.3437    5.3619   3.60x
convert_utf8_to_utf32     19.7537    5.5580   3.55x
count_utf8                2.4057     0.4032   5.96x
validate_utf8             13.4617    2.8034   4.80x
AVG                       13.7412    3.5316   3.89x

from utf16                fallback   rvv      speedup
convert_utf16_to_utf32    4.0307     1.8522   2.17x
convert_utf16_to_utf8     7.5495     2.8781   2.62x
count_utf16               1.5602     0.5439   2.86x
validate_utf16            3.4757     0.8822   3.93x
AVG                       4.1540     1.5391   2.69x

from utf32                fallback   rvv      speedup
convert_utf32_to_utf16    2.1792     1.1484   1.89x
convert_utf32_to_utf8     3.1355     1.7592   1.78x
validate_utf32            1.6212     0.3479   4.65x
AVG                       2.3119     1.0851   2.13x

latin1                    fallback   rvv      speedup
convert_latin1_to_utf16   0.8115     0.6542   1.24x
convert_latin1_to_utf32   1.5332     1.1407   1.34x
convert_latin1_to_utf8    8.7432     1.3902   6.28x
convert_utf8_to_latin1    8.9257     1.4897   5.99x
convert_utf16_to_latin1   0.7562     0.4685   1.61x
convert_utf32_to_latin1   0.6022     0.5302   1.13x
AVG                       3.5620     0.9455   3.76x

====                      fallback   rvv      speedup
AVG of AVGs               5.9422     1.7753   3.34x
Total AVG                 5.8758     1.7183   3.42x

The above are the averages of all inputs, here is the full table: c908-results.txt (Note that capitalized languages come from the lipsum dataset and the others from the mars wikipedia dataset)

@lemire
Member

lemire commented Mar 5, 2024

@camel-cdr Thanks. Just to reassure you, I have every intention of merging this PR. But I am hoping we can get someone to independently review the code (if only quickly) before we merge.

@camel-cdr
Contributor Author

No worries, I'd love to have somebody look over this.

@luhenry

luhenry commented Mar 6, 2024

Hi! I won't have the time to review it myself; however, I've sent a call for review to multiple people involved in RISC-V. I hope it gets answered! :)

@camel-cdr thanks for doing all this amazing work, it's clearly noticed across many projects!

Collaborator

@WojciechMula WojciechMula left a comment


First of all - that's an impressive piece of work!

I had only a few minor comments.

And two generic comments on coding style. Do not hesitate to split long, nested expressions into several variables - it increases readability. Also, do not reuse variables; in most cases you may create as many (preferably const) bindings as you need.

Comment thread src/rvv/rvv_latin1_to.inl.cpp
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
long idx = __riscv_vfirst_m_b8(__riscv_vmxor_mm_b8(surhi0, surlo1, vl), vl);
if (idx >= 0) {
  last = idx > 0 ? simdutf_byteflip<bflip>(src[idx-1]) : last;
  return result(error_code::SURROGATE, src - srcBeg + idx - (last*1u - 0xD800 < 0x400));
Collaborator

Why last*1u?

Contributor Author

@camel-cdr camel-cdr Mar 6, 2024

I was trying to avoid unsigned to signed promotion here.
Alternatively I could also make the constants unsigned, so last - 0xD800u < 0x400u.
I'll change it to the above, since it is less confusing.

long idx = __riscv_vfirst_m_b4(__riscv_vmsgtu_vx_u32m8_b4(v, 255, vl), vl);
if (idx >= 0)
  return result(error_code::TOO_LARGE, src - beg + idx);
__riscv_vse8_v_u8m2((uint8_t*)dst, __riscv_vncvt_x_x_w_u8m2(__riscv_vncvt_x_x_w_u16m4(v, vl), vl), vl);
Collaborator

Did you consider use of vcompress?

Contributor Author

Yes, but I tried to avoid vcompress when possible, because its performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.

Approximation from my measurements (vncvt is vnsrl.vi/vnsra.vi):

  • C908: vncvt_m4+vncvt_m2: 12 cycles, vcompress_m4 36 cycles.
  • C920: vncvt_m4+vncvt_m2: 6 cycles, vcompress_m4 5.4 cycles.

Member

I tried to avoid vcompress when possible, because its performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.

It could be interesting to leave a comment in the code, explaining why you chose this route rather than another one.

Contributor Author

Good idea, I've added a comment, and also one to the utf8 -> utf16 function, where one might also consider using vrgather instead of an arithmetic expression.

Comment thread src/simdutf/rvv/implementation.h Outdated
: simdutf::implementation("rvv", "RISC-V Vector Extension",
                          internal::instruction_set::RVV)
, _supports_zvbb(
#if SIMDUTF_IS_ZVBB
Collaborator

We should rely on the detect_supported_architectures result. I think we can remove this ifdef.

Contributor Author

@camel-cdr camel-cdr Mar 6, 2024

Ah, this was meant to test for SIMDUTF_HAS_ZVBB_INTRINSICS, since we can't always enable the zvbb target attribute. I'll update this, the same problem is also in simdutf_byteflip(vuint16m8_t...

Edit: if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.

Member

@camel-cdr

if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.

Don't you then want something like...

#if SIMDUTF_IS_ZVBB

static inline uint32_t detect_supported_architectures() {
  return internal::instruction_set::ZVBB;
}

#endif

like...

#elif defined(__aarch64__) || defined(_M_ARM64)

Member

I see that you already have...

 #if SIMDUTF_IS_ZVBB
   host_isa |= instruction_set::ZVBB;
 #endif

So unless I am missing something, you already have that internal::detect_supported_architectures() & internal::instruction_set::ZVBB is true if SIMDUTF_IS_ZVBB evaluates to true. Correct?

Member

Now, what you could do with #if SIMDUTF_IS_ZVBB that would be helpful to the code is to do...

#if SIMDUTF_IS_ZVBB
  bool supports_zvbb() const { return true; }
#else
  const bool _supports_zvbb;
  bool supports_zvbb() const { return _supports_zvbb; }
#endif

This would be better because if SIMDUTF_IS_ZVBB is true, then you'd avoid the runtime checks.

Contributor Author

Oh, yeah that's a better place to put the checks, I've adjusted the code.

Member

BTW you might think that the compiler would still optimize it away, but no... not even with -O3.
[screenshots of the generated assembly at -O3 omitted]

@camel-cdr
Contributor Author

@WojciechMula thanks for the review, I've adjusted the code based on your comments.

Regarding the coding style, I tried cleaning up the worst offenders.
I think naming things is hard, and a lot of the variable reuse incrementally builds up a value that could otherwise have been a single large expression.

E.g. the following builds up the surrogate pair:

/* convert [000000000000aaaa|aaaaaabbbbbbbbbb]
 * to      [110111bbbbbbbbbb|110110aaaaaaaaaa] */
vuint32m4_t sur = __riscv_vsub_vx_u32m4(utf32, 0x10000, vl);
sur = __riscv_vor_vv_u32m4(__riscv_vsll_vx_u32m4(sur, 16, vl),
                           __riscv_vsrl_vx_u32m4(sur, 10, vl), vl);
sur = __riscv_vand_vx_u32m4(sur, 0x3FF03FF, vl);
sur = __riscv_vor_vx_u32m4(sur, 0xDC00D800, vl);

Splitting that up into constant variable declarations makes the code noisier, in my opinion.
I'm the one who wrote it, so I'm obviously a bit biased.

@lemire
Member

lemire commented Mar 8, 2024

@camel-cdr I take @WojciechMula's stylistic comments (reproduced below) as suggestions more than requirements.

Do not hesitate to split long, nested expressions into several variables - it increases readability. Also, do not reuse variables; in most cases you may create as many (preferably const) bindings as you need.

The advantage of following @WojciechMula's advice is that it makes it easier to document the code, because each variable has just one meaning. But, again, I think that @WojciechMula would agree that there are several valid ways of writing code. You may consider his suggestions if you ever plan to write about this code (which you should).

@camel-cdr camel-cdr force-pushed the master branch 2 times, most recently from d6801bc to b178853 Compare March 8, 2024 16:45
@lemire
Member

lemire commented Mar 8, 2024

@camel-cdr I expect to merge this soon, but I will give some time for folks to come in and comment further. I also want to make sure that @WojciechMula has time to comment if he wants.

@WojciechMula
Collaborator

@camel-cdr as Daniel noted, the stylistic notes were just suggestions, but based on some experience. I was offline over the past weekend, but I'm back online and re-reviewing your PR right now.

for (size_t vl; len > 0; len -= vl, src += vl, dst += vl) {
  vl = __riscv_vsetvl_e16m8(len);
  vuint16m8_t v = __riscv_vle16_v_u16m8((uint16_t*)src, vl);
  __riscv_vse8_v_u8m4((uint8_t*)dst, __riscv_vnsrl_wx_u8m4(v, 8, vl), vl);
Collaborator

Did you try the segmented stores to perform swaps? Are they fast?

Contributor Author

@camel-cdr camel-cdr Mar 11, 2024

I've tried something similar, and segmented stores are quite slow on the C920, but fine on the C908, see: c908, c920

I haven't seen it be faster than regular stores.

I also don't see how that would be useful here, as segmented stores interleave vector registers, which isn't what needs to happen here.

@luhenry

Per the RISC-V Optimization Guide, equivalent loads/stores are better done with normal loads/stores rather than segmented ones. The latter are not expected to be fast on every implementation, as the C920 shows compared to the C908.

Collaborator

@luhenry thanks, I didn't know about that document

Contributor Author

FYI, if something doesn't quite match the optimization guide, I've got an open issue with thoughts: https://gitlab.com/riseproject/riscv-optimization-guide/-/issues/1

Comment thread src/rvv/rvv_utf32_to.inl.cpp Outdated
* 2: [ | aaaaa] vsrl 6
* 3: [00111111|00111111]
* 4: [ bbbbbb|000aaaaa] (1|2)&3
* 5: [11000000|11000000]
Collaborator

fix comment: 5: [10000000|11000000]

@lemire
Member

lemire commented Mar 11, 2024

@WojciechMula

stylistics notes were just suggestions, but based on some experience.

It is definitely easier if you want to write an article about it... because the text can refer unambiguously to variable "x" as one thing.

@lemire
Member

lemire commented Mar 18, 2024

Thanks @WojciechMula and others. I am merging this. It will be part of the next release.

@lemire lemire merged commit fde372f into simdutf:master Mar 18, 2024
* first invalid one, but never overestimating. */
simdutf_really_inline static size_t rvv_count_valid_utf8(const char *src, size_t len) {
  const char *beg = src;
  size_t tail = 32; // minimum of 3

Hello, is 32 the correct value here?

Contributor Author

Yes, it should be correct. It could be as low as 3, but I chose a larger number to avoid running through the vector code for small inputs.
You are right, though, it's not very clear, and the "validate first three bytes" code below doesn't do what it says because of this. The tail is also needlessly large; the fallback-to-scalar threshold and the tail length can be independent.
Looking over the code again, I also found an unrelated bug in the ASCII fast path.
I'll create a new PR that cleans this up and fixes the validation bug.
Thanks
