add RVV (RISC-V Vector Extension) backend, resolves #362#373
add RVV (RISC-V Vector Extension) backend, resolves #362#373lemire merged 1 commit intosimdutf:masterfrom
Conversation
The "When cross-compiling, a CMAKE_TOOLCHAIN_FILE should set the CMAKE_SYSTEM_PROCESSOR variable to match target architecture that it specifies" So you need to tell CMake that you are cross-compiling. |
|
@camel-cdr Thanks. We are going to need some kind of testing. Can you review camel-cdr#1 and possibly merge? If we cannot actually run the code, then I would recommend we merge but in a risc_devel branch, waiting for the moment where we can run tests. |
55712e6 to
c226f5d
Compare
|
Sorry for the github action spam, I didn't realize it would also update them here. Sadly the ones I've added for testing haven't executed yet. That's probably because I need to use ubuntu 24.04 to get the gcc-14 package, and github doesn't seem to like that. |
|
GCC 14 is unreleased (people run a pre-release). But LLVM 17 should be available: - name: Install newer Clang
run: |
wget https://apt.llvm.org/llvm.sh
chmod +x ./llvm.sh
sudo ./llvm.sh 17And then...
I have not tested it, but it should work. |
112155e to
099f26c
Compare
|
Thanks. I am going to try to find someone to review the code. I can vouch for the overall code organization, but I am knowledgeable of RVV. |
|
@luhenry may be interested in this |
|
@luhenry Want to have a look? We don't need a deep dive, just a second pair of eyes. |
|
@ita-sc - maybe you are interested, they are looking for RVV expertise here. |
|
I just looked over the benchmark results and realized that vectorized utf32->utf16 was sometimes slower than scalar, because I forgot to add a fast path, this has been fixed now. @lemire I saw your tweet, but I don't have Twitter, so I'll respond here: RVV 1.0 has been ratified since November 2021, and there is currently only one device available that supports it, the kendryte k230 with a xuantie C908 core, it's an in-order core with VLEN=128. I've got one, so I'm able to run benchmarks on it. There is also the out-of-order C920, which supports a pre-ratification RVV 0.7.1, with double issue and VLEN=128. This year, there are two new RVV 1.0 capable boards expected:
For reviewers: The RVV spec can be found here, the intrinsics spec here, I can also recommend this unofficial intrinsics viewer. I'll send updated benchmark results, once they've finish building/running. |
|
Alright, here are the new C908 benchmark results in cycles/byte: The above are the averages of all inputs, here is the full table: c908-results.txt (Note that capitalized languages come from the lipsum dataset and the others from the mars wikipedia dataset) |
|
@camel-cdr Thanks. Just to reassure you, I have every intention of merging this PR. But I am hoping we can get someone to independently review the code (if only quickly) before we merge. |
|
No worries, I'd love to have somebody look over this. |
|
Hi! I won't have the time to review it myself, however I've send a call for review to multiple people involved in RISC-V. I hope it gets answered! :) @camel-cdr thanks for doing all this amazing work, it's clearly noticed across many projects! |
WojciechMula
left a comment
There was a problem hiding this comment.
First of all - that's an impressive piece of work!
I had only a few minor comments.
And two generic comments to the coding style. Do not hesitate to split long, nested expressions into several variables - it increases readability. Also, do not reuse variables; in most cases you may create as many (preferably const) bindings as you need.
| long idx = __riscv_vfirst_m_b8(__riscv_vmxor_mm_b8(surhi0, surlo1, vl), vl); | ||
| if (idx >= 0) { | ||
| last = idx > 0 ? simdutf_byteflip<bflip>(src[idx-1]) : last; | ||
| return result(error_code::SURROGATE, src - srcBeg + idx - (last*1u - 0xD800 < 0x400)); |
There was a problem hiding this comment.
I was trying to avoid unsigned to signed promotion here.
Alternatively I could also make the constants unsigned, so last - 0xD800u < 0x400u.
I'll change it to the above, since it is less confusing.
| long idx = __riscv_vfirst_m_b4(__riscv_vmsgtu_vx_u32m8_b4(v, 255, vl), vl); | ||
| if (idx >= 0) | ||
| return result(error_code::TOO_LARGE, src - beg + idx); | ||
| __riscv_vse8_v_u8m2((uint8_t*)dst, __riscv_vncvt_x_x_w_u8m2(__riscv_vncvt_x_x_w_u16m4(v, vl), vl), vl); |
There was a problem hiding this comment.
Did you consider use of vcompress?
There was a problem hiding this comment.
Yes, but I tried to avoid vcompress when possible, because it's performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.
Approximation from my measurements (vncvt is vnsrl.vi/vnsra.vi):
There was a problem hiding this comment.
I tried to avoid vcompress when possible, because it's performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.
It could be interesting to leave comment which leaves a trace, explaining why you chose this route rather than another one.
There was a problem hiding this comment.
Good idea, I've added a comment, and also one to the utf8 -> utf16 function, where one might also consider using vrgather instead of a arithmetic expression.
| : simdutf::implementation("rvv", "RISC-V Vector Extension", | ||
| internal::instruction_set::RVV) | ||
| , _supports_zvbb( | ||
| #if SIMDUTF_IS_ZVBB |
There was a problem hiding this comment.
We should relay on detect_supported_architectures result. I think we can remove this ifdef.
There was a problem hiding this comment.
Ah, this was meant to test for SIMDUTF_HAS_ZVBB_INTRINSICS, since we can't always enable the zvbb target attribute. I'll update this, the same problem is also in simdutf_byteflip(vuint16m8_t...
Edit: if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.
There was a problem hiding this comment.
if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.
Don't you then want something like...
#if SIMDUTF_IS_ZVBB
static inline uint32_t detect_supported_architectures() {
return internal::instruction_set::ZVBB;
}
#endiflike...
There was a problem hiding this comment.
I see that you already have...
#if SIMDUTF_IS_ZVBB
host_isa |= instruction_set::ZVBB;
#endifSo unless I am missing something, you already have that internal::detect_supported_architectures() & internal::instruction_set::ZVBB is true if SIMDUTF_IS_ZVBB evaluates to true. Correct?
There was a problem hiding this comment.
Now, what you could do with #if SIMDUTF_IS_ZVBB that would be helpful to the code is to do...
#if SIMDUTF_IS_ZVBB
bool supports_zvbb() const { return true; }
#else
const bool _supports_zvbb;
bool supports_zvbb() const { return _supports_zvbb; }
#endifThis would be better because if SIMDUTF_IS_ZVBB is true, then you'd avoid the runtime checks.
There was a problem hiding this comment.
Oh, yeah that's a better place to put the checks, I've adjusted the code.
|
@WojciechMula thanks for the review, I've adjusted the code based on your comments. Regarding the coding style, I tried cleaning up the worst offenders. E.g. the following builds up the surrogate pair: simdutf/src/rvv/rvv_utf8_to.inl.cpp Lines 3 to 9 in c9b3e51 Splitting that up into constant variable declarations, to me makes the code more noisy. |
|
@camel-cdr I take @WojciechMula's stylistic comments (reproduced below) as suggestions more than requirements.
The advantage of following @WojciechMula's advice is that it makes it easier to document the code, because the variable has just one meaning. But, again, I think that @WojciechMula would agree that there are several valid ways of writing code. You may consider @WojciechMula's if you ever plan to write about this code (which you should). |
d6801bc to
b178853
Compare
|
@camel-cdr I expect to merge this soon, but I will give some time for folks to come in and comment further. I also want to make sure that @WojciechMula has time to comment if he wants. |
|
@camel-cdr as Daniel noted, stylistics notes were just suggestions, but based on some experience. I was offline in the past weekend, but I'm back online and re-reviewing your PR right now. |
| for (size_t vl; len > 0; len -= vl, src += vl, dst += vl) { | ||
| vl = __riscv_vsetvl_e16m8(len); | ||
| vuint16m8_t v = __riscv_vle16_v_u16m8((uint16_t*)src, vl); | ||
| __riscv_vse8_v_u8m4((uint8_t*)dst, __riscv_vnsrl_wx_u8m4(v, 8, vl), vl); |
There was a problem hiding this comment.
Did you try the segmented stores to perform swaps? Are they fast?
There was a problem hiding this comment.
I've tried something similar, and segmented store are quite slow on the C920, but fine on the C908, see: c908, c920
I haven't seen it be faster than regular stores.
I also don't see how that would be useful here, as segmented stores interleave vector registers, which isn't what needs to happen here.
There was a problem hiding this comment.
Per the RISC-V Optimization Guide, equivalent loads/stores are better done with normal loads/stores rather than segmented loads/stores. The later are not expected to be fast on every implementations, as the C920 shows compared to C908.
There was a problem hiding this comment.
FYI, if something doesn't quite match the optimization guide, I've got an open issue with thoughts: https://gitlab.com/riseproject/riscv-optimization-guide/-/issues/1
| * 2: [ | aaaaa] vsrl 6 | ||
| * 3: [00111111|00111111] | ||
| * 4: [ bbbbbb|000aaaaa] (1|2)&3 | ||
| * 5: [11000000|11000000] |
There was a problem hiding this comment.
fix comment: 5: [10000000|11000000]
It is definitively easier if you want to write an article about it... because the text can refer unambiguously to variable "x" as one thing. |
|
Thanks @WojciechMula and others. I am merging this. It will be part of the next release. |
| * first invalid one, but never overestimating. */ | ||
| simdutf_really_inline static size_t rvv_count_valid_utf8(const char *src, size_t len) { | ||
| const char *beg = src; | ||
| size_t tail = 32; // minimum of 3 |
There was a problem hiding this comment.
Hello, is 32 the correct value here?
There was a problem hiding this comment.
Yes, it should be correct. It could be as low as 3, but I choose a larger number to avoid running through the vector code for small inputs.
You are right, though, it's not very clear, and the "validate first three bytes" code below doesn't do what it says because of this. The tail is also needlessly large, the fallback to scalar and tail length can be independent.
Looking over the code again, I also found an unrelated bug, in the ASCII fast path.
I'll create a new PR, that cleans it up and fixes the validation bug.
Thanks


Hi, the rvv port is finally ready to be reviewed.
All functions are vectorized, and all tests pass on qemu (with different vector length) and real hardware (with and without sanitizers enabled).
I decided to only support the v0.12 and above intrinsics, since there shouldn't be any breaking changes until v1.0.I changed this to minimum v0.11, but manually excluded gcc 13.2.0.v0.11 would also work with the current code, and clang 16 and gcc 13 support it, but gcc 13.2.0 currently has a codegen bug (
ta,mashould betu,ma), and this was the easiest way to get around that. v0.12 has support since clang 17 and gcc 14.The minimum requirements are the standard V extension (so
VLEN>=128andSEW>=64), but zvbb is used when available to accelerate endianness swaps (see simdutf_byteswap).The code also assumes that
LMUL=1vrgatheris decently fast, andvcompress.vmLMUL<=4isn't horribly slow.On the C908,
LMUL=4vcompress.vmit's 4.5 times slower than otherLMUL=4vector operations, and we still get great speedups for code paths that use the instruction.clang 19 and above is, as of now, the only compiler that supports RISC-V target attributes, but that doesn't work with intrinsics yet.
So for now you need to build with
-march=rv64gcvto enable the RVV backend, but the code is in place and should start working once compilers support it.I've been cross-compiling with
CXX=riscv64-linux-gnu-g++-14 CXXFLAGS="-std=c++11 -Ofast -march=rv64gcv_zba_zbb_zbs" cmake -B buildso far, but this requires manually commenting out line 27 in src/CMakeLists.txt. This is required, because it checks for the host machine isa to add extra avx2 options, so this should also be a problem when cross-compiling to arm. I'm not sure how to fix this, and how simdutf handles cross-compilation in general. Native builds work as expected.See my article and this issue for basic benchmarks on the C908 and C920.
The following are the results of running the simdutf benchmarks on the C908:
latin1.log
utf8.log
utf16.log
utf32.log
For simplicity's sake and to save time, I only ran the plain conversion functions and didn't include conversions from latin1.
Note that some plain conversion functions don't try to exit early, similar to how it's handled in icelake, although their with_errors counterparts do.
I could change this behavior, if that's preferred.