
add RVV (RISC-V Vector Extension) backend, resolves #362#373

Merged

lemire merged 1 commit into simdutf:master from camel-cdr:master
Mar 18, 2024

Conversation

@camel-cdr
Contributor

@camel-cdr camel-cdr commented Feb 29, 2024

Hi, the rvv port is finally ready to be reviewed.
All functions are vectorized, and all tests pass on qemu (with different vector length) and real hardware (with and without sanitizers enabled).

I decided to only support the v0.12 and above intrinsics, since there shouldn't be any breaking changes until v1.0.
v0.11 would also work with the current code, and clang 16 and gcc 13 support it, but gcc 13.2.0 currently has a codegen bug (ta,ma should be tu,ma), and this was the easiest way to get around that. v0.12 has support since clang 17 and gcc 14.
I changed this to minimum v0.11, but manually excluded gcc 13.2.0.

The minimum requirements are the standard V extension (so VLEN>=128 and ELEN>=64), but zvbb is used when available to accelerate endianness swaps (see simdutf_byteswap).
The code also assumes that LMUL=1 vrgather is decently fast, and that vcompress.vm at LMUL<=4 isn't horribly slow.
On the C908, LMUL=4 vcompress.vm is 4.5 times slower than other LMUL=4 vector operations, and we still get great speedups for code paths that use the instruction.

Clang 19 and above are currently the only compilers that support RISC-V target attributes, but those don't work with intrinsics yet.
So for now you need to build with -march=rv64gcv to enable the RVV backend; the code is in place and should start working once compilers support it.

I've been cross-compiling with CXX=riscv64-linux-gnu-g++-14 CXXFLAGS="-std=c++11 -Ofast -march=rv64gcv_zba_zbb_zbs" cmake -B build so far, but this requires manually commenting out line 27 in src/CMakeLists.txt, because that line checks the host machine ISA to add extra avx2 options; the same problem should also occur when cross-compiling to arm. I'm not sure how to fix this, or how simdutf handles cross-compilation in general. Native builds work as expected.

See my article and this issue for basic benchmarks on the C908 and C920.
The following are the results of running the simdutf benchmarks on the C908:

latin1.log
utf8.log
utf16.log
utf32.log

For simplicity's sake and to save time, I only ran the plain conversion functions and didn't include conversions from latin1.
Note that some plain conversion functions don't try to exit early, similar to how it's handled in icelake, although their with_errors counterparts do.
I could change this behavior, if that's preferred.

@lemire
Member

lemire commented Mar 1, 2024

This is required, because it checks for the host machine isa to add extra avx2 options, so this should also be a problem when cross-compiling to arm. I'm not sure how to fix this, and how simdutf handles cross-compilation in general. Native builds work as expected.

The CMAKE_SYSTEM_PROCESSOR is the target system, not the host system. If you are cross-compiling, you need to set CMAKE_SYSTEM_PROCESSOR accordingly. Quoting from the documentation:

"When cross-compiling, a CMAKE_TOOLCHAIN_FILE should set the CMAKE_SYSTEM_PROCESSOR variable to match target architecture that it specifies"

So you need to tell CMake that you are cross-compiling.
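For example, a minimal toolchain file along those lines might look like this (the file name is illustrative; the compiler and flags are the ones quoted earlier in this thread):

```cmake
# riscv64-toolchain.cmake (illustrative name)
# Passed via: cmake -B build -DCMAKE_TOOLCHAIN_FILE=riscv64-toolchain.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR riscv64)
set(CMAKE_C_COMPILER riscv64-linux-gnu-gcc-14)
set(CMAKE_CXX_COMPILER riscv64-linux-gnu-g++-14)
set(CMAKE_CXX_FLAGS_INIT "-march=rv64gcv_zba_zbb_zbs")
```

With CMAKE_SYSTEM_PROCESSOR set to riscv64, the ISA check in src/CMakeLists.txt sees the target architecture rather than the host, so no manual editing should be needed.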

@lemire
Member

lemire commented Mar 1, 2024

@camel-cdr Thanks. We are going to need some kind of testing.

Can you review camel-cdr#1 and possibly merge?

If we cannot actually run the code, then I would recommend we merge but in a risc_devel branch, waiting for the moment where we can run tests.

@camel-cdr camel-cdr force-pushed the master branch 4 times, most recently from 55712e6 to c226f5d Compare March 1, 2024 20:09
@camel-cdr
Contributor Author

Sorry for the GitHub Actions spam, I didn't realize force-pushing would also update them here. Sadly the actions I've added for testing haven't executed yet. That's probably because I need to use ubuntu 24.04 to get the gcc-14 package, and GitHub doesn't seem to like that.

Comment thread include/simdutf/internal/isadetection.h Outdated
@lemire
Member

lemire commented Mar 1, 2024

GCC 14 is unreleased (people run a pre-release).

But LLVM 17 should be available:

  - name: Install newer Clang
    run: |
      wget https://apt.llvm.org/llvm.sh
      chmod +x ./llvm.sh
      sudo ./llvm.sh 17

And then...

-DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17.

I have not tested it, but it should work.

@camel-cdr camel-cdr force-pushed the master branch 10 times, most recently from 112155e to 099f26c Compare March 3, 2024 19:28
@lemire
Member

lemire commented Mar 4, 2024

Thanks. I am going to try to find someone to review the code. I can vouch for the overall code organization, but I am not knowledgeable about RVV.

@richardstartin

@luhenry may be interested in this

@lemire
Member

lemire commented Mar 4, 2024

@luhenry Want to have a look?

We don't need a deep dive, just a second pair of eyes.

@DenisYaroshevskiy

@ita-sc - maybe you are interested, they are looking for RVV expertise here.

@camel-cdr
Contributor Author

camel-cdr commented Mar 4, 2024

I just looked over the benchmark results and realized that vectorized utf32->utf16 was sometimes slower than scalar, because I forgot to add a fast path; this has been fixed now.

@lemire I saw your tweet, but I don't have Twitter, so I'll respond here:

RVV 1.0 has been ratified since November 2021, and there is currently only one device available that supports it: the Kendryte K230 with a XuanTie C908 core, an in-order core with VLEN=128. I've got one, so I'm able to run benchmarks on it.

There is also the out-of-order C920, which supports the pre-ratification RVV 0.7.1, with dual issue and VLEN=128.
I've manually backported the assembly for utf8->utf16 in the benchmarks from my article. gcc-14 now has support for compiling RVV 1.0 intrinsics to the T-Head RVV 0.7.1 dialect, so once that is working, we'll also get results for that CPU.

This year, there are two new RVV 1.0 capable boards expected:

  • The Banana Pi F3, with SpacemiT X60 cores (in-order, dual-issue, VLEN=256), should be available in a few months; it should have roughly Cortex-A510 level performance.
  • The SG2380 in Q4, with SiFive P670 cores (about Cortex-A78 level performance, VLEN=128) and X280 cores (for AI and number-crunching, with VLEN=512 and slow permute instructions).

For reviewers:

The RVV spec can be found here, and the intrinsics spec here; I can also recommend this unofficial intrinsics viewer.
The code currently supports v0.11 of the intrinsics, which is forward compatible with v0.12, into which "no more incompatibility will be introduced" before the 1.0 release. Stay away from gcc-13.2 for now, as there is a codegen bug.

I'll send updated benchmark results once they've finished building/running.

@camel-cdr
Contributor Author

camel-cdr commented Mar 4, 2024

Alright, here are the new C908 benchmark results in cycles/byte:

from utf8                 fallback   rvv      speedup
convert_utf8_to_utf16     19.3437    5.3619   3.60x
convert_utf8_to_utf32     19.7537    5.5580   3.55x
count_utf8                2.4057     0.4032   5.96x
validate_utf8             13.4617    2.8034   4.80x
AVG                       13.7412    3.5316   3.89x

from utf16                fallback   rvv      speedup
convert_utf16_to_utf32    4.0307     1.8522   2.17x
convert_utf16_to_utf8     7.5495     2.8781   2.62x
count_utf16               1.5602     0.5439   2.86x
validate_utf16            3.4757     0.8822   3.93x
AVG                       4.1540     1.5391   2.69x

from utf32                fallback   rvv      speedup
convert_utf32_to_utf16    2.1792     1.1484   1.89x
convert_utf32_to_utf8     3.1355     1.7592   1.78x
validate_utf32            1.6212     0.3479   4.65x
AVG                       2.3119     1.0851   2.13x

latin1                    fallback   rvv      speedup
convert_latin1_to_utf16   0.8115     0.6542   1.24x
convert_latin1_to_utf32   1.5332     1.1407   1.34x
convert_latin1_to_utf8    8.7432     1.3902   6.28x
convert_utf8_to_latin1    8.9257     1.4897   5.99x
convert_utf16_to_latin1   0.7562     0.4685   1.61x
convert_utf32_to_latin1   0.6022     0.5302   1.13x
AVG                       3.5620     0.9455   3.76x

====                      fallback   rvv      speedup
AVG of AVGs               5.9422     1.7753   3.34x
Total AVG                 5.8758     1.7183   3.42x

The above are the averages of all inputs, here is the full table: c908-results.txt (Note that capitalized languages come from the lipsum dataset and the others from the mars wikipedia dataset)

@lemire
Member

lemire commented Mar 5, 2024

@camel-cdr Thanks. Just to reassure you, I have every intention of merging this PR. But I am hoping we can get someone to independently review the code (if only quickly) before we merge.

@camel-cdr
Contributor Author

No worries, I'd love to have somebody look over this.

@luhenry

luhenry commented Mar 6, 2024

Hi! I won't have the time to review it myself; however, I've sent a call for review to multiple people involved in RISC-V. I hope it gets answered! :)

@camel-cdr thanks for doing all this amazing work, it's clearly noticed across many projects!

Collaborator

@WojciechMula WojciechMula left a comment


First of all - that's an impressive piece of work!

I had only a few minor comments.

And two generic comments on coding style. Do not hesitate to split long, nested expressions into several variables - it increases readability. Also, do not reuse variables; in most cases you may create as many (preferably const) bindings as you need.

Comment thread src/rvv/rvv_latin1_to.inl.cpp
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
Comment thread src/rvv/rvv_utf16_to.inl.cpp Outdated
long idx = __riscv_vfirst_m_b8(__riscv_vmxor_mm_b8(surhi0, surlo1, vl), vl);
if (idx >= 0) {
  last = idx > 0 ? simdutf_byteflip<bflip>(src[idx-1]) : last;
  return result(error_code::SURROGATE, src - srcBeg + idx - (last*1u - 0xD800 < 0x400));
Collaborator

Why last*1u?

Contributor Author

@camel-cdr camel-cdr Mar 6, 2024

I was trying to avoid unsigned to signed promotion here.
Alternatively I could also make the constants unsigned, so last - 0xD800u < 0x400u.
I'll change it to the above, since it is less confusing.

long idx = __riscv_vfirst_m_b4(__riscv_vmsgtu_vx_u32m8_b4(v, 255, vl), vl);
if (idx >= 0)
  return result(error_code::TOO_LARGE, src - beg + idx);
__riscv_vse8_v_u8m2((uint8_t*)dst, __riscv_vncvt_x_x_w_u8m2(__riscv_vncvt_x_x_w_u16m4(v, vl), vl), vl);
Collaborator

Did you consider use of vcompress?

Contributor Author

Yes, but I tried to avoid vcompress when possible, because its performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.

Approximation from my measurements (vncvt is vnsrl.vi/vnsra.vi):

  • C908: vncvt_m4+vncvt_m2: 12 cycles, vcompress_m4 36 cycles.
  • C920: vncvt_m4+vncvt_m2: 6 cycles, vcompress_m4 5.4 cycles.

Member

I tried to avoid vcompress when possible, because its performance varies widely on current platforms.
I'd leave the code as is, but this is certainly something to reconsider once there is more hardware available.

It could be interesting to leave a comment in the code, explaining why you chose this route rather than another one.

Contributor Author

Good idea, I've added a comment, and also one to the utf8 -> utf16 function, where one might also consider using vrgather instead of an arithmetic expression.

Comment thread src/simdutf/rvv/implementation.h Outdated
: simdutf::implementation("rvv", "RISC-V Vector Extension",
                          internal::instruction_set::RVV)
, _supports_zvbb(
#if SIMDUTF_IS_ZVBB
Collaborator

We should rely on the detect_supported_architectures result. I think we can remove this ifdef.

Contributor Author

@camel-cdr camel-cdr Mar 6, 2024

Ah, this was meant to test for SIMDUTF_HAS_ZVBB_INTRINSICS, since we can't always enable the zvbb target attribute. I'll update this, the same problem is also in simdutf_byteflip(vuint16m8_t...

Edit: if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.

Member

@camel-cdr

if SIMDUTF_IS_ZVBB is true, then we actually just want to return true, because the code won't run on a platform where zvbb isn't available if it's enabled globally.

Don't you then want something like...

#if SIMDUTF_IS_ZVBB

static inline uint32_t detect_supported_architectures() {
  return internal::instruction_set::ZVBB;
}

#endif

like...

#elif defined(__aarch64__) || defined(_M_ARM64)

Member

I see that you already have...

 #if SIMDUTF_IS_ZVBB
   host_isa |= instruction_set::ZVBB;
 #endif

So unless I am missing something, you already have that internal::detect_supported_architectures() & internal::instruction_set::ZVBB is true if SIMDUTF_IS_ZVBB evaluates to true. Correct?

Member

Now, what you could do with #if SIMDUTF_IS_ZVBB that would be helpful to the code is to do...

#if SIMDUTF_IS_ZVBB
  bool supports_zvbb() const { return true; }
#else
  const bool _supports_zvbb;
  bool supports_zvbb() const { return _supports_zvbb; }
#endif

This would be better because if SIMDUTF_IS_ZVBB is true, then you'd avoid the runtime checks.

Contributor Author

Oh, yeah that's a better place to put the checks, I've adjusted the code.

Member

BTW you might think that the compiler would still optimize it away, but no... not even with -O3.
[screenshots of the generated assembly at -O3 omitted]

@camel-cdr
Contributor Author

@WojciechMula thanks for the review, I've adjusted the code based on your comments.

Regarding the coding style, I tried cleaning up the worst offenders.
I think naming things is hard, and a lot of the variable reuse incrementally builds up a value that could otherwise have been a single large expression.

E.g. the following builds up the surrogate pair:

/* convert [000000000000aaaa|aaaaaabbbbbbbbbb]
 * to      [110111bbbbbbbbbb|110110aaaaaaaaaa] */
vuint32m4_t sur = __riscv_vsub_vx_u32m4(utf32, 0x10000, vl);
sur = __riscv_vor_vv_u32m4(__riscv_vsll_vx_u32m4(sur, 16, vl),
                           __riscv_vsrl_vx_u32m4(sur, 10, vl), vl);
sur = __riscv_vand_vx_u32m4(sur, 0x3FF03FF, vl);
sur = __riscv_vor_vx_u32m4(sur, 0xDC00D800, vl);

Splitting that up into constant variable declarations makes the code noisier, in my opinion.
I'm the one who wrote it, so I'm obviously a bit biased.

@lemire
Member

lemire commented Mar 8, 2024

@camel-cdr I take @WojciechMula's stylistic comments (reproduced below) as suggestions more than requirements.

Do not hesitate to split long, nested expressions into several variables - it increases readability. Also, do not reuse variables; in most cases you may create as many (preferably const) bindings as you need.

The advantage of following @WojciechMula's advice is that it makes it easier to document the code, because each variable has just one meaning. But, again, I think that @WojciechMula would agree that there are several valid ways of writing code. You may consider his suggestions if you ever plan to write about this code (which you should).

@camel-cdr camel-cdr force-pushed the master branch 2 times, most recently from d6801bc to b178853 Compare March 8, 2024 16:45
@lemire
Member

lemire commented Mar 8, 2024

@camel-cdr I expect to merge this soon, but I will give some time for folks to come in and comment further. I also want to make sure that @WojciechMula has time to comment if he wants.

@WojciechMula
Collaborator

@camel-cdr as Daniel noted, the stylistic notes were just suggestions, but based on some experience. I was offline over the past weekend, but I'm back online and re-reviewing your PR right now.

for (size_t vl; len > 0; len -= vl, src += vl, dst += vl) {
  vl = __riscv_vsetvl_e16m8(len);
  vuint16m8_t v = __riscv_vle16_v_u16m8((uint16_t*)src, vl);
  __riscv_vse8_v_u8m4((uint8_t*)dst, __riscv_vnsrl_wx_u8m4(v, 8, vl), vl);
Collaborator

Did you try the segmented stores to perform swaps? Are they fast?

Contributor Author

@camel-cdr camel-cdr Mar 11, 2024

I've tried something similar, and segmented stores are quite slow on the C920, but fine on the C908, see: c908, c920

I haven't seen it be faster than regular stores.

I also don't see how that would be useful here, as segmented stores interleave vector registers, which isn't what needs to happen here.

@luhenry

Per the RISC-V Optimization Guide, equivalent loads/stores are better done with normal loads/stores rather than segmented ones. The latter are not expected to be fast on every implementation, as the C920 shows compared to the C908.

Collaborator

@luhenry thanks, I didn't know about that document

Contributor Author

FYI, if something doesn't quite match the optimization guide, I've got an open issue with thoughts: https://gitlab.com/riseproject/riscv-optimization-guide/-/issues/1

Comment thread src/rvv/rvv_utf32_to.inl.cpp Outdated
* 2: [ | aaaaa] vsrl 6
* 3: [00111111|00111111]
* 4: [ bbbbbb|000aaaaa] (1|2)&3
* 5: [11000000|11000000]
Collaborator

fix comment: 5: [10000000|11000000]

@lemire
Member

lemire commented Mar 11, 2024

@WojciechMula

stylistics notes were just suggestions, but based on some experience.

It is definitely easier if you want to write an article about it... because the text can refer unambiguously to variable "x" as one thing.

@lemire
Member

lemire commented Mar 18, 2024

Thanks @WojciechMula and others. I am merging this. It will be part of the next release.

@lemire lemire merged commit fde372f into simdutf:master Mar 18, 2024
* first invalid one, but never overestimating. */
simdutf_really_inline static size_t rvv_count_valid_utf8(const char *src, size_t len) {
  const char *beg = src;
  size_t tail = 32; // minimum of 3

Hello, is 32 the correct value here?

Contributor Author

Yes, it should be correct. It could be as low as 3, but I chose a larger number to avoid running through the vector code for small inputs.
You are right, though, it's not very clear, and the "validate first three bytes" code below doesn't do what it says because of this. The tail is also needlessly large; the fallback-to-scalar threshold and the tail length can be independent.
Looking over the code again, I also found an unrelated bug in the ASCII fast path.
I'll create a new PR that cleans this up and fixes the validation bug.
Thanks
