Open
Description
Suggestion from @clausecker
- conditional branches are expensive if they are hard to predict. Consider reducing the number of conditional layers to just two (one full SIMD register, then byte by byte)
- also consider unrolling the main loop a bit more
- for AVX-512 you can use masking instead of a separate loop to deal with the tail
- you should align at least one of the inputs to one SIMD register worth of data before you start with the main loop. Memory accesses crossing cache line boundaries incur an extra penalty
- there is probably not much benefit in using 512-bit registers here, since the code is largely memory bound. Using 512-bit registers can incur thermal throttling, so it is only advisable for long compute-bound sections
- instead of moving two pointers and an index, consider using a double-indexed addressing mode so you only have to advance one register per iteration
- the tail code is wrong: it always writes 8 full bytes, even if the slice is shorter. This causes incorrect results when, for example, you slice from a larger array and only compute the bitwise AND of the small slice, because the bytes past the end of the slice get overwritten
- for inspiration on how to do better, consider asking a C compiler. For example, clang suggests this kind of code for AVX2, which addresses the issues I pointed out.
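
To make the two-layer structure and the tail fix concrete, here is a minimal scalar sketch in Go. Go is an assumption on my part, based on the mention of slices. An 8-byte word stands in for one SIMD register, and `andBytes` is a hypothetical helper name, not this project's API. The point is only the control flow: one wide loop that runs while a full word remains, then a byte-by-byte tail that never writes past the end of the slice.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// andBytes stores a[i] & b[i] into dst[i] for every i below the shortest of
// the three lengths and returns that length. Hypothetical helper for
// illustration only.
func andBytes(dst, a, b []byte) int {
	n := len(dst)
	if len(a) < n {
		n = len(a)
	}
	if len(b) < n {
		n = len(b)
	}

	i := 0
	// Wide layer: 8 bytes per iteration, entered only while a full word is left.
	for ; i+8 <= n; i += 8 {
		x := binary.LittleEndian.Uint64(a[i:])
		y := binary.LittleEndian.Uint64(b[i:])
		binary.LittleEndian.PutUint64(dst[i:], x&y)
	}
	// Tail layer: byte by byte, so nothing at or beyond index n is written.
	for ; i < n; i++ {
		dst[i] = a[i] & b[i]
	}
	return n
}

func main() {
	// dst is a short slice of a larger array; the bytes after it must survive.
	backing := make([]byte, 16)
	for i := range backing {
		backing[i] = 0xEE
	}
	a := []byte{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF}
	b := []byte{0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F, 0x0F}

	n := andBytes(backing[:11], a, b)
	fmt.Printf("n=%d backing=% x\n", n, backing)
}
```

With an 11-byte slice, the wide loop runs once and the tail handles the remaining 3 bytes one at a time, so the 5 bytes of the backing array after the slice stay untouched. A real SIMD version would keep the same two layers, or replace the tail with a masked store on AVX-512.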