TODO

[Suggestion](https://github.com/clausecker/pospop/issues/2) from @clausecker

- [ ] conditional branches are expensive if they are hard to predict. Consider reducing the number of conditional layers to just two (one full SIMD register at a time, then byte by byte)
- [ ] also consider unrolling the main loop a bit more
- [ ] for AVX-512 you can use masking instead of a separate loop to deal with the tail
- [ ] you should align at least one of the inputs to the size of one SIMD register before you start the main loop. Memory accesses that cross cache line boundaries incur an extra penalty
- [ ] there is probably not much benefit in using 512-bit registers here since the code is largely memory bound. Using 512-bit registers incurs a thermal throttle, so it's only advisable for long compute-bound sections
- [ ] instead of moving two pointers and an index, consider using a double-indexed addressing mode so you only have to advance one register per iteration
- [x] the tail code is wrong: it always writes 8 full bytes, even if the slice is shorter. This causes incorrect results when, for example, you slice from a larger array and only compute the bitwise AND of the small slice
- [ ] For inspiration on how to do better, consider asking a C compiler. For example, clang suggests this kind of code for AVX2 which addresses the issues I raised.
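
The two-layer structure and the tail fix from the checklist can be illustrated in plain Go. This is only a sketch, not the clang output referred to above and not this library's assembly: `andBytes` is a hypothetical helper, and an 8-byte word stands in for one SIMD register. The point is that the wide loop only runs while a full word remains, and the byte loop handles the rest without ever writing past the end of the slice.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// andBytes is a hypothetical helper for illustration only.
// It stores a[i] & b[i] into dst[i] for every i below the shortest
// of the three lengths and returns that length. It never writes
// beyond that point, even if dst is a window into a larger array.
func andBytes(dst, a, b []byte) int {
	n := len(dst)
	if len(a) < n {
		n = len(a)
	}
	if len(b) < n {
		n = len(b)
	}

	i := 0
	// Layer 1: full 8-byte words (stand-in for one SIMD register).
	for ; i+8 <= n; i += 8 {
		x := binary.LittleEndian.Uint64(a[i:])
		y := binary.LittleEndian.Uint64(b[i:])
		binary.LittleEndian.PutUint64(dst[i:], x&y)
	}
	// Layer 2: remaining bytes, one at a time.
	for ; i < n; i++ {
		dst[i] = a[i] & b[i]
	}
	return n
}

func main() {
	// A 12-byte backing array filled with 0xFF; only the first
	// 3 bytes are handed to andBytes as the destination.
	backing := make([]byte, 12)
	for i := range backing {
		backing[i] = 0xFF
	}
	a := []byte{0x0F, 0x0F, 0x0F}
	b := []byte{0x3C, 0x3C, 0x3C}

	andBytes(backing[:3], a, b)
	fmt.Printf("% x\n", backing) // 0c 0c 0c ff ff ff ff ff ff ff ff ff
}
```

The bytes after the 3-byte window stay `ff`, which is the behavior the fixed tail code needs to guarantee when the destination is sliced from a larger array.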
