As discussed on Stack Overflow, I played around with the code a tiny bit to fix some warnings.
https://godbolt.org/z/bdzoEP8b8 Maybe diff that against 1brc_valid14.cpp (which is what I started with) to see what I changed; IIRC I made comments on most changes. Nothing huge, just minor things like trying a fixed-size memset(..., 0, 100) first. GCC uses rep stosq for -march=znver1, which seems crazy to me. Clang uses 3x vmovdqu ymm, which is good if the destination is aligned.
There's other stuff that could be done, like avoiding narrow signed types to save a movsxd or cdqe sign-extension instruction in some places, but that's probably minor.
Moving my comments from SO to here so a moderator is less likely to swoop in and destroy all the comments, moving them to chat where their vote totals are lost and only a tiny fraction of readers will ever look at them. (I hate it when moderators do that.)
- Why does your code use _mm_testc_si128 with a comment that says "SIMD string comparison"? That only checks that all the bits set in the right side are set in the left side. Two ASCII strings can differ while testc is still true, e.g. "ccc..." (0x63) and "aaa..." (0x61). If that works for your use-case or is intended as an early-out approximation, it deserves a bigger comment somewhere. ptest is 2 uops on most CPUs (but one on Zen 1 & 2) vs. pcmpeqb/pmovmskb also being 2 uops. test/jcc fuses into 1 uop even on AMD. Different code-size for code cache, though.
(I saw your response to that one already)
-
Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data. I think mmap should allow a larger mapping, at least with MAP_PRIVATE. If not, use mmap(MAP_FIXED_NOREPLACE) to map another page after the file-backed mapping to make sure a mapping exists. I guess it's possible the next page could be mapped but not readable, like a guard page for stack growth, in which case you're out of luck if the file length is within 15 bytes of the end of a page. It's always safe to read past the end of the file length as long as that's not into a new page. https://unix.stackexchange.com/questions/616848/what-is-the-behaviour-of-a-file-backed-memory-map-when-reading-from-or-writing-t
-
sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
-
Pointing uint64_t* at a __m128i is strict-aliasing UB. In 64-bit code for Linux, uint64_t is unsigned long which isn't even alias-compatible with long long (which is the element type for __m128i.) Use _mm_cvtsi128_si64 to get the low element with movq.
-
Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes the masking? You could just pass it the vector variables you already have, at least for the non-SLOW_HASH version. (Which you can get rid of if you allocate or mmap a slightly larger buffer; otherwise I guess declare the vector variables outside the if block so you can still pass them and just not use them in one side of hmap_insert.)
You say you tried that but found it slower. If so, that might change once you eliminate the SLOW_HASH case, or whatever other branching makes the compiler unsure those variables are already computed. Or look at the asm and see what looks inefficient, to figure out why it would be slower to do less work. I don't see any non-inline function calls, so the variables should still be hot in registers. Maybe the compiler moved the work later, worsening critical-path latency for stuff that wants pos ready sooner?
If you're working your way through the buffer with pointer updates based only on pos, that dependency-chain latency could be a bottleneck. Loading 2 YMM vectors and scanning them for newlines could give you a shorter critical path. Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that. Unfortunately blsr is 2 uops on Zen 3 and earlier, but it's still pretty cheap, and might save some of the computation based on pos, so it could end up break-even.
Perhaps align the YMM loads by 16 for efficiency on Zen 1, and right-shift the resulting 64-bit mask by 0..15, using the low 4 bits of the address as a shift count. https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines .
If you go until you get all the newlines from the window, that loop branch will probably mispredict often. So perhaps set a limit on the number of lines from that 64-byte window, such that most of the time the loop does that many iterations, only occasionally stopping early if the lines were very long. Like 4, 5, or 6 lines, something that's rarely exceeded in your typical data.
Of course, if pointer-increment critical-path latency wasn't actually hurting out-of-order execution, this is extra work for no benefit, costing throughput resources.
From another comment thread on SO under Simon's answer:
-
You could just use calloc to get zeroed memory. It's either zeroed already by the kernel, or a wider zeroing operation that you overwrite with memcpy has negligible extra cost. (Doing the zeroing yourself with a compile-time-constant size might even be better, but for the calloc library function the length is a runtime variable. OTOH, it's always the same size so branching goes the same way.)
-
Your code has at least a couple of bugs; IDK whether they affect performance. GCC/clang warn about the memset size being a constant 0: you swapped the value and length args. Also, you deref a uint64_t* pointing at a __m128i. That's the opposite of dereferencing a __m128i* (which is allowed), and is not well-defined, like I wrote in the SO answer linked in a comment in the code! (And uint64_t, aka unsigned long in a 64-bit build, is not alias-compatible with long long.) It does break in practice for int*. Use _mm_cvtsi128_si64 for vmovq to get the low half!