As discussed on Stack Overflow, I played around with the code a tiny bit to fix some warnings.
https://godbolt.org/z/bdzoEP8b8 Maybe diff that against 1brc_valid14.cpp (which is what I started with) to see what I changed; IIRC I made comments on most changes. Nothing huge, just minor things like trying a fixed-size memset(..., 0, 100) first. GCC uses rep stosq for -march=znver1, which seems crazy to me. Clang uses 3x vmovdqu ymm, which is good if the destination is aligned.
There's other stuff that could be done, like avoiding narrow signed types to save a movsxd or cdqe sign-extension instruction in some places, but that's probably minor.
Moving my comments from SO to here so a moderator is less likely to swoop in and destroy all the comments, moving them to chat where their vote totals are lost and only a tiny fraction of readers will ever look at them. (I hate it when moderators do that.)
- Why does your code use _mm_testc_si128 with a comment that says "SIMD string comparison"? That only checks that all the bits set in the right side are set in the left side. Two ASCII strings can differ while testc is still true, e.g. "ccc..." (0x63) and "aaa..." (0x61). If that works for your use-case or is intended as an early-out approximation, it deserves a bigger comment somewhere. ptest is 2 uops on most CPUs (but one on Zen 1 & 2) vs. pcmpeqb/pmovmskb also being 2 uops. test/jcc fuses into 1 uop even on AMD. Different code-size for code cache, though.
(I saw your response to that one already)
-
Re: reading past the end of the buffer: just allocate at least 15 extra bytes in your buffer so you can safely do a 16-byte load from the last byte of actual data. I think mmap should allow a larger mapping, at least with MAP_PRIVATE. If not, use mmap(MAP_FIXED_NOREPLACE) to map another page after the file-backed mapping to make sure a mapping exists. I guess it's possible the next page could be mapped but not readable, like a guard page for stack growth, in which case you're out of luck if the file length is within 15 bytes of the end of a page. It's always safe to read past the end of the file length as long as that's not into a new page. https://unix.stackexchange.com/questions/616848/what-is-the-behaviour-of-a-file-backed-memory-map-when-reading-from-or-writing-t
-
sumchars should use _mm_shuffle_epi32 for better non-AVX performance (avoid a movdqa), or _mm_unpackhi_epi64 for AVX (avoid an immediate). See https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-sse-vector-sum-or-other-reduction
-
Pointing uint64_t* at a __m128i is strict-aliasing UB. In 64-bit code for Linux, uint64_t is unsigned long which isn't even alias-compatible with long long (which is the element type for __m128i.) Use _mm_cvtsi128_si64 to get the low element with movq.
-
Am I reading this right that hmap_insert reloads the key from data, regenerates the mask, and redoes the masking? You could just pass it the vector variables you already have, at least for the non-SLOW_HASH version. (Which you can get rid of if you allocate or mmap a slightly larger buffer; otherwise I guess declare the vector variables outside the if block so you can still pass them and just not use them in one side of hmap_insert.)
You say you tried that but found it slower. If so, that might change once you eliminate the SLOW_HASH case, or whatever other branching makes the compiler unsure those variables are already computed. Or look at the asm and see what looks inefficient, to figure out why it would be slower to do less work. I don't see any non-inline function calls, so the variables should still be hot in registers. Maybe the compiler moved the work later, worsening critical-path latency for stuff that wants pos ready sooner?
If you're working your way through the buffer with pointer updates based only on pos, that dependency-chain latency could be a bottleneck. Loading 2 YMM vectors and scanning them for newlines could give you a shorter critical path. Like 2x vpcmpeqb / vpmovmskb / shl / or to create a 64-bit map of the upcoming newlines, then loop over that with blsr / tzcnt, so the loop-carried dep chain is just blsr until the next load based on that. Unfortunately blsr is 2 uops on Zen 3 and earlier, but it's still pretty cheap, and might save some of the computation based on pos, so it could end up break-even.
Perhaps align the YMM loads by 16 for efficiency on Zen 1, and right-shift the resulting 64-bit mask by 0..15, using the low 4 bits of the address as a shift count. https://travisdowns.github.io/blog/2019/06/11/speed-limits.html#load-split-cache-lines .
If you go until you get all the newlines from the window, that loop branch will probably mispredict often. So perhaps set a limit on the number of lines from that 64-byte window, such that most of the time the loop does that many iterations, only occasionally stopping early if the lines were very long. Like 4, 5, or 6 lines, something that's rarely exceeded in your typical data.
Of course, if pointer-increment critical-path latency wasn't actually hurting out-of-order execution, this is extra work for no benefit, costing throughput resources.
From another comment thread on SO under Simon's answer:
-
You could just use calloc to get zeroed memory. It's either zeroed already by the kernel, or a wider zeroing operation that you overwrite with memcpy has negligible extra cost. (Doing the zeroing yourself with a compile-time-constant size might even be better, but for the calloc library function the length is a runtime variable. OTOH, it's always the same size so branching goes the same way.)
-
Your code has at least a couple of bugs; IDK whether they affect performance. GCC/clang warn about the memset size being a constant 0: you swapped the value and length args. Also, you deref a uint64_t* pointing at a __m128i. That's the opposite of dereferencing a __m128i* (which is allowed), and is not well-defined, like I wrote in the SO answer linked in a comment in the code! (And uint64_t, aka unsigned long in a 64-bit build, is not alias-compatible with long long.) It does break in practice for int*. Use _mm_cvtsi128_si64 for vmovq to get the low half!