GH-37712: [Go][Parquet] Fix ARM64 assembly for bitmap extract bits#37785
GH-37712: [Go][Parquet] Fix ARM64 assembly for bitmap extract bits#37785zeroshade merged 16 commits intoapache:mainfrom
Conversation
|
|
|
Ugh, looks like ALL the neon assembly is borked, it just wasn't ACTUALLY being run because I guess I'll take a stab at fixing and debugging that tomorrow. |
|
Interestingly, it's only failing on linux/arm64 not macOS ARM64.... which is, well, odd. sigh Why can't things be easy. I'm guessing there's probably a feature flag that i should be checking for that I'm not |
go/parquet/internal/bmi/bmi_arm64.go
Outdated
There was a problem hiding this comment.
Hmm, so if an invalid value is passed in ARM_ENABLE_EXT, all extensions are silently disabled? It would be more user-friendly to print a warning.
There was a problem hiding this comment.
technically we aren't actually disabling the extensions on the CPU, we're just telling the library that anything that wants to check for those extensions should consider them disabled and act accordingly. For example if an invalid value is passed we behave as if those extensions weren't enabled regardless and default to the pure go implementation.
That said, I agree it would be more user-friendly to have an explicit "DISABLE" value, and then just print a warning for invalid values. I'll make that change.
There was a problem hiding this comment.
technically we aren't actually disabling the extensions on the CPU
Yes, I figured that out :-)
There was a problem hiding this comment.
updated so that if the value of the env var is "DISABLE" then we disable things, otherwise we just print any unrecognized options to stderr
There was a problem hiding this comment.
Hmm, how did you come with this fix? I thought this assembly code wasn't written by hand?
There was a problem hiding this comment.
The assembly wasn't generated by hand, but this fix was absolutely written by hand (me) by way of looking at the code, the generated assembly, and stepping through the instructions to figure out what was happening incorrectly.
In this case, the problem was that the instructions which zero-out the return value got placed after the loop condition (checking if the selection bitmap was zero), so in that scenario we ended up returning whatever happened to be the next 8 bytes after the memory location the arguments were placed. I had to look up a bunch of stuff about how Go plan9 assembly works and also ARM64 assembly, to work out that zero'ing out one area was able to be done by simply storing from ZR the zero-register.
There was a problem hiding this comment.
Do you think it's a C/C++ compiler bug? That sounds a bit unlikely to me, given how simple the extract_bits function is.
Also, why are you compiling this code for Neon? I find it unlikely that Neon can be used to speed up your SW emulation of PEXT.
Also, we have a SW emulation in Parquet C++ that processes 5 bits at a time rather than one. You could perhaps try to compile that one (or rewrite it in Go) :-)
There was a problem hiding this comment.
Without knowing how Yuqi originally generated the arm64 assembly, I'm not sure what the cause of this was. All i know is that this was never correct, but was hidden by the fact that we were unaware of a bug in the golang.org/x/sys/cpu package which was failing to detect ARM features correctly so the ARM builds were just using the pure go implementation.
Also, why are you compiling this code for Neon? I find it unlikely that Neon can be used to speed up your SW emulation of PEXT.
From what I can tell, ARM doesn't have a PEXT instruction, again this C impl was provided by a contributor but doing a bit of research it looks like the implementation there is essentially the standard way to implement PEXT for ARM in C when I do some stack overflow searching. (There's also a couple variations I came across, but they roughly even out).
Also, we have a SW emulation in Parquet C++ that processes 5 bits at a time rather than one. You could perhaps try to compile that one (or rewrite it in Go) :-)
The pure go impl already does it! if you look at bmi.go it's a direct port of the software emulation in level_conversion_inc.h from the C++ parquet impl. We only go to the assembly / non-pure-go impl if our CPU check says that it supports BMI2 so that we can call pext_u64 directly at runtime.
The compiling for NEON is to get the benefit of vectorization for the levels_to_bitmap function (i.e. the GreaterThanBitmap SIMD method).
There was a problem hiding this comment.
Without knowing how Yuqi originally generated the arm64 assembly, I'm not sure what the cause of this was.
There should be a well-defined way to generate the assembly, so that other people can later participate in maintenance.
(also, is it possible that the original arm64 assembly was simply out of sync with the current C source?)
The pure go impl already does it!
Nice, so you just need to port it to C now :-)
There was a problem hiding this comment.
The assembly wasn't generated by hand, but this fix was absolutely written by hand (me) by way of looking at the code, the generated assembly, and stepping through the instructions to figure out what was happening incorrectly.
This is cool!
Might be bug from asm2plan9s. Or errors from manually converting/formatting from normal arm assembly to plan9.
Not sure the best approach to convert normal C/assembly to plan9 assembly. asm2plan9s and c2goasm are archived. Looks we should avoid them.
For this pext like function, the go version (lookup table) should perform better than this one. Maybe we can just delete this assembly implementation?
There was a problem hiding this comment.
@cyb70289 I'd be fine with removing the arm assembly implementation of the pext function. I haven't benchmarked the lookup table vs the ARM assembly version yet. but I agree the lookup table would likely be faster.
The other issue i'm having is the bit packing side of things, though I think i may have figured out a solution to it... and i've learned more about ARM64 assembly in the last few days than i ever intended on learning LOL as I've needed to do a lot of handcoding assembly to get this to work. Looks like the issue boils down to needing anything with a label to be explicit in the assembly rather than using the WORD/BYTE syntax to pass them to the processor stream so that the labels get translated to the correct PC locations.
There should be a well-defined way to generate the assembly, so that other people can later participate in maintenance.
I agree! I was a little dissappointed that @guyuqi didn't add explicit steps to how he generated the arm64 assembly but wanted to benefit from it, so that was my mistake to accept that PR so long ago. Currently I'm trying to explicitly define (and automate as much as possible) with the generation based on when i figure out what works and solves this issue.
There was a problem hiding this comment.
Wrote a benchmark, the golang version is indeed faster than assembly. Tested on Neoverse-N1.
$ go test -bench=.
goos: linux
goarch: arm64
pkg: bmi
BenchmarkGo-80 663 1759920 ns/op
BenchmarkAsm-80 435 2720235 ns/op
PASS
ok bmi 2.838s
There was a problem hiding this comment.
well that settles that :) i'll update this and remove the asm version for arm 😄 Thanks!
There was a problem hiding this comment.
removed the ARM asm version so that it just uses the pure go lookup-table impl
go/parquet/internal/bmi/bmi_arm64.go
Outdated
There was a problem hiding this comment.
cpu.ARM64.HasASIMD is false on some arm64 platform?
It does work correctly on my test machine (ubunt22.05, aarch64, golang1.21).
There was a problem hiding this comment.
From what I could tell in my testing, it appears that on golang1.20 and below, it was ending up false on arm64 platforms when it shouldn't. Try testing it on golang1.20 instead of 1.21 and let me know, because that's what I was observing in my tests (and explains why we never caught this issue before now)
There was a problem hiding this comment.
This is strange. I tested earlier golang versions. Also tried x/sys/cpu v0.5.0 (used by arrow) and latest 0.12.0. All tests are okay.
Curious what might be the reason. If possible, can your share info about the test env? E.g.. tested in arm64 baremetal or virtual machine, linux kernel version, output of "lscpu". Thanks.
There was a problem hiding this comment.
It was a virtual machine launched via EC2 on AWS as a Debian 12 arm64 machine. That said, the original issue that launched this investigation was a mac M1 which appeared to show the same situation that using go1.20 acted as per x/sys/cpu returning false for HasASIMD while using go1.21 got true and then hit this issue and failed horribly.
I'll launch a new one when i get a chance and paste the output of lscpu here later on
There was a problem hiding this comment.
@cyb70289 Here is the output of lscpu on the AWS EC2 instance that I've been using for testing:
admin@ip-172-31-78-97:~/arrow/go/parquet/internal/utils$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
L1d: 64 KiB (1 instance)
L1i: 64 KiB (1 instance)
L2: 1 MiB (1 instance)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Mitigation; CSV2, BHB
Srbds: Not affected
Tsx async abort: Not affected
go/parquet/internal/utils/clib.go
Outdated
There was a problem hiding this comment.
I'm curious, what is this and what is it good for? Doesn't Go provide memcpy/memset replacements?
There was a problem hiding this comment.
this exists purely to test the assembly versions I provided.
The issue is that the C implementation I'm using to generate the assembly makes calls to memset and memcpy, but C's calling conventions are different than Go's. So in the assembly when it calls memset/memcpy I replaced those CALL instructions to call these replacement versions (because I can't call out to libc from inside of Go like that, no guarantee that the lib is loaded). Which are essentially just ported versions of the C memcpy/memset to ensure that the Calling conventions and argument handling are the same.
The underscore prefix there ensures that the methods are not exported from the package, so I'm only using these for clib_test.go in order to sanity check the assembly methods.
There was a problem hiding this comment.
Yuck, is there a way to avoid the calls to memset and memcpy?
There was a problem hiding this comment.
Even the original C++ implementations (bpacking_simd128_generated.h) use memset and memcpy. There might be ways to avoid those calls but they'd likely be much less efficient. We also already use amd64 versions of these anyways, so it's good to have arm64 implementations.
There was a problem hiding this comment.
By the looks of it, the calls to memcpy/memset could easily be replaced by the equivalent for loop. Given the size is small and constant, I don't think calling a generic version of memcpy/memset would be a significant win.
There was a problem hiding this comment.
Latest update has removed the memset/memcpy and replaced them with simple fixed size loops. My local testing showed all the tests passing, so assuming the CI is happy with things then we should be good here if you're happy with the results @pitrou. Let me know please when you get a chance.
|
Regardless of automation, there should be READMEs explaining the rationale and motivations, and also how to generate those files anew. |
|
@pitrou fair point, I'll update / write the README stuff for this today. |
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
|
@tschaub Can you please try testing with the latest version of this branch/PR and confirm for me that everything is still working on your end and there's no performance degradation vs the previous fix? I agree with @pitrou's statement that it's unlikely the generic |
|
@zeroshade - I only tested a couple different cases, but I don't see any negative effect with the latest changes. Here are some benchmarks transforming a And here are some numbers from a So no significant difference in those between the before (c9693c5) and after (9d4c29b) cases. The other case is a build with |
| RET | ||
|
|
||
| // func _ClibMemcpy(dst, src unsafe.Pointer, n uint) unsafe.Pointer | ||
| TEXT ·_ClibMemcpy(SB), NOSPLIT|NOFRAME, $16-24 |
There was a problem hiding this comment.
Hmm, so is this file still useful?
There was a problem hiding this comment.
I didn't make any changes to the amd64 implementation which still uses memcpy. Since there's no issues or problems with the amd64 implementation, I didn't want to change or remove anything with it. I just left the addition the Go function definition there for easier testing in the future if we need it.
There was a problem hiding this comment.
Ah, I see. Too bad, perhaps you can create a separate GH issue for it?
|
Thanks for checking @tschaub! looks great |
cyb70289
left a comment
There was a problem hiding this comment.
LGTM, thanks @zeroshade !
|
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f7530de. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…its (apache#37785) ### Rationale for this change The linked issue discovered a problem with the ARM64 implementation of the extractBits functionality which was being hidden because until `go1.21`, it looks like `golang.org/x/sys/cpu` wasn't properly detecting the ASIMD bit flag on the ARM processors and so it was using the pure go implementation. So we fix the assembly and add a test which was able to reproduce the failure without the fix on an ARM64 machine. ### What changes are included in this PR? There was a problem with the assembly as it existed as the short circuit case where the selection bitmap is 0 wasn't actually setting the 0 value back onto the stack as a return. The assembly which initialized that register to 0 only occurred AFTER the `CBZ` instruction which would jump to the bottom and return. This means that we *actually* returned whatever happened to be on the stack at that location (i.e. garbage!). This only occurred if you were using the ARM64-specific assembly implementation, not the amd64 or pure go versions. We also include a test to ensure we get the correct values for a variety of bitmaps and selections, along with this ensuring there's enough data on the stack so that if this problem re-occured we would catch it in this test. ### Are these changes tested? Test is added. * Closes: apache#37712 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Matt Topol <[email protected]>
…its (apache#37785) ### Rationale for this change The linked issue discovered a problem with the ARM64 implementation of the extractBits functionality which was being hidden because until `go1.21`, it looks like `golang.org/x/sys/cpu` wasn't properly detecting the ASIMD bit flag on the ARM processors and so it was using the pure go implementation. So we fix the assembly and add a test which was able to reproduce the failure without the fix on an ARM64 machine. ### What changes are included in this PR? There was a problem with the assembly as it existed as the short circuit case where the selection bitmap is 0 wasn't actually setting the 0 value back onto the stack as a return. The assembly which initialized that register to 0 only occurred AFTER the `CBZ` instruction which would jump to the bottom and return. This means that we *actually* returned whatever happened to be on the stack at that location (i.e. garbage!). This only occurred if you were using the ARM64-specific assembly implementation, not the amd64 or pure go versions. We also include a test to ensure we get the correct values for a variety of bitmaps and selections, along with this ensuring there's enough data on the stack so that if this problem re-occured we would catch it in this test. ### Are these changes tested? Test is added. * Closes: apache#37712 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Matt Topol <[email protected]>
Rationale for this change
The linked issue discovered a problem with the ARM64 implementation of the extractBits functionality which was being hidden because until
go1.21, it looks likegolang.org/x/sys/cpuwasn't properly detecting the ASIMD bit flag on the ARM processors and so it was using the pure go implementation. So we fix the assembly and add a test which was able to reproduce the failure without the fix on an ARM64 machine.What changes are included in this PR?
There was a problem with the assembly as it existed as the short circuit case where the selection bitmap is 0 wasn't actually setting the 0 value back onto the stack as a return. The assembly which initialized that register to 0 only occurred AFTER the
CBZinstruction which would jump to the bottom and return. This means that we actually returned whatever happened to be on the stack at that location (i.e. garbage!). This only occurred if you were using the ARM64-specific assembly implementation, not the amd64 or pure go versions.We also include a test to ensure we get the correct values for a variety of bitmaps and selections, along with this ensuring there's enough data on the stack so that if this problem re-occured we would catch it in this test.
Are these changes tested?
Test is added.