GH-37712: [Go][Parquet] Fix ARM64 assembly for bitmap extract bits by zeroshade · Pull Request #37785 · apache/arrow

zeroshade · 2023-09-18T21:40:00Z

Rationale for this change

The linked issue discovered a problem with the ARM64 implementation of the extractBits functionality which was being hidden because until go1.21, it looks like golang.org/x/sys/cpu wasn't properly detecting the ASIMD bit flag on the ARM processors and so it was using the pure go implementation. So we fix the assembly and add a test which was able to reproduce the failure without the fix on an ARM64 machine.

What changes are included in this PR?

There was a problem with the assembly as it existed as the short circuit case where the selection bitmap is 0 wasn't actually setting the 0 value back onto the stack as a return. The assembly which initialized that register to 0 only occurred AFTER the CBZ instruction which would jump to the bottom and return. This means that we actually returned whatever happened to be on the stack at that location (i.e. garbage!). This only occurred if you were using the ARM64-specific assembly implementation, not the amd64 or pure go versions.

We also include a test to ensure we get the correct values for a variety of bitmaps and selections, along with this ensuring there's enough data on the stack so that if this problem re-occured we would catch it in this test.

Are these changes tested?

Test is added.

Closes: [Go][Parquet] Reading a parquet file with a nested field that is null for all records panics when using Go 1.21.1 #37712

github-actions · 2023-09-18T21:40:27Z

⚠️ GitHub issue #37712 has been automatically assigned in GitHub to PR creator.

zeroshade · 2023-09-18T21:53:38Z

Ugh, looks like ALL the neon assembly is borked, it just wasn't ACTUALLY being run because golang.org/x/sys/cpu wasn't registering the ASIMD bit, but github.com/klauspost/cpuid/v2 DOES correctly recognize it and now we're seeing it die.

I guess I'll take a stab at fixing and debugging that tomorrow.

zeroshade · 2023-09-18T22:00:09Z

Interestingly, it's only failing on linux/arm64 not macOS ARM64.... which is, well, odd. sigh Why can't things be easy. I'm guessing there's probably a feature flag that i should be checking for that I'm not

go/parquet/internal/bmi/bmi_arm64.go

pitrou · 2023-09-19T16:44:21Z

go/parquet/internal/bmi/bmi_arm64.go

Hmm, so if an invalid value is passed in ARM_ENABLE_EXT, all extensions are silently disabled? It would be more user-friendly to print a warning.

technically we aren't actually disabling the extensions on the CPU, we're just telling the library that anything that wants to check for those extensions should consider them disabled and act accordingly. For example if an invalid value is passed we behave as if those extensions weren't enabled regardless and default to the pure go implementation.

That said, I agree it would be more user-friendly to have an explicit "DISABLE" value, and then just print a warning for invalid values. I'll make that change.

technically we aren't actually disabling the extensions on the CPU

Yes, I figured that out :-)

updated so that if the value of the env var is "DISABLE" then we disable things, otherwise we just print any unrecognized options to stderr

pitrou · 2023-09-19T16:45:18Z

go/parquet/internal/bmi/bitmap_neon_arm64.s

Hmm, how did you come with this fix? I thought this assembly code wasn't written by hand?

The assembly wasn't generated by hand, but this fix was absolutely written by hand (me) by way of looking at the code, the generated assembly, and stepping through the instructions to figure out what was happening incorrectly.

In this case, the problem was that the instructions which zero-out the return value got placed after the loop condition (checking if the selection bitmap was zero), so in that scenario we ended up returning whatever happened to be the next 8 bytes after the memory location the arguments were placed. I had to look up a bunch of stuff about how Go plan9 assembly works and also ARM64 assembly, to work out that zero'ing out one area was able to be done by simply storing from ZR the zero-register.

Do you think it's a C/C++ compiler bug? That sounds a bit unlikely to me, given how simple the extract_bits function is.

Also, why are you compiling this code for Neon? I find it unlikely that Neon can be used to speed up your SW emulation of PEXT.

Also, we have a SW emulation in Parquet C++ that processes 5 bits at a time rather than one. You could perhaps try to compile that one (or rewrite it in Go) :-)

Without knowing how Yuqi originally generated the arm64 assembly, I'm not sure what the cause of this was. All i know is that this was never correct, but was hidden by the fact that we were unaware of a bug in the golang.org/x/sys/cpu package which was failing to detect ARM features correctly so the ARM builds were just using the pure go implementation.

Also, why are you compiling this code for Neon? I find it unlikely that Neon can be used to speed up your SW emulation of PEXT.

From what I can tell, ARM doesn't have a PEXT instruction, again this C impl was provided by a contributor but doing a bit of research it looks like the implementation there is essentially the standard way to implement PEXT for ARM in C when I do some stack overflow searching. (There's also a couple variations I came across, but they roughly even out).

Also, we have a SW emulation in Parquet C++ that processes 5 bits at a time rather than one. You could perhaps try to compile that one (or rewrite it in Go) :-)

The pure go impl already does it! if you look at bmi.go it's a direct port of the software emulation in level_conversion_inc.h from the C++ parquet impl. We only go to the assembly / non-pure-go impl if our CPU check says that it supports BMI2 so that we can call pext_u64 directly at runtime.

The compiling for NEON is to get the benefit of vectorization for the levels_to_bitmap function (i.e. the GreaterThanBitmap SIMD method).

Without knowing how Yuqi originally generated the arm64 assembly, I'm not sure what the cause of this was.

There should be a well-defined way to generate the assembly, so that other people can later participate in maintenance.
(also, is it possible that the original arm64 assembly was simply out of sync with the current C source?)

The pure go impl already does it!

Nice, so you just need to port it to C now :-)

The assembly wasn't generated by hand, but this fix was absolutely written by hand (me) by way of looking at the code, the generated assembly, and stepping through the instructions to figure out what was happening incorrectly.

This is cool!

Might be bug from asm2plan9s. Or errors from manually converting/formatting from normal arm assembly to plan9.
Not sure the best approach to convert normal C/assembly to plan9 assembly. asm2plan9s and c2goasm are archived. Looks we should avoid them.

For this pext like function, the go version (lookup table) should perform better than this one. Maybe we can just delete this assembly implementation?

@cyb70289 I'd be fine with removing the arm assembly implementation of the pext function. I haven't benchmarked the lookup table vs the ARM assembly version yet. but I agree the lookup table would likely be faster.

The other issue i'm having is the bit packing side of things, though I think i may have figured out a solution to it... and i've learned more about ARM64 assembly in the last few days than i ever intended on learning LOL as I've needed to do a lot of handcoding assembly to get this to work. Looks like the issue boils down to needing anything with a label to be explicit in the assembly rather than using the WORD/BYTE syntax to pass them to the processor stream so that the labels get translated to the correct PC locations.

There should be a well-defined way to generate the assembly, so that other people can later participate in maintenance.

I agree! I was a little dissappointed that @guyuqi didn't add explicit steps to how he generated the arm64 assembly but wanted to benefit from it, so that was my mistake to accept that PR so long ago. Currently I'm trying to explicitly define (and automate as much as possible) with the generation based on when i figure out what works and solves this issue.

Wrote a benchmark, the golang version is indeed faster than assembly. Tested on Neoverse-N1.

$ go test -bench=. goos: linux goarch: arm64 pkg: bmi BenchmarkGo-80 663 1759920 ns/op BenchmarkAsm-80 435 2720235 ns/op PASS ok bmi 2.838s

well that settles that :) i'll update this and remove the asm version for arm 😄 Thanks!

removed the ARM asm version so that it just uses the pure go lookup-table impl

zeroshade · 2023-09-21T22:31:23Z

@cyb70289 @pitrou holy crap i got it to work! 😁 now i just gotta document what i did before i forget so i never have to do this again lol

zeroshade · 2023-09-22T16:46:40Z

@cyb70289 @pitrou this is ready for another look over and review now! thanks. Given the complexity i'm going to punt creating actual automation for the ARM assembly generation to follow-up PR instead

cyb70289 · 2023-09-25T07:14:35Z

go/parquet/internal/bmi/bmi_arm64.go

cpu.ARM64.HasASIMD is false on some arm64 platform?
It does work correctly on my test machine (ubunt22.05, aarch64, golang1.21).

From what I could tell in my testing, it appears that on golang1.20 and below, it was ending up false on arm64 platforms when it shouldn't. Try testing it on golang1.20 instead of 1.21 and let me know, because that's what I was observing in my tests (and explains why we never caught this issue before now)

This is strange. I tested earlier golang versions. Also tried x/sys/cpu v0.5.0 (used by arrow) and latest 0.12.0. All tests are okay.
Curious what might be the reason. If possible, can your share info about the test env? E.g.. tested in arm64 baremetal or virtual machine, linux kernel version, output of "lscpu". Thanks.

It was a virtual machine launched via EC2 on AWS as a Debian 12 arm64 machine. That said, the original issue that launched this investigation was a mac M1 which appeared to show the same situation that using go1.20 acted as per x/sys/cpu returning false for HasASIMD while using go1.21 got true and then hit this issue and failed horribly.

I'll launch a new one when i get a chance and paste the output of lscpu here later on

@cyb70289 Here is the output of lscpu on the AWS EC2 instance that I've been using for testing:

admin@ip-172-31-78-97:~/arrow/go/parquet/internal/utils$ lscpu Architecture: aarch64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 1 On-line CPU(s) list: 0 Vendor ID: ARM Model name: Neoverse-N1 Model: 1 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 1 Stepping: r3p1 BogoMIPS: 243.75 Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs Caches (sum of all): L1d: 64 KiB (1 instance) L1i: 64 KiB (1 instance) L2: 1 MiB (1 instance) L3: 32 MiB (1 instance) NUMA: NUMA node(s): 1 NUMA node0 CPU(s): 0 Vulnerabilities: Gather data sampling: Not affected Itlb multihit: Not affected L1tf: Not affected Mds: Not affected Meltdown: Not affected Mmio stale data: Not affected Retbleed: Not affected Spec rstack overflow: Not affected Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Spectre v1: Mitigation; __user pointer sanitization Spectre v2: Mitigation; CSV2, BHB Srbds: Not affected Tsx async abort: Not affected

pitrou · 2023-09-25T08:55:00Z

go/parquet/internal/utils/clib.go

I'm curious, what is this and what is it good for? Doesn't Go provide memcpy/memset replacements?

this exists purely to test the assembly versions I provided.

The issue is that the C implementation I'm using to generate the assembly makes calls to memset and memcpy, but C's calling conventions are different than Go's. So in the assembly when it calls memset/memcpy I replaced those CALL instructions to call these replacement versions (because I can't call out to libc from inside of Go like that, no guarantee that the lib is loaded). Which are essentially just ported versions of the C memcpy/memset to ensure that the Calling conventions and argument handling are the same.

The underscore prefix there ensures that the methods are not exported from the package, so I'm only using these for clib_test.go in order to sanity check the assembly methods.

Yuck, is there a way to avoid the calls to memset and memcpy?

Even the original C++ implementations (bpacking_simd128_generated.h) use memset and memcpy. There might be ways to avoid those calls but they'd likely be much less efficient. We also already use amd64 versions of these anyways, so it's good to have arm64 implementations.

By the looks of it, the calls to memcpy/memset could easily be replaced by the equivalent for loop. Given the size is small and constant, I don't think calling a generic version of memcpy/memset would be a significant win.

I'll give it a try

Latest update has removed the memset/memcpy and replaced them with simple fixed size loops. My local testing showed all the tests passing, so assuming the CI is happy with things then we should be good here if you're happy with the results @pitrou. Let me know please when you get a chance.

pitrou · 2023-09-25T08:56:31Z

Regardless of automation, there should be READMEs explaining the rationale and motivations, and also how to generate those files anew.

zeroshade · 2023-09-26T15:50:40Z

@pitrou fair point, I'll update / write the README stuff for this today.

Co-authored-by: Antoine Pitrou <[email protected]>

zeroshade · 2023-10-02T23:06:18Z

@tschaub Can you please try testing with the latest version of this branch/PR and confirm for me that everything is still working on your end and there's no performance degradation vs the previous fix? I agree with @pitrou's statement that it's unlikely the generic memset/memcpy were giving us much of a benefit over a simple fixed length loop, but it'd be nice to confirm that there's no negative effects

tschaub · 2023-10-03T05:15:24Z

@zeroshade - I only tested a couple different cases, but I don't see any negative effect with the latest changes.

Here are some benchmarks transforming a small.parquet file (2400 rows, 20 physical columns, uncompressed):

Benchmark 1: ./gpq-noasm convert small.parquet small-noasm.parquet
  Time (mean ± σ):      72.2 ms ±   2.2 ms    [User: 103.9 ms, System: 25.2 ms]
  Range (min … max):    68.0 ms …  77.4 ms    41 runs
 
Benchmark 2: ./gpq-before convert small.parquet small-before.parquet
  Time (mean ± σ):      68.6 ms ±   2.2 ms    [User: 99.9 ms, System: 25.1 ms]
  Range (min … max):    63.1 ms …  74.9 ms    42 runs
 
Benchmark 3: ./gpq-after convert small.parquet small-after.parquet
  Time (mean ± σ):      68.5 ms ±   2.6 ms    [User: 99.0 ms, System: 24.5 ms]
  Range (min … max):    63.5 ms …  73.5 ms    40 runs
 
Summary
  ./gpq-after convert small.parquet small-after.parquet ran
    1.00 ± 0.05 times faster than ./gpq-before convert small.parquet small-before.parquet
    1.05 ± 0.05 times faster than ./gpq-noasm convert small.parquet small-noasm.parquet

And here are some numbers from a medium.parquet file (162709 rows, 20 physical columns, uncompressed):

Benchmark 1: ./gpq-noasm convert medium.parquet medium-noasm.parquet
  Time (mean ± σ):     986.3 ms ±  12.3 ms    [User: 1456.6 ms, System: 281.5 ms]
  Range (min … max):   965.2 ms … 1007.7 ms    10 runs
 
Benchmark 2: ./gpq-before convert medium.parquet medium-before.parquet
  Time (mean ± σ):     878.9 ms ±   7.2 ms    [User: 1343.1 ms, System: 269.5 ms]
  Range (min … max):   865.4 ms … 888.1 ms    10 runs
 
Benchmark 3: ./gpq-after convert medium.parquet medium-after.parquet
  Time (mean ± σ):     879.1 ms ±   9.7 ms    [User: 1341.6 ms, System: 268.6 ms]
  Range (min … max):   858.7 ms … 891.8 ms    10 runs
 
Summary
  ./gpq-before convert medium.parquet medium-before.parquet ran
    1.00 ± 0.01 times faster than ./gpq-after convert medium.parquet medium-after.parquet
    1.12 ± 0.02 times faster than ./gpq-noasm convert medium.parquet medium-noasm.parquet

So no significant difference in those between the before (c9693c5) and after (9d4c29b) cases. The other case is a build with -tags noasm.

pitrou

Mostly LGTM.

pitrou · 2023-10-03T08:31:52Z

go/parquet/internal/utils/clib_amd64.s

 	RET

+// func _ClibMemcpy(dst, src unsafe.Pointer, n uint) unsafe.Pointer
+TEXT ·_ClibMemcpy(SB), NOSPLIT|NOFRAME, $16-24


Hmm, so is this file still useful?

I didn't make any changes to the amd64 implementation which still uses memcpy. Since there's no issues or problems with the amd64 implementation, I didn't want to change or remove anything with it. I just left the addition the Go function definition there for easier testing in the future if we need it.

Ah, I see. Too bad, perhaps you can create a separate GH issue for it?

zeroshade · 2023-10-03T15:20:10Z

Thanks for checking @tschaub! looks great

pitrou

LGTM, but it would be nice to have @cyb70289 's approval as wel

zeroshade · 2023-10-04T16:37:28Z

Thanks @pitrou I'll hold off on merging till the end of the week to give @cyb70289 a chance to approve

cyb70289

LGTM, thanks @zeroshade !

conbench-apache-arrow · 2023-10-06T03:39:26Z

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit f7530de.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.

…its (apache#37785) ### Rationale for this change The linked issue discovered a problem with the ARM64 implementation of the extractBits functionality which was being hidden because until `go1.21`, it looks like `golang.org/x/sys/cpu` wasn't properly detecting the ASIMD bit flag on the ARM processors and so it was using the pure go implementation. So we fix the assembly and add a test which was able to reproduce the failure without the fix on an ARM64 machine. ### What changes are included in this PR? There was a problem with the assembly as it existed as the short circuit case where the selection bitmap is 0 wasn't actually setting the 0 value back onto the stack as a return. The assembly which initialized that register to 0 only occurred AFTER the `CBZ` instruction which would jump to the bottom and return. This means that we *actually* returned whatever happened to be on the stack at that location (i.e. garbage!). This only occurred if you were using the ARM64-specific assembly implementation, not the amd64 or pure go versions. We also include a test to ensure we get the correct values for a variety of bitmaps and selections, along with this ensuring there's enough data on the stack so that if this problem re-occured we would catch it in this test. ### Are these changes tested? Test is added. * Closes: apache#37712 Lead-authored-by: Matt Topol <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Matt Topol <[email protected]>

zeroshade requested review from cyb70289 and lidavidm September 18, 2023 21:40

github-actions bot added Component: Go awaiting committer review Awaiting committer review labels Sep 18, 2023

pitrou reviewed Sep 19, 2023

View reviewed changes

go/parquet/internal/bmi/bmi_arm64.go Outdated Show resolved Hide resolved

pitrou reviewed Sep 19, 2023

View reviewed changes

github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes awaiting change review Awaiting change review labels Sep 20, 2023

github-actions bot added awaiting changes Awaiting changes awaiting change review Awaiting change review and removed awaiting change review Awaiting change review awaiting changes Awaiting changes labels Sep 22, 2023

cyb70289 reviewed Sep 25, 2023

View reviewed changes

pitrou reviewed Sep 25, 2023

View reviewed changes

tschaub mentioned this pull request Oct 2, 2023

Use patch for ARM64 assembly bitmap extraction planetlabs/gpq#81

Merged

zeroshade and others added 2 commits October 2, 2023 12:41

Update go/parquet/internal/utils/_lib/README.md

9323b56

Co-authored-by: Antoine Pitrou <[email protected]>

Update go/parquet/internal/utils/_lib/README.md

07d8364

Co-authored-by: Antoine Pitrou <[email protected]>

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 2, 2023

remove memcpy/memset

9291e08

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 2, 2023

add license comments sed file

9d4c29b

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 2, 2023

pitrou reviewed Oct 3, 2023

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Oct 3, 2023

zeroshade requested review from cyb70289 and pitrou October 3, 2023 17:37

pitrou approved these changes Oct 4, 2023

View reviewed changes

cyb70289 approved these changes Oct 5, 2023

View reviewed changes

zeroshade merged commit f7530de into apache:main Oct 5, 2023

zeroshade removed the awaiting changes Awaiting changes label Oct 5, 2023

zeroshade deleted the arm64-assembly-funtimes branch October 5, 2023 14:26

tschaub mentioned this pull request Oct 28, 2023

Updated dependencies planetlabs/gpq#106

Merged

Conversation

zeroshade commented Sep 18, 2023 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Uh oh!

github-actions bot commented Sep 18, 2023

Uh oh!

zeroshade commented Sep 18, 2023

Uh oh!

zeroshade commented Sep 18, 2023

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeroshade Sep 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeroshade commented Sep 21, 2023

Uh oh!

zeroshade commented Sep 22, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zeroshade Sep 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pitrou commented Sep 25, 2023

Uh oh!

zeroshade commented Sep 26, 2023

Uh oh!

zeroshade commented Oct 2, 2023

Uh oh!

zeroshade commented Sep 18, 2023 •

edited by github-actions bot

Loading

zeroshade Sep 21, 2023 •

edited

Loading

zeroshade Sep 28, 2023 •

edited

Loading