Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
Originally posted here
I'm doing a simple group_by/aggregate on multiple keys, one of which contains null values. This sometimes produces multiple result rows with the same values for the group_by keys, which I don't expect. Tested on pyarrow 16.1.0.
Repro case:
import pyarrow as pa

def try_repro(size):
    repro = pa.table({"a": [0] * size,
                      "g": [None] * size},
                     schema=pa.schema([pa.field("a", "uint8"),
                                       pa.field("g", "date32")]))\
        .group_by(["a", "g"]).aggregate([([], "count_all")])
    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro

for i in range(1, 50):
    r = try_repro(i)
print()
print(r)
Output without AVX2 (expected):
$ ARROW_USER_SIMD_LEVEL=AVX python repro.py
pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]
Output with AVX2 (not expected):
$ ARROW_USER_SIMD_LEVEL=AVX2 python repro.py
33 => 2
...
40 => 2
41 => 3
...
48 => 3
49 => 4
pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0,0,0,0]]
g: [[null,null,null,null]]
count_all: [[32,8,8,1]]
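To make the duplication explicit, here is a minimal pure-Python sketch that re-merges the split rows. It is not a pyarrow API and not a fix: `merge_split_groups` is a hypothetical helper, and the `result` dict is hand-copied from the AVX2 output above, in the shape that `Table.to_pydict()` would return.

```python
from collections import defaultdict

# Column data copied by hand from the AVX2 output above.
result = {
    "a": [0, 0, 0, 0],
    "g": [None, None, None, None],
    "count_all": [32, 8, 8, 1],
}

def merge_split_groups(columns, keys, count_col):
    """Sum count_col over all rows that share the same key tuple."""
    merged = defaultdict(int)
    for row in zip(*(columns[k] for k in keys), columns[count_col]):
        *key, count = row
        merged[tuple(key)] += count
    return dict(merged)

merged = merge_split_groups(result, ["a", "g"], "count_all")
print(merged)  # {(0, None): 49}
```

All four rows collapse onto the single key (0, None) with a total of 49, which matches the non-AVX2 result.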
Some observations:
- Grouping on only g doesn't have the problem
- Swapping the order of a and g in the group_by also removes the issue.
- Looks like this starts happening as soon as the size of the tables hits 33, and then we get an extra group for every 8 rows we add (so at 33, 41, 49)
- Having g be an int does not exhibit the problem, a float does.
- Non-null values don't have the issue
- Macbook Pro M2 is also fine
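The 33/41/49 pattern can be written down as a small model. This is only a sketch fitted to the output above for sizes 1 to 49; `observed_group_sizes` and its `first`/`step` parameters are hypothetical names, and nothing here is derived from the Arrow implementation.

```python
def observed_group_sizes(n_rows, first=32, step=8):
    """Row counts per output group, as seen in the AVX2 run above.

    Assumption: the first `first` rows land in one group and every further
    block of up to `step` rows forms a new one. Fitted to sizes 1..49 only.
    """
    sizes = [min(n_rows, first)]
    remaining = n_rows - sizes[0]
    while remaining > 0:
        sizes.append(min(remaining, step))
        remaining -= sizes[-1]
    return sizes

for n in (32, 33, 40, 41, 48, 49):
    print(n, "=>", len(observed_group_sizes(n)))
print(observed_group_sizes(49))  # [32, 8, 8, 1]
```

The model reproduces both the group counts printed by the repro (2 at 33 to 40, 3 at 41 to 48, 4 at 49) and the count_all values for size 49.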
Component(s)
Python