[C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key #42231

@FreekPaans

Describe the bug, including details regarding any error messages, version, and platform.

Originally posted here

I'm doing a simple group_by/aggregate on multiple keys, one of which contains null values. This sometimes produces multiple result rows with identical values for the group_by keys, which I don't expect. Tested on pyarrow-16.1.0.

Repro case:

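import pyarrow as pa

def try_repro(size):
    # Key "a" is constant and key "g" is entirely null, so all rows belong to one group.
    repro = pa.table({"a": [0] * size,
                      "g": [None] * size},
                     schema=pa.schema([pa.field("a", "uint8"),
                                       pa.field("g", "date32")]))\
              .group_by(["a", "g"]).aggregate([([], "count_all")])

    # Report every size for which more than one group comes back.
    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro

# Try table sizes 1 through 49.
for i in range(1, 50):
    r = try_repro(i)

# Show the result for the last size tried (49).
print()
print(r)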

Output without AVX2 (expected):

$ ARROW_USER_SIMD_LEVEL=AVX python repro.py

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]

Output with AVX2 (not expected):

$ ARROW_USER_SIMD_LEVEL=AVX2 python repro.py
33 => 2
...
40 => 2
41 => 3
...
48 => 3
49 => 4

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0,0,0,0]]
g: [[null,null,null,null]]
count_all: [[32,8,8,1]]
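
To make the symptom easy to check for, here is a minimal sketch that looks for group keys occurring in more than one result row. It works on a plain dict of columns, the shape that `Table.to_pydict()` returns, so it does not need pyarrow itself. The helper name `duplicated_group_keys` is hypothetical, and the data is copied from the AVX2 output above.

```python
from collections import Counter

def duplicated_group_keys(columns, keys):
    """Return {key_tuple: occurrences} for group keys found in more than one row.

    Hypothetical helper: `columns` maps column names to lists of values,
    `keys` lists the group_by columns.
    """
    rows = zip(*(columns[k] for k in keys))
    counts = Counter(rows)
    return {key: n for key, n in counts.items() if n > 1}

# Values copied from the AVX2 output above.
avx2_result = {
    "a": [0, 0, 0, 0],
    "g": [None, None, None, None],
    "count_all": [32, 8, 8, 1],
}

dupes = duplicated_group_keys(avx2_result, ["a", "g"])
total = sum(avx2_result["count_all"])
print(dupes)   # the key (0, None) appears in 4 rows
print(total)   # the partial counts still add up to the 49 input rows
```

A correct result yields an empty dict here. In the AVX2 output the partial counts sum to 49, so the rows are not lost, only split across several groups that share the same key.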

Some observations:

  • Grouping on only g doesn't have the problem.
  • Swapping the order of a and g in the group_by also removes the issue.
  • The problem starts as soon as the table size reaches 33, and one extra group appears for every 8 additional rows (at 33, 41, and 49).
  • With g as an int the problem does not occur; with a float it does.
  • Non-null values don't have the issue.
  • A MacBook Pro M2 is also fine.
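
The size pattern in the observations can be written down as a small formula. The sketch below only models the numbers reported above (one group up to 32 rows, then one more group per started block of 8 rows); it is not a description of what Arrow does internally. The function names and the `first_chunk` and `step` parameters are made up for illustration, and the per-group split is confirmed by the output only for size 49.

```python
import math

def observed_group_count(size, first_chunk=32, step=8):
    """Number of result rows the AVX2 run reported for `size` input rows."""
    if size <= first_chunk:
        return 1
    return 1 + math.ceil((size - first_chunk) / step)

def observed_counts(size, first_chunk=32, step=8):
    """Per-group count_all values implied by the pattern (confirmed only for 49)."""
    if size <= first_chunk:
        return [size]
    counts = [first_chunk]
    remaining = size - first_chunk
    while remaining > 0:
        chunk = min(step, remaining)
        counts.append(chunk)
        remaining -= chunk
    return counts

# Sizes taken from the AVX2 output: 33 and 40 give 2 rows, 41 and 48 give 3, 49 gives 4.
for size in (32, 33, 40, 41, 48, 49):
    print(size, observed_group_count(size))

print(observed_counts(49))  # matches the reported [32, 8, 8, 1]
```

If the pattern holds, the extra groups line up with blocks of 32 and then 8 rows, which may help whoever looks into the AVX2 code path narrow down where the keys stop comparing as equal.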

Component(s)

Python
