TopK operator (i.e. CometTakeOrderedAndProjectExec) may return incorrect result #1030

@viirya

Description

Describe the bug

We recently found an interesting bug: in some cases, Dataset.show and Dataset.collectAsList return different results.
We investigated and found that it is caused by the implementation of take_bytes.

In these cases, Comet reads a dictionary array of strings and unpacks it into a string array. When a query uses the TopK operator, the operator stores its input arrays in an internal store and emits output only after all inputs are consumed. In Comet, the output arrays from the scan reuse the same buffers across batches, so for operators that cache input arrays, Comet makes a deep copy of these arrays.
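To make the buffer-reuse hazard concrete, here is a minimal sketch in plain Rust (std only). It is not Comet or arrow-rs code: `ReusingScan` and `cache_first_batch` are hypothetical names, and an `Rc<RefCell<Vec<i32>>>` stands in for a reused Arrow buffer. It only illustrates why an operator that caches batches needs an owned copy.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a scan that reuses one buffer for every batch.
struct ReusingScan {
    buffer: Rc<RefCell<Vec<i32>>>,
}

impl ReusingScan {
    fn new() -> Self {
        ReusingScan { buffer: Rc::new(RefCell::new(Vec::new())) }
    }

    /// Overwrites the shared buffer in place and hands out another handle to it.
    fn next_batch(&mut self, values: &[i32]) -> Rc<RefCell<Vec<i32>>> {
        let mut buf = self.buffer.borrow_mut();
        buf.clear();
        buf.extend_from_slice(values);
        drop(buf);
        Rc::clone(&self.buffer)
    }
}

/// Returns what a handle-caching operator and a deep-copying operator
/// each hold for the first batch once the second batch has been read.
fn cache_first_batch() -> (Vec<i32>, Vec<i32>) {
    let mut scan = ReusingScan::new();
    let first = scan.next_batch(&[1, 2, 3]);

    let shallow = Rc::clone(&first); // caches only a handle to the shared buffer
    let deep: Vec<i32> = first.borrow().to_vec(); // caches an owned copy

    // The scan moves on and overwrites the same buffer.
    scan.next_batch(&[7, 8, 9]);

    let shallow_view = shallow.borrow().to_vec();
    (shallow_view, deep)
}
```

Calling `cache_first_batch()` shows the handle-only cache now holding the second batch's values, while the deep copy still holds the first batch.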

However, when the dictionary array is unpacked into a string array by calling take_bytes, if the indices array has no nulls, the take_bytes kernel simply takes a full slice of the indices' null buffer (i.e., reuses it) as the null buffer of the output array. So in the next batch, once the null buffer is updated (because Comet reuses the underlying buffer), the array stored in the TopK operator changes as well. This makes the query result nondeterministic.
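The aliasing described above can be modelled with a short sketch. This is a simplified illustration, not the arrow-rs implementation: `DictIndices`, `StringArray`, and `unpack_sharing_validity` are hypothetical names, and a shared `Vec<bool>` stands in for Arrow's bit-packed null buffer.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Stand-in for a reference-counted null (validity) buffer; true = valid.
type SharedValidity = Rc<RefCell<Vec<bool>>>;

/// Hypothetical dictionary indices whose validity lives in a reused buffer.
struct DictIndices {
    keys: Vec<usize>,
    validity: SharedValidity,
}

/// Hypothetical unpacked string array.
struct StringArray {
    values: Vec<String>,
    validity: SharedValidity,
}

/// Models the problematic path: the values are freshly built, but the
/// output's validity is just another handle to the indices' buffer.
fn unpack_sharing_validity(dict: &[&str], indices: &DictIndices) -> StringArray {
    StringArray {
        values: indices.keys.iter().map(|&k| dict[k].to_string()).collect(),
        validity: Rc::clone(&indices.validity),
    }
}

/// Unpacks one batch, stores it, then lets the "scan" overwrite the
/// validity buffer for the next batch. Returns the stored array's contents.
fn stored_validity_after_next_batch() -> (Vec<String>, Vec<bool>) {
    let dict = ["x", "y"];
    let scan_validity: SharedValidity = Rc::new(RefCell::new(vec![true, true]));

    let batch1 = DictIndices { keys: vec![0, 1], validity: Rc::clone(&scan_validity) };
    let stored = unpack_sharing_validity(&dict, &batch1);

    // Next batch: the scan reuses the buffer and marks slot 1 as null.
    scan_validity.borrow_mut()[1] = false;

    let validity = stored.validity.borrow().to_vec();
    (stored.values, validity)
}
```

In this model the stored values are unchanged, but the stored validity now reports slot 1 as null even though the first batch had no nulls, which is the kind of silent change the TopK store would observe.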

Given the semantics of the take kernel, its output array should not reuse the input array's buffers. The current behavior looks incorrect.
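As a rough sketch of the expected semantics (again a model, not the arrow-rs kernel; `take_owned` and `take_then_overwrite_input` are hypothetical names), a take whose output is built entirely from scratch is unaffected by later changes to its input:

```rust
/// Hypothetical model of `take`: every output slot, including its
/// null-ness, is freshly allocated and owned by the result.
fn take_owned(dict: &[&str], indices: &[Option<usize>]) -> Vec<Option<String>> {
    indices
        .iter()
        .map(|idx| idx.map(|i| dict[i].to_string()))
        .collect()
}

/// Takes once, then overwrites the indices as a reused buffer would be,
/// and returns the previously stored output.
fn take_then_overwrite_input() -> Vec<Option<String>> {
    let dict = ["a", "b", "c"];
    let mut indices = vec![Some(2), Some(0), Some(1)];

    let stored = take_owned(&dict, &indices);

    // Simulate the next batch overwriting the same index storage.
    indices[0] = None;
    indices[1] = Some(2);

    stored
}
```

Here the stored result still reflects the original indices after they are overwritten, because it shares no storage with them.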

We are going to fix it in arrow-rs: apache/arrow-rs#6617

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

Metadata

Labels

bug: Something isn't working
