Skip to content

The output array of take_bytes will be modified if indices null buffer is modified #6617

@viirya

Description

@viirya

Describe the bug

In Comet, we found an interesting bug recently. For some cases, Dataset.show and Dataset.collectAsList return different results.
We investigated the bug and found that it is due to the implementation oftake_bytes.

In the cases, Comet reads a dictionary array of string. It unpacks dictionary array to string array. In a query where TopK operator is used, the operator will store input arrays into internal store and emit after all inputs are consumed. In Comet, the output arrays from scan reuse same buffers across batches. For operators that cache input arrays, Comet will do deep copy on these arrays.

However, when unpacking dictionary array to string array by calling take_bytes , if the indices array has no null, take_bytes kernel simply takes a full slice of the null buffer of indices (i.e., reusing it) as the null buffer of output array. So in the next batch, once the null buffer is updated (as Comet reuses underlying buffer), the stored array in TopK operator is also changed. It makes the query result indeterministic.

Consider the semantics of take kernel, its output array should not reuse input array. The current behavior looks incorrect.

To Reproduce

Expected behavior

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions