-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Describe the bug
In Comet, we found an interesting bug recently. For some cases, Dataset.show and Dataset.collectAsList return different results.
We investigated the bug and found that it is due to the implementation oftake_bytes.
In the cases, Comet reads a dictionary array of string. It unpacks dictionary array to string array. In a query where TopK operator is used, the operator will store input arrays into internal store and emit after all inputs are consumed. In Comet, the output arrays from scan reuse same buffers across batches. For operators that cache input arrays, Comet will do deep copy on these arrays.
However, when unpacking dictionary array to string array by calling take_bytes , if the indices array has no null, take_bytes kernel simply takes a full slice of the null buffer of indices (i.e., reusing it) as the null buffer of output array. So in the next batch, once the null buffer is updated (as Comet reuses underlying buffer), the stored array in TopK operator is also changed. It makes the query result indeterministic.
Consider the semantics of take kernel, its output array should not reuse input array. The current behavior looks incorrect.
To Reproduce
Expected behavior
Additional context