-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In DataFusion we are working on a Merge operator that needs to compare multiple sorted streams of record batches, merge them together, and produce a single sorted output stream (see more context in apache/datafusion#722)
The current interface for comparing arrays in Arrow is getting a DynComparator
Which is a function that you give two indexes and get the result. You effectively have to do
let cmp = build_compare(array1, array2);
...
if cmp(index1, index2) == OrderingEqual:: {
}This has two problems for the merge operator:
- it has to keep track of different comparators for each pairs of arrays that are used
- (unobviously) by memoizing the comparators it will prevent the underlying array memory from being freed (as each comparator has a reference to the underlying array data)
Describe the solution you'd like
I would ideally like a comparator that does not have the array bound and instead was passed array indexes. Something like
let cmp = build_compare(array1.data_type(), array2.data_type());
...
if cmp(&array1, index1, &array2, index2) == OrderingEqual:: {
}We can probably also keep the existing interface
Describe alternatives you've considered
None
Additional context
See discussion with @Dandandan and @e-dard here: apache/datafusion#722 (comment)