Description
Is your feature request related to a problem or challenge?
After #7721, a SortExec with a limit will use a special TopK. We have basic unit tests, but I think the coverage could be improved, specifically with fuzz testing.
Describe the solution you'd like
What I would like is a new fuzz test added to the existing fuzz cases: https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/fuzz_cases
The structure of SortTest in https://github.com/apache/arrow-datafusion/blob/e95a24b5a260e0e2f603d52682d36cce192676f8/datafusion/core/tests/fuzz_cases/sort_fuzz.rs#L111 might be a good one to follow
The basic outline would be:
- Create an input with several columns (integers, strings, floats)
- Reorder the input randomly
- Divide the input up into multiple batches using make_staggered_batches
- Run a query like SELECT * FROM t ORDER BY <col(s)> LIMIT <N> and collect the output
- Compute the expected result programmatically (e.g. by sorting the data, prior to creating RecordBatches)
- Ensure the output matches the expected result
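The outline above can be sketched in plain Rust using only the standard library. This is an illustration of the oracle-comparison idea, not the actual DataFusion test: `Row`, `Lcg`, `make_shuffled_input`, `split_into_batches`, `top_k_heap`, and `expected_top_k` are all hypothetical names. `top_k_heap` merely stands in for the system under test; in a real fuzz case it would presumably be replaced by running the SQL query over batches produced by make_staggered_batches.

```rust
use std::collections::BinaryHeap;

/// Hypothetical row type standing in for a multi-column input.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Row {
    key: i64,
    name: String,
}

/// Tiny deterministic LCG so the sketch needs no external `rand` crate.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Build `len` rows with distinct keys, then shuffle them (Fisher-Yates).
fn make_shuffled_input(len: usize, seed: u64) -> Vec<Row> {
    let mut rows: Vec<Row> = (0..len as i64)
        .map(|i| Row { key: i, name: format!("row-{}", i) })
        .collect();
    let mut rng = Lcg(seed);
    for i in (1..rows.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        rows.swap(i, j);
    }
    rows
}

/// Split rows into batches of uneven size (a simplified stand-in for staggering).
fn split_into_batches(rows: &[Row], seed: u64) -> Vec<Vec<Row>> {
    let mut rng = Lcg(seed);
    let mut batches = Vec::new();
    let mut rest = rows;
    while !rest.is_empty() {
        // Between 1 and 50 rows, never more than what is left.
        let take = 1 + (rng.next_u64() as usize % 50).min(rest.len() - 1);
        let (head, tail) = rest.split_at(take);
        batches.push(head.to_vec());
        rest = tail;
    }
    batches
}

/// Oracle: full sort, then keep the first `n` rows.
fn expected_top_k(rows: &[Row], n: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort();
    sorted.truncate(n);
    sorted
}

/// Stand-in for the system under test: a bounded max-heap fed batch by batch.
fn top_k_heap(batches: &[Vec<Row>], n: usize) -> Vec<Row> {
    let mut heap: BinaryHeap<Row> = BinaryHeap::new();
    for batch in batches {
        for row in batch {
            heap.push(row.clone());
            if heap.len() > n {
                heap.pop(); // drop the current largest row
            }
        }
    }
    heap.into_sorted_vec() // ascending order
}
```

A test would then assert `top_k_heap(&batches, n) == expected_top_k(&rows, n)` for each limit. The sketch uses distinct keys on purpose: if sort keys can tie, more than one output is valid for a given limit, so the comparison would need to account for that.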
Input size: 1000 rows
Parameters to vary
- sort cols: (int), (string), (float), (int, string), (string, int), etc.
- N: 1, 10, 100, 300 (i.e. how many rows are kept)
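One possible way to drive these parameters is to enumerate the grid of (sort columns, N) pairs and compute the expected rows with a comparator built from the column list. The sketch below is standard-library Rust only; `Row`, `SortCol`, `compare_rows`, `expected_top_n`, and `parameter_grid` are hypothetical names. Using `f64::total_cmp` for floats is an assumption of this sketch, so NaN and null ordering would need to be checked against what the query engine actually does.

```rust
use std::cmp::Ordering;

/// Hypothetical row with one column per type mentioned above.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    i: i64,
    s: String,
    f: f64,
}

/// Which column a sort key refers to (hypothetical; extend for new types).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SortCol {
    Int,
    Str,
    Float,
}

/// Compare two rows lexicographically over the given sort columns.
fn compare_rows(a: &Row, b: &Row, cols: &[SortCol]) -> Ordering {
    for col in cols {
        let ord = match col {
            SortCol::Int => a.i.cmp(&b.i),
            SortCol::Str => a.s.cmp(&b.s),
            // Assumption: total ordering on floats; no NaN handling is modeled.
            SortCol::Float => a.f.total_cmp(&b.f),
        };
        if ord != Ordering::Equal {
            return ord;
        }
    }
    Ordering::Equal
}

/// Expected result of ORDER BY <cols> LIMIT <n>: sort everything, keep `n` rows.
fn expected_top_n(rows: &[Row], cols: &[SortCol], n: usize) -> Vec<Row> {
    let mut sorted = rows.to_vec();
    sorted.sort_by(|a, b| compare_rows(a, b, cols));
    sorted.truncate(n);
    sorted
}

/// Every (sort columns, N) combination from the lists above.
fn parameter_grid() -> Vec<(Vec<SortCol>, usize)> {
    use SortCol::*;
    let col_sets = vec![
        vec![Int],
        vec![Str],
        vec![Float],
        vec![Int, Str],
        vec![Str, Int],
    ];
    let limits = [1usize, 10, 100, 300];
    let mut grid = Vec::new();
    for cols in &col_sets {
        for &n in &limits {
            grid.push((cols.clone(), n));
        }
    }
    grid
}
```

Keeping the column choice in an enum means a new type (for example a dictionary-encoded string) would need one new variant and one new match arm.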
Bonus points
Make it easy to add new columns / types (e.g. a string dictionary).
Describe alternatives you've considered
No response
Additional context
No response