Closed
Labels
enhancement (New feature or request), good first issue (Good for newcomers)
Description
Is your feature request related to a problem or challenge?
This is a follow-up to #7036.
As @bmmeijers says in #7036, DataFusion can make much better plans if you tell it about the sort order of files.
It is now possible to specify the sort order of a parquet file:
$ datafusion-cli
DataFusion CLI v29.0.0
❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
0 rows in set. Query took 0.001 seconds.
❯ select * from cpu;
+---------------------+
| time |
+---------------------+
| 2022-09-30T12:55:00 |
+---------------------+
1 row in set. Query took 0.003 seconds.
❯ explain select * from cpu order by time desc;
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Sort: cpu.time DESC NULLS FIRST |
| | TableScan: cpu projection=[time] |
| physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.001 seconds.
However, it is not possible to specify the sort order without also specifying the full schema, which is redundant given that the schema is stored in the parquet files:
❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
Error during planning: Provide a schema before specifying the order while creating a table.
This fails even though DataFusion can infer the schema automatically:
❯ create external table cpu stored as parquet location 'cpu.parquet';
0 rows in set. Query took 0.002 seconds.
❯ select * from cpu;
+-----+---------------------+
| v | time |
+-----+---------------------+
| 1.0 | 2023-03-01T00:00:00 |
| 2.0 | 2023-03-02T00:00:00 |
+-----+---------------------+
2 rows in set. Query took 0.002 seconds.
Describe the solution you'd like
I would like to be able to specify the sort order for parquet files without also specifying the schema
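For contrast, the form accepted today has to repeat the schema in the statement. A minimal sketch for the two-column file queried above, assuming `v` is a double and `time` is a plain timestamp (both types are guesses from the printed output, not read from the file's metadata):

```sql
-- Hypothetical workaround: the column types below are assumptions,
-- inferred from the query output rather than from the parquet metadata.
CREATE EXTERNAL TABLE cpu (
    v DOUBLE,          -- assumed, based on values such as 1.0
    time TIMESTAMP     -- assumed timestamp without a time zone
)
STORED AS PARQUET
LOCATION 'cpu.parquet'
WITH ORDER (time);
```

The column list here duplicates information the parquet file already carries, which is the redundancy this request aims to remove.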
Given this parquet file: cpu.zip
I would like this to work and produce a table with both columns, v and time, ordered by time:
❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.
Describe alternatives you've considered
No response
Additional context
No response