Skip to content

Allowing setting sort order of parquet files without specifying the schema #7317

@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

This is a follow on to #7036

As @bmmeijers says in #7036, datafusion can make much better plans if you tell it about the sort order of files.

It is possible now to specify the order of a parquet file

$ datafusion-cli
DataFusion CLI v29.0.0
❯ create external table cpu(time timestamp) stored as parquet location 'cpu.parquet' with order (time desc);
0 rows in set. Query took 0.001 seconds.

❯ select * from cpu;
+---------------------+
| time                |
+---------------------+
| 2022-09-30T12:55:00 |
+---------------------+
1 row in set. Query took 0.003 seconds.

❯ explain select * from cpu order by time desc;
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: cpu.time DESC NULLS FIRST                                                                                             |
|               |   TableScan: cpu projection=[time]                                                                                          |
| physical_plan | ParquetExec: file_groups={1 group: [[Users/alamb/Downloads/cpu.parquet]]}, projection=[time], output_ordering=[time@0 DESC] |
|               |                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.001 seconds.

However, it is not possible to specify the time without also specifying all of the schema, which is redundant given the schema is stored in the parquet files:

❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time desc);
Error during planning: Provide a schema before specifying the order while creating a table.

Even though DataFusion can infer the schema automatically

❯ create external table cpu stored as parquet location 'cpu.parquet';
0 rows in set. Query took 0.002 seconds.

❯ select * from cpu;
+-----+---------------------+
| v   | time                |
+-----+---------------------+
| 1.0 | 2023-03-01T00:00:00 |
| 2.0 | 2023-03-02T00:00:00 |
+-----+---------------------+
2 rows in set. Query took 0.002 seconds.

Describe the solution you'd like

I would like to be able to specify the sort order for parquet files without also specifying the schema

Given this parquet file: cpu.zip

I would like this to work and produce a table both columns v and time ordered by time:

❯ create external table cpu stored as parquet location 'cpu.parquet' with order (time);
Error during planning: Provide a schema before specifying the order while creating a table.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions