Support Configuring Parquet Column Specific Options via SQL Statement Options#7466
Conversation
alamb
left a comment
There was a problem hiding this comment.
This looks great to me -- thank you @devinjdangelo 🙏
The only other thing I think is needed for this PR is to an end to end test showing this syntax being used (perhaps in copy.slt)
Also, we should add documentation for this syntax to dml.md, but I think that can be done in a follow on PR (I will file a ticket to track it).
Also again sorry for the delay in reviews
Np, hope you had a good vacation! 🌴 I just pushed up an expanded test in copy.slt that uses all available options. COPY source_table
TO 'test_files/scratch/copy/table_with_options'
(format parquet,
single_file_output false,
compression snappy,
'compression::col1' 'zstd(5)',
'compression::col2' snappy,
max_row_group_size 12345,
data_pagesize_limit 1234,
write_batch_size 1234,
writer_version 2.0,
dictionary_page_size_limit 123,
created_by 'DF copy.slt',
column_index_truncate_length 123,
data_page_row_count_limit 1234,
bloom_filter_enabled true,
'bloom_filter_enabled::col1' false,
'bloom_filter_fpp::col2' 0.456,
'bloom_filter_ndv::col2' 456,
encoding plain,
'encoding::col1' DELTA_BINARY_PACKED,
'dictionary_enabled::col2' true,
dictionary_enabled false,
statistics_enabled page,
'statistics_enabled::col2' none,
max_statistics_size 123,
bloom_filter_fpp 0.001,
bloom_filter_ndv 100
)The only thing that bothers me a bit with the syntax is that you have to put column level options inside single quotes (i.e. 'compression::col1') otherwise the :: generates a syntax error I think coming from sqlparser-rs. External error: query failed: DataFusion error: SQL error: ParserError("Expected string or numeric value, found: ::")Would there be an easy way to relax the syntax requirements to allow :: without a string literal quote? |
I am not sure -- we may have to look into updating sqlparser |
|
Filed #7499 to track documentation of this feature |
Which issue does this PR close?
Closes #7442
Closes #7463 (this PR implements a different syntax than 7463 which turned out much easier to implement. Leaving 7463 draft open to show as possible alternative)
Rationale for this change
Extends syntax and allowed options for SQL statement options to configure parquet column level options (e.g. different compression for each possibly nested column).
What changes are included in this PR?
Implements new parsing utils and options for parquet column specific options. Example:
The example defaults all columns to snappy compression and sets col1 to zstd level 5 and the nested column (col2.nested) to zstd level 10.
Using "::" to separate the setting from the column path is obviously rust syntax inspired. We could use any other separator here if there is a more natural one for SQL. Dots (.) would have been the natural choice, but those are being used for separating nested column paths here.
Are these changes tested?
Yes, added a new unit test to verify settings are actually set as expected.
Are there any user-facing changes?
New syntax for specifying column level options, but this PR is backward compatible/no breaking changes unlike 7463