Allow adding user defined metadata to ParquetSink#10224
Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you @wiedld -- this looks great to me. I had some small documentation and test tweak suggestions, but I also think this PR could be merged as is and we could do them as a follow on PR
Just let me know what you prefer
cc @devinjdangelo and @metesynnada who I think have worked on this code recently
devinjdangelo
left a comment
There was a problem hiding this comment.
Thank you @wiedld the code looks good to me. The expected syntax of the metadata value isn't 100% clear to me, but this could just be down to the fact I've not worked with parquet kv metadata before.
| TO 'test_files/scratch/copy/table_with_metadata/' | ||
| STORED AS PARQUET | ||
| OPTIONS ( | ||
| 'format.metadata' 'key1:value1 key2:value2' |
There was a problem hiding this comment.
this is quite and I missed it the first time around -- nice eyes @devinjdangelo . @wiedld could you please add documentation to the TableParquetOptions that documents this behavior?
Specifically, I would be interested to know "what if you want to store metadata values that have spaces in them" (key1:my value with spaces)?
There was a problem hiding this comment.
BTW it would be fine if the answer is "you will get an error / is not supported yet" -- it might just be good to document that behavior
I could see wanting to support things like key1:"my awesome value" key2:"my other awesome value" (again not in this PR, but we should at least document how it works I think)
There was a problem hiding this comment.
Perhaps we could leverage a syntax like:
(
'format.metadata.key1' 'val1',
'format.metadata.key2' 'val2 with space',
...
)
to support values with spaces
There was a problem hiding this comment.
😆 You beat me to it.
I started with the internal double quotes approach, also considered escaped spaces; then I realized that these are introducing lexical rules which did not feel SQL appropriate. Landed on the approach suggested by @devinjdangelo , commit will be up shortly.
6d6e5ba to
b7808f1
Compare
… key parsing be a part of the format.metadata::key
b7808f1 to
5b06bf3
Compare
| # accepts multiple entries with the same key (will overwrite) | ||
| statement ok | ||
| COPY source_table | ||
| TO 'test_files/scratch/copy/table_with_metadata/' | ||
| STORED AS PARQUET | ||
| OPTIONS ( | ||
| 'format.metadata::key1' 'value', | ||
| 'format.metadata::key1' 'value' | ||
| ) |
There was a problem hiding this comment.
This overwriting is a common feature to all of the config OPTIONS(...). If we want to change, then I recommend a followup ticket.
There was a problem hiding this comment.
I agree there is no need to change it in this PR
alamb
left a comment
There was a problem hiding this comment.
Very nice -- thank you for changes @wiedld and thanks for the comments and suggestions @devinjdangelo . I think this PR looks very nice
| /// Multiple entries are permitted | ||
| /// ```sql | ||
| /// OPTIONS ( | ||
| /// 'format.metadata::key1' '', |
| .set("format.metadata::key3", "value with spaces ") | ||
| .unwrap(); | ||
| table_config | ||
| .set("format.metadata::key4", "value with special chars :: :") |
| # accepts multiple entries with the same key (will overwrite) | ||
| statement ok | ||
| COPY source_table | ||
| TO 'test_files/scratch/copy/table_with_metadata/' | ||
| STORED AS PARQUET | ||
| OPTIONS ( | ||
| 'format.metadata::key1' 'value', | ||
| 'format.metadata::key1' 'value' | ||
| ) |
There was a problem hiding this comment.
I agree there is no need to change it in this PR
ParquetSink
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
* chore: make explicit what ParquetWriterOptions are created from a subset of TableParquetOptions * refactor: restore the ability to add kv metadata into the generated file sink * test: demomnstrate API contract for metadata TableParquetOptions * chore: update code docs * fix: parse on proper delimiter, and improve tests * fix: enable any character in the metadata string value, by having any key parsing be a part of the format.metadata::key
Which issue does this PR close?
Closes #10223 .
Related to #9493
Rationale for this change
Restore the ability to set user-inserted metadata into the writer properties of ParquetSink, and to enable this as a SQL level API configurable.
What changes are included in this PR?
Broken up per commit:
New feature added UPDATED syntax:
Are these changes tested?
Yes. We have both integration tests for the TableParquetOptions API, as well as sqllogictests for the SQL level API.
Are there any user-facing changes?
Yes. The TableParquetOptions API is extended, as well as new SQL level API for these options.