Extract Parquet statistics from Interval column#10801
Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you @marvinlanhenke -- this looks very nice 👌 -- I left some comments but I think it is quite close.
Note that there are some conflicts, likely with #10711 which merged a few hours ago
I will file a ticket upstream in arrow-rs to support writing statistics for intervals (thank you for the code pointers)
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs
Outdated
Show resolved
Hide resolved
I filed apache/arrow-rs#5847. I didn't quite understand your suggestion about two tickets |
I was thinking about two tickets:
I think those are two distinct issues? |
|
@alamb thanks for the review. I have adressed your comments PTAL. |
Yes I think you are right -- any chance you can file a ticket in arrow-rs to track writing IntervalMonthDayNano? |
Sure, I can do that. I filed: apache/arrow-rs#5849 |
alamb
left a comment
There was a problem hiding this comment.
Looks good to me -- thank you very much @marvinlanhenke
comphead
left a comment
There was a problem hiding this comment.
lgtm @marvinlanhenke kudos for the test description
|
As per the format specification, it is incorrect to read statistics for interval columns, as there is no defined sort order - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#interval |
|
@alamb |
I think removing it would be fine @marvinlanhenke -- thank you |
* feat: add make_batch + basic test
Which issue does this PR close?
Closes #10752.
Rationale for this change
Since parquet does not support statistics for
Intervalcolumns, this PR only prepares test cases and some necessary setup.Once parquet supports statistics, a follow-up PR will be needed to finish the implementation.
What changes are included in this PR?
get_statisticmatch arm (stub)should_panicAre these changes tested?
Are there any user-facing changes?
No.