Enable truncation of binary statistics columns#5076
Conversation
tustvold
left a comment
There was a problem hiding this comment.
Thank you for this, I left some comments that I think might reduce the diff resulting from this
parquet/src/file/statistics.rs
Outdated
| is_max_value_exact: bool, | ||
| is_min_value_exact: bool, |
There was a problem hiding this comment.
Changing this function signature will result in some non-trivial code churn, what do you think of keeping this function as-is, defaulting the values to true and then adding two methods like
pub fn with_is_max_value_exact(self, exact: bool) -> Self {
...
}
pub fn with_is_min_value_exact(self, exact: bool) -> Self {
...
}
parquet/src/file/statistics.rs
Outdated
| }; | ||
|
|
||
| // Whether or not the min/max values are exact. Due to pre-existing truncation | ||
| // in other libraries such as parquet-mr, we can't assume that any given parquet file |
There was a problem hiding this comment.
Does parquet-mr truncate non-binary columns?
There was a problem hiding this comment.
No, parquet-mr only applies this to binary statistics: https://github.com/apache/parquet-mr/pull/696/files#diff-1afc9f89a782ddd4e7cd17546ca048954091627d7a31597ab88892eb2a7a76abR618
Pertaining to the conversation above as well - I could reduce churn by only allowing the setting of min/max exactness on the constructors for binary-like stats, by splitting the statistics_new_func macro into a statistics_new_func_always_exact and statistics_new_func_inexact that generates a binary_with_inexact method? Given that there's only one place in the code in column/mod.rs where we set these to something other than true, would reduce the churn significantly.
There was a problem hiding this comment.
No, parquet-mr only applies this to binary statistics
In which case perhaps we can have slightly less pessimistic defaulting behaviour here?
by splitting the statistics_new_func macro into a statistics_new_func_always_exact and statistics_new_func_inexact
I think I would prefer to avoid this being a breaking change at all, the approach in #5076 (comment) would be consistent with other structures in this codebase and would be my preference unless there is a reason it is insufficient.
There was a problem hiding this comment.
OK, will update to that.
There was a problem hiding this comment.
@tustvold should be OK to re-review? I've also fixed the lints that were causing issues with the checks.
tustvold
left a comment
There was a problem hiding this comment.
This looks good to me thank you
Which issue does this PR close?
Closes #5037.
Rationale for this change
Similar to parquet-mr (apache/parquet-java#696), we allow truncation of statistics for binary and fix-length binary columns.
What changes are included in this PR?
7b37fd4introduces the min/max exactness parameters and parses them for various statistics, and ensures round-tripping.e57634dcreates a new writer property, and implements the truncation. It's tested for both strings and for decimals, and in the decimal case we ensure that re-constructed min and max decimals of the correct byte length will properly bound the true value.Are there any user-facing changes?
Introduction of new functionality to set the truncation length, but no breaking changes.