Fix several bugs in parquet writer statistics generation, add EnabledStatistics to control level of statistics generated#2022
Conversation
parquet/src/file/properties.rs
Outdated
There was a problem hiding this comment.
I'm not a massive fan of this, but it was the only way I could think of to preserve the existing behaviour whilst also fixing the related bugs
There was a problem hiding this comment.
I actually like this API -- as for some use cases page level statistics are likely to have a higher overhead in storage and writing than benefit (e.g. large string columns for small row groups)
There was a problem hiding this comment.
It would be good to clarify in here if Page also computes Chunk level statistics.
If Page does not also include Chunk that seems strange to me (as I would expect lower level stats to also have coarser grained statistics).
There was a problem hiding this comment.
Glad one of us likes it 😆
FWIW at least in the case of the column index, parquet allows using truncated values in statistics. e.g. instead of a min of "lorem ipsum ..." you could just store "lorem"
034be68 to
06cfa35
Compare
| if let Some(distinct) = distinct_count { | ||
| self.column_distinct_count = | ||
| Some(self.column_distinct_count.unwrap_or(0) + distinct); | ||
| // We can only set the distinct count if there are no other writes |
| Some(self.make_page_statistics()) | ||
| } else { | ||
| None | ||
| let has_min_max = self.min_page_value.is_some() && self.max_page_value.is_some(); |
There was a problem hiding this comment.
This logic is moved from flush_data_pages
Codecov Report
@@ Coverage Diff @@
## master #2022 +/- ##
==========================================
- Coverage 83.59% 83.43% -0.17%
==========================================
Files 222 222
Lines 57529 57911 +382
==========================================
+ Hits 48091 48317 +226
- Misses 9438 9594 +156
Continue to review full report at Codecov.
|
EnabledStatistics to control level of statistics generated
alamb
left a comment
There was a problem hiding this comment.
I had some comments on the API design and deprecation, but the actual code looks good and like a significant improvement to me. 👍
| }; | ||
|
|
||
| let mut meta = props.key_value_metadata.clone().unwrap_or_default(); | ||
| let meta = props |
There was a problem hiding this comment.
Does this change have any effect (other than to make the code more readable)?
There was a problem hiding this comment.
There is no user-facing change, but it does avoid an unnecessary clone
parquet/src/file/properties.rs
Outdated
There was a problem hiding this comment.
I actually like this API -- as for some use cases page level statistics are likely to have a higher overhead in storage and writing than benefit (e.g. large string columns for small row groups)
| .map_or(true, |v| self.compare_greater(v, min)) | ||
| { | ||
| self.min_column_value = Some(min.clone()); | ||
| if self.statistics_enabled == EnabledStatistics::Chunk { |
There was a problem hiding this comment.
I would personally find match self.statistics_enabled ... to be easier to validate that other variants of EnabledStatistics::Chunk did not apply.
Or perhaps a function EnabledStatistics::compute_chunk_statistics().
Basically I am thinking about "what if someone adds a new variant" -- it would be nice if the compiler told them where to update
There was a problem hiding this comment.
In fact, no that I re-read this, shouldn't this also be set if we are computing page level statistics?
There was a problem hiding this comment.
I'll add a comment explaining why we only care if it is exactly this variant, any other variants would need to be computed at a different level
| (None, None) => {} | ||
| (None, None) => { | ||
| for val in values { | ||
| Self::update_min(&self.descr, val, &mut self.min_column_value); |
parquet/src/column/writer.rs
Outdated
| /// Writer may optionally provide pre-calculated statistics for this batch | ||
| /// | ||
| /// Note: Unless disabled using by using [`WriterProperties`] to set | ||
| /// enabled statistics to [`EnabledStatistics::Chunk`], this will still compute |
There was a problem hiding this comment.
This comment almost reads the opposite of what I think the code is doing (using the passed in pre-calculated statistics or computing them if passed in)
parquet/src/file/properties.rs
Outdated
There was a problem hiding this comment.
It would be good to clarify in here if Page also computes Chunk level statistics.
If Page does not also include Chunk that seems strange to me (as I would expect lower level stats to also have coarser grained statistics).
| self.statistics_enabled = Some(enabled); | ||
| } | ||
|
|
||
| /// Sets max size for statistics for this column. |
There was a problem hiding this comment.
I don't see any mention of max_statistics_size in this PR description or the linked tickets.
I think it is a reasonable feature (to limit bloating parquet files). Is the issue that it was ignored? If so perhaps we can leave the API in with a warn or deprecate that it is ignored and file a ticket to properly support it?
There was a problem hiding this comment.
I will back out this change
| /// Perform a conditional update of `cur`, skipping any NaN values | ||
| /// | ||
| /// If `cur` is `None`, sets `cur` to `Some(val)`, otherwise calls `should_update` with | ||
| /// the value of `cur`, and updates `cur` to `Some(val)` if it returns `true` |
There was a problem hiding this comment.
cc @crepererum as I think you added the initial NaN handling a while ago
There was a problem hiding this comment.
If I read this correctly the NaN handling here is to avoid "poisoning" which is OK.
IMHO the stats should include a flag if there were any NaN values, but this is far outside the scope of this PR since this would affect the parquet format as a whole.
| assert_eq!(pages[1].page_type(), PageType::DATA_PAGE); | ||
|
|
||
| let page_statistics = pages[1].statistics().unwrap(); | ||
| assert_eq!(page_statistics.min_bytes(), 1_i32.to_le_bytes()); |
Which issue does this PR close?
Closes #2015
Closes #2016
Rationale for this change
Fixes a number of bugs
What changes are included in this PR?
Removes max_statistics_size as it isn't used, and changes statistics to be enabled based on an enumeration. This allows consistently only producing chunk level statistics, potentially precomputed, or not.
Are there any user-facing changes?
No