Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are currently 3 places statistics can be written in Parquet files:
- In the metadata for each ColumnChunk (source)
- in the data page header (source)
- In the "ColumnIndex" (source) which is part of the so called "Page Index"
The level of statistics is controlled by the EnabledStatistics structure:
- EnabledStatistics::None: No statistics
- EnabledStatistics::Chunk: Stores the statistics for each ColumnChunk (1 above)
- EnabledStatistics::Page: Stores the statistics for each ColumnChunk, AND the data page headers AND the ColumnIndex
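The mapping from statistics level to write location described above can be sketched as a small self-contained model. The names `StatsLevel`, `StatsLocation`, and `locations_written` are made up for illustration and are not the parquet crate's types; the behavior simply restates the list above.

```rust
/// Local stand-in for the statistics levels described above.
/// Simplified model for illustration, not the parquet crate's definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// The three places statistics can be written in a Parquet file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatsLocation {
    ColumnChunkMetadata,
    DataPageHeader,
    ColumnIndex,
}

/// Returns the locations written for each level, per the description above.
pub fn locations_written(level: StatsLevel) -> Vec<StatsLocation> {
    match level {
        StatsLevel::None => vec![],
        StatsLevel::Chunk => vec![StatsLocation::ColumnChunkMetadata],
        // Page-level statistics end up in both the data page header
        // and the ColumnIndex, which is the duplication discussed here.
        StatsLevel::Page => vec![
            StatsLocation::ColumnChunkMetadata,
            StatsLocation::DataPageHeader,
            StatsLocation::ColumnIndex,
        ],
    }
}
```

In this model, the `Page` level is the only one that writes the same page-level values to two places.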
Problem
EnabledStatistics::Page is wasteful because:
- It stores the page-level statistics twice, resulting in larger Parquet files (see "max_statistics_truncate_length is ignored when writing statistics to data page headers", #7579, for an example)
- The copy in the data page header cannot even be accessed via the Rust parquet reader, and I don't think it is widely used (it was effectively replaced by the PageIndex)
In fact, as @etseidl points out here: #7490 (comment)
Makes me wonder if we should rethink EnabledStatistics. The Parquet spec actually recommends not writing page level statistics if the page indexes are written. Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header.
Specifically the documentation on PageIndex says:
Readers that support ColumnIndex should not also use page statistics. The only reason to write page-level statistics when writing ColumnIndex structs is to support older readers (not recommended).
Describe the solution you'd like
I would like a way to avoid writing data page header statistics (as they are unlikely to be useful to other systems and are thus wasteful).
Describe alternatives you've considered
Option 1: Redefine EnabledStatistics::Page
I personally suggest the following change, which would require no changes for users who have set EnabledStatistics::Page and would make their Parquet files smaller.
- Redefine EnabledStatistics::Page to mean: store statistics for the ColumnChunk and PageIndex (not data page headers)
- Add a new option, WriterProperties::write_data_page_statistics, that would explicitly write the data page header statistics as well. We would add a note saying the option is not recommended for the reasons listed above
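A rough sketch of how Option 1 might behave, written as a standalone model rather than against the parquet crate. `ProposedWriterOptions` and its methods are hypothetical, `write_data_page_statistics` is the proposed (not yet existing) option, and the sketch assumes the flag only has an effect when the level is `Page`.

```rust
/// Local stand-in for the statistics levels; illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatsLevel {
    None,
    Chunk,
    Page,
}

/// Hypothetical model of writer settings under Option 1.
#[derive(Debug, Clone, Copy)]
pub struct ProposedWriterOptions {
    pub level: StatsLevel,
    /// Proposed opt-in flag; off by default so existing `Page` users
    /// stop writing data page header statistics without code changes.
    pub write_data_page_statistics: bool,
}

impl ProposedWriterOptions {
    pub fn new(level: StatsLevel) -> Self {
        Self { level, write_data_page_statistics: false }
    }

    /// Opt back in to the (not recommended) data page header statistics.
    pub fn with_data_page_statistics(mut self, enabled: bool) -> Self {
        self.write_data_page_statistics = enabled;
        self
    }

    /// Under the redefinition, `Page` still writes the ColumnIndex.
    pub fn writes_column_index(&self) -> bool {
        self.level == StatsLevel::Page
    }

    /// Assumption: the flag is ignored unless the level is `Page`.
    pub fn writes_data_page_header_stats(&self) -> bool {
        self.level == StatsLevel::Page && self.write_data_page_statistics
    }
}
```

With this shape, code that already sets the `Page` level compiles unchanged and produces smaller files, while users who need the old output set one extra flag.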
Option 2: EnabledStatistics::ChunkAndIndex
@etseidl suggests adding another variant:
Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header.
One challenge with this is that all existing users would need to know to update their code in order to stop writing data page header statistics.
Option 3: EnabledStatistics more specific
Another alternative is to make EnabledStatistics more specific, something like
* `EnabledStatistics::None`: No statistics
* `EnabledStatistics::Chunk`: Stores the statistics for each ColumnChunk (1 above)
* `EnabledStatistics::ColumnIndex`: Stores the statistics in the ColumnChunk and ColumnIndex
* `EnabledStatistics::ColumnIndexAndPage`: Stores the statistics in the data page headers **AND** the ColumnChunk and the ColumnIndex

This would be a breaking API change that would be somewhat annoying to downstream users as they would have to change their code.
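To illustrate why Option 3 is a breaking change, here is a minimal sketch of the migration downstream code would face. Both enums are local stand-ins (`OldLevel` models the current variants, `NewLevel` the proposed ones), and `migrate_preserving_output` is a hypothetical helper, not something in the parquet crate.

```rust
/// Stand-in for the current variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OldLevel {
    None,
    Chunk,
    Page,
}

/// Stand-in for the more specific variants proposed in Option 3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NewLevel {
    None,
    Chunk,
    ColumnIndex,
    ColumnIndexAndPage,
}

/// Maps an old setting to the new variant that writes the same locations.
/// Because `Page` no longer exists, every caller that used it must choose.
pub fn migrate_preserving_output(old: OldLevel) -> NewLevel {
    match old {
        OldLevel::None => NewLevel::None,
        OldLevel::Chunk => NewLevel::Chunk,
        // The old `Page` wrote data page headers too, so the equivalent
        // is `ColumnIndexAndPage`; `ColumnIndex` would give smaller files.
        OldLevel::Page => NewLevel::ColumnIndexAndPage,
    }
}

/// Whether a new-style level writes statistics into data page headers.
pub fn writes_data_page_header_stats(level: NewLevel) -> bool {
    matches!(level, NewLevel::ColumnIndexAndPage)
}
```

A mechanical migration like this keeps the output identical, so users would only get smaller files after deliberately switching to the `ColumnIndex` variant.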
Additional context