
Some way to avoid writing redundant statistics into data page headers #7580

@alamb


Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are currently 3 places statistics can be written in Parquet files:

  1. In the metadata for each ColumnChunk (source)
  2. In the data page header (source)
  3. In the "ColumnIndex" (source), which is part of the so-called "Page Index"

The level of statistics is controlled by the EnabledStatistics structure:

  • EnabledStatistics::None: No statistics
  • EnabledStatistics::Chunk: Stores the statistics for each ColumnChunk (1 above)
  • EnabledStatistics::Page: Stores the statistics for each ColumnChunk, AND the data page headers AND the ColumnIndex
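To make the three levels concrete, here is a minimal std-only sketch of which locations receive statistics at each level. It is a simplified model of the behavior described above, not the parquet crate's code: `StatsLocations` and `locations_written` are hypothetical names introduced for illustration. (In the real crate the level is chosen with `WriterProperties::builder().set_statistics_enabled(...)`.)

```rust
/// Simplified copy of the enum discussed above, so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnabledStatistics {
    None,
    Chunk,
    Page,
}

/// Hypothetical helper type: the three places statistics can be written.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StatsLocations {
    pub column_chunk: bool,
    pub data_page_header: bool,
    pub column_index: bool,
}

/// Models the current behavior as described in this issue.
pub fn locations_written(level: EnabledStatistics) -> StatsLocations {
    match level {
        // No statistics anywhere.
        EnabledStatistics::None => StatsLocations {
            column_chunk: false,
            data_page_header: false,
            column_index: false,
        },
        // Only the ColumnChunk metadata (place 1).
        EnabledStatistics::Chunk => StatsLocations {
            column_chunk: true,
            data_page_header: false,
            column_index: false,
        },
        // All three places, so page-level statistics are stored twice.
        EnabledStatistics::Page => StatsLocations {
            column_chunk: true,
            data_page_header: true,
            column_index: true,
        },
    }
}
```

Under this model, `EnabledStatistics::Page` is the only level that writes the data page header copy, which is the redundancy this issue is about.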

Problem

EnabledStatistics::Page is wasteful because:

  1. It stores the page-level statistics twice, resulting in larger Parquet files (see #7579, "max_statistics_truncate_length is ignored when writing statistics to data page headers", for an example)
  2. The copy in the data page header cannot even be accessed via the Rust parquet reader, and I don't think it is widely used (it was effectively replaced by the PageIndex)

In fact, as @etseidl points out here: #7490 (comment)

Makes me wonder if we should rethink EnabledStatistics. The Parquet spec actually recommends not writing page level statistics if the page indexes are written. Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header.

Specifically the documentation on PageIndex says:

Readers that support ColumnIndex should not also use page statistics. The only reason to write page-level statistics when writing ColumnIndex structs is to support older readers (not recommended).

Describe the solution you'd like
I would like a way to avoid writing data page header statistics, as they are unlikely to be useful to other systems and are thus wasteful.

Describe alternatives you've considered

Option 1: Redefine EnabledStatistics::Page

I personally suggest the following change, which would require no changes for users who have set EnabledStatistics::Page and would make their Parquet files smaller.

  1. Redefine EnabledStatistics::Page to mean: store statistics for the ColumnChunk and the PageIndex (not the data page headers)
  2. Add a new option, WriterProperties::write_data_page_statistics, that would explicitly write statistics to the data page headers as well. We would add a note saying the option is not recommended, for the reasons listed above
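A rough sketch of how Option 1 could behave, written as a std-only model rather than a patch to the parquet crate. `WriterOptions` and its methods are hypothetical stand-ins for the relevant part of `WriterProperties`; the flag name follows the proposal above, and the choice that the flag only takes effect at the `Page` level is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnabledStatistics {
    None,
    Chunk,
    Page,
}

/// Hypothetical stand-in for the statistics-related part of WriterProperties.
#[derive(Debug, Clone, Copy)]
pub struct WriterOptions {
    pub statistics: EnabledStatistics,
    /// Proposed opt-in flag; off by default so existing `Page` users
    /// automatically stop writing the redundant copy.
    pub write_data_page_statistics: bool,
}

impl WriterOptions {
    pub fn new(statistics: EnabledStatistics) -> Self {
        Self {
            statistics,
            write_data_page_statistics: false,
        }
    }

    /// Explicitly opt back in to data page header statistics (not recommended).
    pub fn with_data_page_statistics(mut self, enabled: bool) -> Self {
        self.write_data_page_statistics = enabled;
        self
    }

    pub fn writes_chunk_statistics(&self) -> bool {
        self.statistics != EnabledStatistics::None
    }

    pub fn writes_column_index(&self) -> bool {
        self.statistics == EnabledStatistics::Page
    }

    /// Assumption: the flag is ignored unless page-level statistics are enabled.
    pub fn writes_data_page_header_statistics(&self) -> bool {
        self.statistics == EnabledStatistics::Page && self.write_data_page_statistics
    }
}
```

With this shape, code that already sets `EnabledStatistics::Page` compiles unchanged and produces smaller files, while writers that must support older readers can call `with_data_page_statistics(true)`.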

Option 2: EnabledStatistics::ChunkAndIndex

@etseidl suggests adding another variant:

Perhaps we could add something like EnabledStatistics::ChunkAndIndex to write chunk level and offset/column indexes but no statistics in the page header.

One challenge with this is that all existing users would need to know to update their code to stop writing data page header statistics.

Option 3: Make EnabledStatistics more specific

Another alternative is to make EnabledStatistics more specific, something like:

  • EnabledStatistics::None: No statistics
  • EnabledStatistics::Chunk: Stores the statistics for each ColumnChunk (1 above)
  • EnabledStatistics::ColumnIndex: Stores the statistics in the ColumnChunk and ColumnIndex
  • EnabledStatistics::ColumnIndexAndPage: Stores the statistics in the data page headers AND the ColumnChunk and the ColumnIndex

This would be a breaking API change that would be somewhat annoying to downstream users as they would have to change their code.
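To illustrate why Option 3 is a breaking change, here is a small std-only sketch of the more specific enum together with the mapping downstream code would have to apply. `OldEnabledStatistics` and `migrate_preserving_behavior` are hypothetical names used only for this illustration; nothing like them exists in the parquet crate.

```rust
/// The enum as it exists today (renamed here so both can coexist in one sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OldEnabledStatistics {
    None,
    Chunk,
    Page,
}

/// The more specific enum proposed in Option 3.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnabledStatistics {
    None,
    Chunk,
    ColumnIndex,
    ColumnIndexAndPage,
}

impl EnabledStatistics {
    pub fn writes_column_index(self) -> bool {
        matches!(self, Self::ColumnIndex | Self::ColumnIndexAndPage)
    }

    pub fn writes_data_page_header_statistics(self) -> bool {
        matches!(self, Self::ColumnIndexAndPage)
    }
}

/// Hypothetical migration helper: maps an old setting to the new variant
/// that writes statistics to exactly the same places as before.
pub fn migrate_preserving_behavior(old: OldEnabledStatistics) -> EnabledStatistics {
    match old {
        OldEnabledStatistics::None => EnabledStatistics::None,
        OldEnabledStatistics::Chunk => EnabledStatistics::Chunk,
        // `Page` has no same-named successor, so every caller must edit code.
        OldEnabledStatistics::Page => EnabledStatistics::ColumnIndexAndPage,
    }
}
```

Because `Page` disappears, every caller that used it has to choose between `ColumnIndex` (smaller files) and `ColumnIndexAndPage` (current behavior), which is the downstream churn noted above.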

Additional context

Labels: enhancement (any new improvement worthy of an entry in the changelog)
