[Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514 #8156

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

@JFinis has been working on a proposal to better store statistics for floating point values in Parquet. The most recent proposal is here

Changing the format requires at least two open source implementations of a proposal.

There is also some question (see this link from @tustvold) about how complex this would be to implement and get right.

Describe the solution you'd like

I would like to implement a draft of the specification in apache/parquet-format#514 in arrow-rs to show it is possible and keep the Rust implementation on the leading edge of implementation.

Describe alternatives you've considered

We would also need to implement the nan_count field, along with filtering out NaNs when writing statistics for floats.

Some good tests would be to

  1. Write floating point data (specified below) to a parquet file

  2. Read the metadata back and verify min/max values and nan_count for the following cases:

     - A column with no NaN values
     - A column with a single +NaN value (should not appear in stats)
     - A column with a single -NaN value (should not appear in stats)
     - A column of only NaN values
     - A column with Inf and some +/- NaNs
     - A column with -Inf and some +/- NaNs
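As a rough illustration of what those tests would check, here is a minimal standalone sketch in plain Rust. It does not use the parquet crate: `FloatStats` and `float_stats` are hypothetical names invented for this example. It orders values with the standard library's `f64::total_cmp` (IEEE 754 total order), skips NaNs when computing min/max, and counts them separately. Returning no min/max for an all-NaN column is an assumption of this sketch; the proposal may specify different behavior for that case.

```rust
/// Hypothetical summary type for illustration only; this is NOT the
/// statistics struct used by the parquet crate.
#[derive(Debug, PartialEq)]
struct FloatStats {
    min: Option<f64>,
    max: Option<f64>,
    nan_count: u64,
}

/// Compute min/max under IEEE 754 total order, excluding NaNs from the
/// bounds and counting them in `nan_count`.
fn float_stats(values: &[f64]) -> FloatStats {
    let mut stats = FloatStats { min: None, max: None, nan_count: 0 };
    for &v in values {
        if v.is_nan() {
            // Both +NaN and -NaN are counted and kept out of min/max.
            stats.nan_count += 1;
            continue;
        }
        // total_cmp orders -0.0 before +0.0, unlike the usual `<` comparison.
        stats.min = Some(match stats.min {
            Some(m) if m.total_cmp(&v).is_le() => m,
            _ => v,
        });
        stats.max = Some(match stats.max {
            Some(m) if m.total_cmp(&v).is_ge() => m,
            _ => v,
        });
    }
    // Assumption: an all-NaN (or empty) column leaves min and max as None.
    stats
}
```

A real test in arrow-rs would write each column to a Parquet file and compare the statistics read back from the metadata against values computed this way.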

Additional context
