Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
- Related to "Parquet spec (parquet.thrift) is inconsistent w.r.t. ColumnIndex + NaNs" (parquet-format#406)
@JFinis has been working on a proposal to better store statistics for floating point values in Parquet. The most recent proposal is here
In order to change the format, there need to be at least two open source implementations of a proposal.
There is also some question (see this link from @tustvold) about how complex this would be to implement and get right.
Describe the solution you'd like
I would like to implement a draft of the specification in apache/parquet-format#514 in arrow-rs, both to show that it is feasible and to keep the Rust implementation on the leading edge.
Describe alternatives you've considered
- @etseidl has implemented the IEEE 754 total order in a draft PR here: [Not for Merge] PoC implementation of PARQUET-2249: Introduce IEEE 754 total order #7408
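For readers unfamiliar with the ordering, here is a minimal sketch of how IEEE 754 total order treats NaNs and signed zeros. It uses only the standard library's `f64::total_cmp`; `sort_total_order` is a hypothetical helper name and is not taken from the draft PR.

```rust
/// Hypothetical helper: sort a slice in IEEE 754 total order.
/// `f64::total_cmp` places -NaN first and +NaN last, and orders -0.0 before +0.0.
fn sort_total_order(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}
```

Under this ordering every value, including each NaN, has a well-defined position, which is why it is a candidate for statistics comparisons.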
We would also need to implement the nan_count field, along with filtering out NaNs when writing statistics for floats.
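As a rough illustration of the NaN filtering described above, here is a sketch in plain Rust with no parquet crate types. `FloatStats` and `compute_stats` are hypothetical names, not the arrow-rs API. Leaving min/max unset for an all-NaN column is an assumption of this sketch; the proposal may specify something different.

```rust
/// Hypothetical statistics holder for illustration; not an arrow-rs type.
#[derive(Debug, PartialEq)]
struct FloatStats {
    min: Option<f64>,
    max: Option<f64>,
    nan_count: u64,
}

/// Compute min/max over the non-NaN values and count the NaNs separately.
fn compute_stats(values: &[f64]) -> FloatStats {
    let mut stats = FloatStats { min: None, max: None, nan_count: 0 };
    for &v in values {
        if v.is_nan() {
            // NaNs (of either sign) are counted but never become min or max.
            stats.nan_count += 1;
            continue;
        }
        // Compare with total order so that -0.0 sorts before +0.0.
        stats.min = Some(match stats.min {
            Some(m) if m.total_cmp(&v).is_le() => m,
            _ => v,
        });
        stats.max = Some(match stats.max {
            Some(m) if m.total_cmp(&v).is_ge() => m,
            _ => v,
        });
    }
    // Assumption: an all-NaN input leaves min and max as None.
    stats
}
```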
Some good tests would be to:
- Write floating point data (specified below) to a parquet file
- Read the metadata back and verify min/max values and nan_count

for the following cases:
- A column with no NaN values
- A column with a single +NaN value (should not appear in stats)
- A column with a single -NaN value (should not appear in stats)
- A column of only NaN values
- A column with Inf and some +/- NaNs
- A column with -Inf and some +/- NaNs
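The cases listed above could be expressed as plain `Vec<f64>` inputs, roughly as follows. `test_columns` and `expected_nan_count` are hypothetical helpers, and the finite values are arbitrary placeholders rather than data from this issue. Writing the columns to a parquet file and reading the metadata back is deliberately left out of the sketch.

```rust
/// Hypothetical helper: one (label, values) pair per test case.
/// The finite values are arbitrary placeholders.
fn test_columns() -> Vec<(&'static str, Vec<f64>)> {
    // Quiet NaNs with the sign bit clear (+NaN) and set (-NaN).
    let pos_nan = f64::from_bits(0x7ff8_0000_0000_0000);
    let neg_nan = f64::from_bits(0xfff8_0000_0000_0000);
    vec![
        ("no NaN", vec![-1.0, 0.0, 1.0]),
        ("single +NaN", vec![-1.0, pos_nan, 1.0]),
        ("single -NaN", vec![-1.0, neg_nan, 1.0]),
        ("only NaN", vec![pos_nan, neg_nan, pos_nan]),
        ("Inf and +/- NaN", vec![f64::INFINITY, pos_nan, neg_nan, 1.0]),
        ("-Inf and +/- NaN", vec![f64::NEG_INFINITY, pos_nan, neg_nan, 1.0]),
    ]
}

/// The NaN count a writer would be expected to report for a column.
fn expected_nan_count(values: &[f64]) -> usize {
    values.iter().filter(|v| v.is_nan()).count()
}
```

Building the NaNs with `f64::from_bits` keeps the sign bit explicit, since the sign of the `f64::NAN` constant is not something a test should rely on.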
Additional context