Skip to content

feat: Add NaN-counts support for parquet metadata#8158

Draft
Xuanwo wants to merge 5 commits intoapache:mainfrom
Xuanwo:IEEE-754
Draft

feat: Add NaN-counts support for parquet metadata#8158
Xuanwo wants to merge 5 commits intoapache:mainfrom
Xuanwo:IEEE-754

Conversation

@Xuanwo
Copy link
Member

@Xuanwo Xuanwo commented Aug 15, 2025

Which issue does this PR close?

Rationale for this change

Add NaN-count parquet for parquet

Are these changes tested?

Yes, well tested. All cases mentioned in #8156 should have been tested. Let me know if there are something missed.

Are there any user-facing changes?

nan-count is added in a compatible way without any API breaking changes.

One exception is Statistics::new in the format.rs, but I think it's not part of the API surface.


This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I aimed to keep it aligned with our goals, though I may have missed minor issues. I am responsible for every change made in this PR. Please flag anything that feels off, I’ll fix it quickly.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 15, 2025
@etseidl
Copy link
Contributor

etseidl commented Aug 16, 2025

Thanks for taking this on @Xuanwo, I sure didn't feel like doing this again 😅.

Given that this doesn't yet address the changes to the page indexes nor yet add the new ColumnOrder should we mark it draft?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for helping this move along @Xuanwo

I think the code looks pretty good but is missing a few things as @etseidl describes

  1. The new IEEE ordering
  2. Testing statistics in the page index

For testing, I recomend a "end to end" style test in https://github.com/apache/arrow-rs/tree/main/parquet/tests somewhere

  1. Write floating point data to a real parquet file
  2. REad the metadata back from the file and assert the statistics

That would ensure we hooked everything up right and the end user API was working as expected

);
assert_eq!(stats.max_opt().unwrap(), &ByteArray::from(f16::INFINITY));
assert_eq!(stats.nan_count_opt(), Some(2));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spec also talks about the case for all nans, so that would also be a good case to check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review, I will polish the tests side.

@alamb alamb changed the title feat: Add NaN-counts support for parquet feat: Add NaN-counts support for parquet metadata Aug 18, 2025
@alamb
Copy link
Contributor

alamb commented Aug 18, 2025

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

Xuanwo added 2 commits August 19, 2025 02:26
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo marked this pull request as draft August 18, 2025 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Parquet] Prototype: PARQUET-2249: Introduce IEEE 754 total order & NaN-counts #514

3 participants