feat: Add NaN-counts support for parquet metadata#8158
feat: Add NaN-counts support for parquet metadata#8158Xuanwo wants to merge 5 commits intoapache:mainfrom
Conversation
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
|
Thanks for taking this on @Xuanwo, I sure didn't feel like doing this again 😅. Given that this doesn't yet address the changes to the page indexes nor yet add the new |
alamb
left a comment
There was a problem hiding this comment.
Thank you so much for helping this move along @Xuanwo
I think the code looks pretty good but is missing a few things as @etseidl describes
- The new IEEE ordering
- Testing statistics in the page index
For testing, I recomend a "end to end" style test in https://github.com/apache/arrow-rs/tree/main/parquet/tests somewhere
- Write floating point data to a real parquet file
- REad the metadata back from the file and assert the statistics
That would ensure we hooked everything up right and the end user API was working as expected
| ); | ||
| assert_eq!(stats.max_opt().unwrap(), &ByteArray::from(f16::INFINITY)); | ||
| assert_eq!(stats.nan_count_opt(), Some(2)); | ||
| } |
There was a problem hiding this comment.
I think the spec also talks about the case for all nans, so that would also be a good case to check
There was a problem hiding this comment.
Thank you for the review, I will polish the tests side.
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Which issue does this PR close?
Rationale for this change
Add NaN-count parquet for parquet
Are these changes tested?
Yes, well tested. All cases mentioned in #8156 should have been tested. Let me know if there are something missed.
Are there any user-facing changes?
nan-count is added in a compatible way without any API breaking changes.
One exception is
Statistics::newin theformat.rs, but I think it's not part of the API surface.This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I aimed to keep it aligned with our goals, though I may have missed minor issues. I am responsible for every change made in this PR. Please flag anything that feels off, I’ll fix it quickly.