Add options to skip decoding Statistics and SizeStatistics in Parquet metadata #9008
alamb merged 9 commits into apache:main
|
Sorry I didn't see this one before. I'll try and review it shortly |
|
run benchmark encoding metadata |
|
🤖: Benchmark completed
|
🤖: Benchmark completed
|
I am working to prepare the arrow 57.2.0 release -- do we want to merge this one, or shall we wait for arrow 58? |
I'll defer to your judgement @alamb. I don't think there's any pressing need at the moment, but it also doesn't need any further changes AFAIK, and it's not a breaking change. |
|
Thanks @etseidl -- let's wait then -- we already have quite a lot of good stuff queued up |
|
This PR appears to have picked up some conflicts -- likely with |
|
main is now open for the next release, so I think once the conflicts are resolved we can merge this one in too
Nice! That was me too this week (and there was quite a bit of backlog to get through!) I spent a nontrivial amount of time over the break studying and profiling the parquet reader, so I expect to be doing a bunch of micro-optimizations there (aka reducing allocations / reallocations). Should be a fun January! |
…quet metadata (apache#9008)

# Which issue does this PR close?

- Part of apache#5853.

# Rationale for this change

Add ability to skip the decoding of more types of statistics contained in the Parquet column metadata. While this currently doesn't have a huge impact on decode time, it can reduce the amount of memory used by the `ParquetMetaData`.

# What changes are included in this PR?

Adds more options and tests for those options. Also adds size statistics to the metadata bench.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Only adds new options, no breaking changes.
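
For readers skimming the rationale, here is a rough sketch of the idea in plain Rust. It does not use the `parquet` crate, and every name in it (`StatsPolicy`, `DecodeOptions`, `RawColumn`, `ColumnMeta`, `decode_columns`) is hypothetical and made up for illustration; the options actually added by this PR may look quite different. It only shows how a per-column policy can leave statistics undecoded so that the in-memory metadata holds `None` instead of owned buffers.

```rust
use std::collections::HashSet;

/// Hypothetical policy (not the parquet crate's API): which columns keep a
/// given kind of statistics when the metadata is decoded.
#[derive(Debug, Clone, Default)]
enum StatsPolicy {
    /// Decode the statistics for every column (assumed default).
    #[default]
    KeepAll,
    /// Skip the statistics for every column.
    SkipAll,
    /// Decode the statistics only for the listed column indices.
    KeepOnly(HashSet<usize>),
}

impl StatsPolicy {
    /// Returns true if statistics for `column` should be decoded.
    fn keeps(&self, column: usize) -> bool {
        match self {
            StatsPolicy::KeepAll => true,
            StatsPolicy::SkipAll => false,
            StatsPolicy::KeepOnly(cols) => cols.contains(&column),
        }
    }
}

/// Hypothetical decode options: one policy per kind of statistics.
#[derive(Debug, Clone, Default)]
struct DecodeOptions {
    column_stats: StatsPolicy,
    size_stats: StatsPolicy,
}

/// Stand-in for the encoded column metadata found in the file footer.
struct RawColumn {
    min_max: Option<(Vec<u8>, Vec<u8>)>,
    level_histogram: Option<Vec<i64>>,
}

/// Stand-in for the decoded, in-memory column metadata.
#[derive(Debug, Default, PartialEq)]
struct ColumnMeta {
    min_max: Option<(Vec<u8>, Vec<u8>)>,
    level_histogram: Option<Vec<i64>>,
}

/// Decode every column, copying out only the statistics the policies keep.
/// Skipped statistics are never cloned, so they occupy no heap memory.
fn decode_columns(raw: &[RawColumn], opts: &DecodeOptions) -> Vec<ColumnMeta> {
    raw.iter()
        .enumerate()
        .map(|(i, col)| ColumnMeta {
            min_max: if opts.column_stats.keeps(i) {
                col.min_max.clone()
            } else {
                None
            },
            level_histogram: if opts.size_stats.keeps(i) {
                col.level_histogram.clone()
            } else {
                None
            },
        })
        .collect()
}
```

With `DecodeOptions::default()` nothing is skipped, which mirrors the "no breaking changes" note: callers that never touch the new options see the same metadata as before.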