Feature Request / Improvement
Hi Everyone,
I have some exciting news to share! Hive now supports writing column statistics to Puffin files.
The statistics calculated by Hive include histograms, NDV (number of distinct values), min and max values, the number of nulls, the number of true values, the column name, and the column type. You can find the full list of supported stats here: Link to GitHub.
These statistics are stored as a Hive columnStatistics object, which is serialized and saved as a single blob in the Puffin file. You can refer to the code here for more information: Link to GitHub.
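For readers less familiar with Puffin, here is a minimal Python sketch of how an opaque, already-serialized statistics object could sit in a Puffin file as one blob. It follows my reading of the Puffin layout (magic bytes, blobs, then a JSON footer) and is not Hive's actual writer. The blob type name `hive-column-statistics-v1`, the helper names, and the JSON stand-in payload are all hypothetical; the real payload is the serialized columnStatistics object.

```python
import json
import struct

MAGIC = b"PFA1"  # four magic bytes that open the file and frame the footer


def write_puffin(blobs, created_by="example-writer"):
    """Assemble a minimal, uncompressed Puffin-style file in memory.

    Each entry of `blobs` is a dict with the keys: type, fields,
    snapshot_id, sequence_number, payload (bytes).
    """
    out = bytearray(MAGIC)
    blob_meta = []
    for blob in blobs:
        # The footer records where each blob starts and how long it is.
        blob_meta.append({
            "type": blob["type"],
            "fields": blob["fields"],
            "snapshot-id": blob["snapshot_id"],
            "sequence-number": blob["sequence_number"],
            "offset": len(out),
            "length": len(blob["payload"]),
        })
        out += blob["payload"]
    footer_payload = json.dumps(
        {"blobs": blob_meta, "properties": {"created-by": created_by}}
    ).encode("utf-8")
    out += MAGIC
    out += footer_payload
    out += struct.pack("<i", len(footer_payload))  # 4-byte little-endian size
    out += bytes(4)  # flags: none set, footer payload is uncompressed
    out += MAGIC
    return bytes(out)


def read_blobs(data):
    """Return (metadata, payload) pairs by parsing the footer from the end."""
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing Puffin magic bytes")
    (payload_size,) = struct.unpack("<i", data[-12:-8])
    footer_start = len(data) - 12 - payload_size
    if data[footer_start - 4:footer_start] != MAGIC:
        raise ValueError("missing footer magic bytes")
    footer = json.loads(data[footer_start:len(data) - 12].decode("utf-8"))
    return [
        (meta, data[meta["offset"]:meta["offset"] + meta["length"]])
        for meta in footer["blobs"]
    ]


# Stand-in for the serialized statistics object (hypothetical content).
stats_payload = json.dumps({"column": "id", "ndv": 42, "nulls": 0}).encode("utf-8")
puffin_bytes = write_puffin([{
    "type": "hive-column-statistics-v1",  # hypothetical blob type name
    "fields": [1],                        # field ID of the described column
    "snapshot_id": 1,
    "sequence_number": 1,
    "payload": stats_payload,
}])
```

Because the reader only sees a type string plus raw bytes, an engine that does not recognize the blob type can skip the blob, which is why agreeing on standard type names matters.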
Currently, this object is supported by Hive and partially by Impala as well: Link to GitHub. We also plan to use the KLL sketch from Apache DataSketches for histograms.
As a result, we would like to add the columnStatistics object and the KLL sketch as standard blob types for Puffin files. Link to GitHub
Any feedback would be greatly appreciated.
Thanks!
Query engine
Hive