Request to add KLL Datasketch and Hive ColumnStatisticsObj as standard blob types to puffin file. #8198

@simhadri-g

Feature Request / Improvement

Hi Everyone,

I have some exciting news to share! Hive now supports writing column statistics to puffin files.

The statistics calculated by Hive include histograms, NDV (Number of Distinct Values), Min and Max values, the number of nulls, the number of true values, column name, and column type. You can find the full list of supported stats here: Link to GitHub.

These statistics are stored as a Hive columnStatistics object, which is serialized and saved as a blob in puffin. You can refer to the code here for more information: Link to GitHub.
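To make the idea concrete, here is a minimal sketch of how a per-column statistics object could be turned into a blob payload plus the blob metadata entry that the Puffin footer records (`type`, `fields`, `snapshot-id`, `sequence-number`, `offset`, `length`). This is not Hive's actual code: the `ColumnStats` class, the `make_blob` helper, the JSON encoding, and the blob type name `hive-column-statistics-v1` are all hypothetical placeholders chosen for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for Hive's column statistics object."""
    column_name: str
    column_type: str
    ndv: int                      # number of distinct values
    num_nulls: int
    min_value: Optional[int] = None
    max_value: Optional[int] = None


def make_blob(stats, field_id, snapshot_id, sequence_number, offset,
              blob_type="hive-column-statistics-v1"):
    """Serialize stats and build a Puffin-style blob metadata entry.

    JSON is used here only as a placeholder encoding, and the blob
    type name is an assumption, not a registered Puffin blob type.
    """
    payload = json.dumps(asdict(stats), sort_keys=True).encode("utf-8")
    metadata = {
        "type": blob_type,
        "fields": [field_id],          # Iceberg field IDs the blob covers
        "snapshot-id": snapshot_id,
        "sequence-number": sequence_number,
        "offset": offset,              # byte offset of the blob in the file
        "length": len(payload),
    }
    return payload, metadata


# Example: one blob placed right after the 4-byte magic at the file start.
stats = ColumnStats("id", "bigint", ndv=100, num_nulls=0,
                    min_value=1, max_value=100)
payload, metadata = make_blob(stats, field_id=1, snapshot_id=42,
                              sequence_number=7, offset=4)
print(metadata["type"], metadata["length"])
```

A standardized blob type would let other engines recognize the `type` string and decode the payload, which is the interoperability this request is asking for.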

Currently, this object is supported by Hive and partially by Impala as well: Link to GitHub. We also plan to incorporate the KLL datasketch for histograms.

As a result, we would like to add the columnStatistics object and the KLL datasketch as standard blob types for the puffin file. Link to GitHub.

Any feedback would be greatly appreciated.

Thanks!

Query engine

Hive
