Extends Iceberg table stats API to allow publish data and stats atomically #6442

@findepi

Feature Request / Improvement

Currently UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) only allows adding statistics for an existing snapshot.
As a result, it is currently not possible to publish a snapshot with statistics already collected.

Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE),
but some query engines (like Trino) can collect stats on the fly while writing to a table (INSERT, CREATE TABLE AS ...).

It's not difficult to:

  • publish data change snapshot (adding new files)
  • take a note of new snapshot ID
  • add statistics for that snapshot

However, this has some drawbacks:

  • new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if data changed significantly)
  • someone may run ANALYZE on the new snapshot (unknowingly or intentionally), and this will end up with two different threads wanting to add stats to it -- wasted work
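
To make the window concrete, here is a minimal sketch of the three steps and both drawbacks. It is a toy in-memory model, not the Iceberg API: `ToyTable`, `commitAppend`, `statisticsFor` and `setStatistics` are hypothetical names made up for illustration. With the real API the two commits would roughly be `table.newAppend()...commit()` followed by `table.updateStatistics()...commit()`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model of the two-step workaround. NOT the Iceberg API; all names are hypothetical. */
class StatsWindowSketch {

    /** In-memory stand-in for table metadata: snapshots plus optional stats per snapshot. */
    static final class ToyTable {
        private final Map<Long, List<String>> dataFilesBySnapshot = new HashMap<>();
        private final Map<Long, String> statsBySnapshot = new HashMap<>();
        private long nextSnapshotId = 1;

        /** Steps 1 and 2: publish a data change and return the new snapshot ID. */
        long commitAppend(List<String> dataFiles) {
            long snapshotId = nextSnapshotId++;
            dataFilesBySnapshot.put(snapshotId, List.copyOf(dataFiles));
            return snapshotId;
        }

        /** What a concurrent reader sees when planning against a snapshot. */
        Optional<String> statisticsFor(long snapshotId) {
            return Optional.ofNullable(statsBySnapshot.get(snapshotId));
        }

        /** Step 3: attach stats to an existing snapshot; returns any stats file it replaced. */
        Optional<String> setStatistics(long snapshotId, String statsFile) {
            if (!dataFilesBySnapshot.containsKey(snapshotId)) {
                throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
            }
            return Optional.ofNullable(statsBySnapshot.put(snapshotId, statsFile));
        }
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();

        // First commit: the data is now visible to every reader.
        long snapshotId = table.commitAppend(List.of("data-001.parquet"));

        // Drawback 1: until the second commit lands, readers plan without stats.
        System.out.println("stats visible after data commit: "
                + table.statisticsFor(snapshotId).isPresent());

        // Second commit: the writer attaches the stats it collected while writing.
        table.setStatistics(snapshotId, "writer-stats.puffin");

        // Drawback 2: a concurrent ANALYZE computes stats for the same snapshot,
        // so one of the two results is redundant.
        Optional<String> replaced = table.setStatistics(snapshotId, "analyze-stats.puffin");
        System.out.println("ANALYZE replaced existing stats: " + replaced.isPresent());
    }
}
```

In this model the gap between `commitAppend` and the first `setStatistics` is exactly the period in which other queries see the new data but no stats.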

We should make it possible to publish a data change together with its new stats.
This may require API changes.
It may also require spec changes, if we want to use the "inherit snapshot ID" model.
(Maybe we don't have to, since stats are in metadata?)

Query engine

None
