Closed as not planned
Labels
API, Specification (issues that may introduce spec changes), core, improvement (PR that improves existing functionality), stale
Description
Feature Request / Improvement
Currently, UpdateStatistics (org.apache.iceberg.Transaction#updateStatistics) only allows adding statistics to an existing snapshot.
As a result, it is not possible to publish a snapshot with statistics already collected.
Collecting statistics for existing data is definitely an important use case (like Trino's ANALYZE),
but some query engines (like Trino) can collect stats on the fly when writing to a table (INSERT, CREATE TABLE AS ...).
It's not difficult to:
- publish a data change snapshot (adding new files)
- take note of the new snapshot ID
- add statistics for that snapshot
However, this has some drawbacks:
- new data is published without stats, so other queries can be planned sub-optimally, leading to e.g. improper use of cluster resources, or even unexpected query failures (if the data changed significantly)
- someone may run ANALYZE on the new snapshot (unknowingly or intentionally), so two different threads end up trying to add stats to the same snapshot, which is wasted work
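The three-step workaround and its first drawback can be illustrated with a toy in-memory model. This is only a sketch: `ToyTable`, `TwoStepWriter`, and the file names are hypothetical and are not the Iceberg API. In Iceberg the corresponding steps would be a data commit followed by a separate UpdateStatistics commit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Toy model of table metadata. Hypothetical names, NOT the Iceberg API.
final class ToyTable {
    private final List<Long> snapshotIds = new ArrayList<>();
    private final Map<Long, String> dataFileBySnapshot = new HashMap<>();
    private final Map<Long, String> statsFileBySnapshot = new HashMap<>();
    private long nextSnapshotId = 1;

    // Publishes a data change as a new snapshot and returns its ID.
    synchronized long commitAppend(String dataFile) {
        long id = nextSnapshotId++;
        snapshotIds.add(id);
        dataFileBySnapshot.put(id, dataFile);
        return id;
    }

    // Attaches statistics to a snapshot that has already been published.
    synchronized void setStatistics(long snapshotId, String statsFile) {
        if (!dataFileBySnapshot.containsKey(snapshotId)) {
            throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
        }
        statsFileBySnapshot.put(snapshotId, statsFile);
    }

    // Returns the ID of the most recently published snapshot.
    synchronized long currentSnapshotId() {
        if (snapshotIds.isEmpty()) {
            throw new IllegalStateException("Table has no snapshots");
        }
        return snapshotIds.get(snapshotIds.size() - 1);
    }

    // Returns the statistics file for a snapshot, if one has been attached.
    synchronized Optional<String> statisticsFor(long snapshotId) {
        return Optional.ofNullable(statsFileBySnapshot.get(snapshotId));
    }
}

final class TwoStepWriter {
    // Runs the workaround and returns what a reader planning a query
    // between the data commit and the stats commit would observe.
    static Optional<String> statsSeenInGap(ToyTable table) {
        // Step 1: publish the data change.
        // Step 2: take note of the new snapshot ID.
        long snapshotId = table.commitAppend("data-0001.parquet");

        // A concurrent reader looks up stats for the current snapshot here.
        Optional<String> seenByReader = table.statisticsFor(table.currentSnapshotId());

        // Step 3: add statistics for that snapshot in a separate commit.
        table.setStatistics(snapshotId, "stats-0001.puffin");
        return seenByReader;
    }
}
```

Calling `TwoStepWriter.statsSeenInGap(new ToyTable())` returns an empty `Optional` in this model: the snapshot is visible before its statistics are, which is the window in which other queries may be planned without stats.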
We should make it possible to publish a data change together with new stats.
This may require API changes.
It may also require spec changes, if we want to use an "inherit snapshot ID" model.
(Maybe we don't have to, since stats are in metadata?)
Query engine
None