Support retrieving the latest Iceberg table on table scan#1297
Support retrieving the latest Iceberg table on table scan#1297phillipleblanc wants to merge 14 commits intoapache:mainfrom
Conversation
* Allow resolving the current snapshot ID to use on a scan from a callback function * Use table_fn * Fix * Just pass a reference to the catalog * make public * Just take table ident * lint
jonathanc-n
left a comment
There was a problem hiding this comment.
Looks good to me overall, just some small things. Thanks @phillipleblanc!
| /// A reference-counted arrow `Schema`. | ||
| schema: ArrowSchemaRef, | ||
| /// A reference to the catalog that this table provider belongs to. | ||
| catalog: Option<Arc<dyn Catalog>>, |
There was a problem hiding this comment.
This is nice, we need this for future transactions.
There was a problem hiding this comment.
Yeah, it might make sense. I think the use-cases for a static table are fairly limited, and that could be a separate TableProvider. But it would be a breaking change for all users
There was a problem hiding this comment.
I think breaking change should still be fine at this early stage
There was a problem hiding this comment.
@xxchan apologies for the delay, I've made this breaking change to require a catalog and pushed up the changes.
There was a problem hiding this comment.
Just to add some context for the discussion: keeping catalog as Optional still provides some convenience for certain use cases: #650
jonathanc-n
left a comment
There was a problem hiding this comment.
LGTM! cc @liurenjie1024 @Fokko @sdd
|
Could I please get a re-review on this PR, thanks! cc @liurenjie1024 @Fokko @sdd @Xuanwo |
Original PR: #11 Upstream PR: apache#1297
Original PR: #11 Upstream PR: apache#1297
|
This is superseded by the work in #1877 |
Which issue does this PR close?
What changes are included in this PR?
Makes the IcebergTableProvider::try_new method public that takes an
Arc<dyn Catalog>and a TableIdent. It uses that to get the current table metadata when the DataFusion TableProvider is created - but it also stores a reference to the Arc. When the DataFusion TableProvider is asked to scan the table, it uses the catalog to fetch the latest table metadata.This allows the TableProvider to get the latest changes to the Iceberg table, as opposed to being stuck on the snapshot when the table was created. This aligns closer to the expectation of using DataFusion TableProviders, where the scan is expected to scan the latest data.
Are these changes tested?
Covered by the existing integration tests at
crates/integrations/datafusion/tests/integration_datafusion_test.rs- but I have also added a dedicated test for this.