feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) #1777
Conversation
CI failure looks like an environment issue, should be fine on a rerun.
liurenjie1024
left a comment
Thanks @mbutrovich for this fix. While this PR is generally correct, it handles the special case in several places, and I have concerns about the extensibility of this approach — for example, what if we later need to handle name mapping? Would it be possible to refactor with the following method:
- If the Parquet field IDs do not exist, build a new Parquet schema with assigned field IDs, as Java does: https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java#L96
- Create the Arrow schema from the generated Parquet schema.
With this approach, we could keep the other parts unchanged. WDYT?
Thanks for the comments @liurenjie1024! Let me take a look today to address your feedback.
Thanks again @liurenjie1024! Please let me know if these changes reflect what you had in mind. I'm hoping this design where we pass a custom schema to
# Conflicts: # crates/iceberg/src/arrow/reader.rs
liurenjie1024
left a comment
Thanks @mbutrovich for this fix!
What issue does this PR close?
Partially addresses #1749.
Rationale for this change
Background: This issue was discovered when running Iceberg Java's test suite against our experimental DataFusion Comet branch that uses iceberg-rust. Many failures occurred in TestMigrateTableAction.java, which tests reading Parquet files from migrated tables (e.g., from Hive or Spark) that lack embedded field ID metadata.

Problem: The Rust ArrowReader was unable to read these files, while Iceberg Java handles them using a position-based fallback where top-level field ID N maps to top-level Parquet column position N-1, and entire columns (including nested content) are projected.
What changes are included in this PR?
This PR implements position-based column projection for Parquet files without field IDs, enabling iceberg-rust to read migrated tables.
Solution: Implemented fallback projection in ArrowReader::get_arrow_projection_mask_fallback() that matches the behavior of Java's ParquetSchemaUtil.pruneColumnsFallback():
- Uses ProjectionMask::roots() to project entire columns, including nested content (structs, lists, maps)
- Leaves schema evolution (e.g., filling missing columns with NULLs) to RecordBatchTransformer

This implementation now matches Iceberg Java's behavior for reading migrated tables, enabling interoperability with Java-based tooling and workflows.
Are these changes tested?
Yes, comprehensive unit tests were added to verify the fallback path works correctly:
- test_read_parquet_file_without_field_ids - Basic projection with primitive columns using position-based mapping
- test_read_parquet_without_field_ids_partial_projection - Project a subset of columns
- test_read_parquet_without_field_ids_schema_evolution - Handle missing columns with NULL values
- test_read_parquet_without_field_ids_multiple_row_groups - Verify behavior across row group boundaries
- test_read_parquet_without_field_ids_with_struct - Project structs with nested fields (entire top-level column)
- test_read_parquet_without_field_ids_filter_eliminates_all_rows - Comet saw a panic when all row groups were filtered out; this reproduces that scenario
- test_read_parquet_without_field_ids_schema_evolution_add_column_in_middle - Schema evolution with a column added in the middle caused a panic at one point

All tests verify that behavior matches Iceberg Java's pruneColumnsFallback() implementation in /parquet/src/main/java/org/apache/iceberg/parquet/ParquetSchemaUtil.java.