feat(arrow): Allow ArrowArrayAccessor to use field.name to match fields#1561
feat(arrow): Allow ArrowArrayAccessor to use field.name to match fields#1561liurenjie1024 merged 13 commits intoapache:mainfrom
Conversation
|
Let's discuss in this issue: #1560 (comment) |
crates/iceberg/src/arrow/value.rs
Outdated
| .map(|id| id == field.id) | ||
| .unwrap_or(false) | ||
| }) | ||
| .position(|arrow_field| self.arrow_field_matches_id(arrow_field, field.id)) |
There was a problem hiding this comment.
I think we still have misunderstanding here. As I have said #1560 (comment) , for the nan_val_cnt case, we should match field to arrow array using the field's position(e.g. the rank of this field in struct) rather than by id or name. It's not easy to accomplish with currentl interface, so we should change the interface of PartnerAccessor, see comments below.
|
@CTTY Thanks for raising this PR, as suggested in #1560 I think we need to rely on name-mapping here. Iceberg is designed to do operations lazy, meaning that we typically don't rewrite data unless it is neccessary. For example, if you rename a field, it will still find the field with the original name using the field-IDs. This is the case when you have a new table that has the field-IDs correctly set. However, in the age of big-data, and to make it easier for users to migrate to Iceberg, we also support importing existing Parquet files where the field-IDs are missing. In this case, we'll use |
Co-authored-by: Florian Valeye <[email protected]>
Co-authored-by: Renjie Liu <[email protected]>
This reverts commit 8729155.
liurenjie1024
left a comment
There was a problem hiding this comment.
Thanks @CTTY for this pr, LGTM!
| } | ||
|
|
||
| /// Create a new `ParquetWriterBuilder` with custom match mode | ||
| pub fn new_with_match_mode( |
There was a problem hiding this comment.
| pub fn new_with_match_mode( | |
| pub fn with_field_match_mode( |
…ds (apache#1561) ## Which issue does this PR close? - Closes apache#1560 ## What changes are included in this PR? - Added `FieldMatchMode` to `ArrowArrayAccessor` - Added `match_mode` to `ParquetWriterBuilder`, `NanValueCountVisitor` so user have control over this behavior on the writer level - Hardcoded `IcebergWriteExec` to use `FieldMatchMode::Name` ## Are these changes tested? Yes, added uts --------- Co-authored-by: Florian Valeye <[email protected]> Co-authored-by: Renjie Liu <[email protected]>
…ds (apache#1561) ## Which issue does this PR close? - Closes apache#1560 ## What changes are included in this PR? - Added `FieldMatchMode` to `ArrowArrayAccessor` - Added `match_mode` to `ParquetWriterBuilder`, `NanValueCountVisitor` so user have control over this behavior on the writer level - Hardcoded `IcebergWriteExec` to use `FieldMatchMode::Name` ## Are these changes tested? Yes, added uts --------- Co-authored-by: Florian Valeye <[email protected]> Co-authored-by: Renjie Liu <[email protected]>
Which issue does this PR close?
What changes are included in this PR?
FieldMatchModetoArrowArrayAccessormatch_modetoParquetWriterBuilder,NanValueCountVisitorso user have control over this behavior on the writer levelIcebergWriteExecto useFieldMatchMode::NameAre these changes tested?
Yes, added uts