Skip to content

Add external schema mappings for files written with name-based schemas #40#207

Merged
rdblue merged 15 commits intoapache:masterfrom
rdsr:ext_map
Oct 22, 2019
Merged

Add external schema mappings for files written with name-based schemas #40#207
rdblue merged 15 commits intoapache:masterfrom
rdsr:ext_map

Conversation

@rdsr
Copy link
Contributor

@rdsr rdsr commented Jun 6, 2019

Here's a first pass at implementing this.

External mapping is maintained as a field name -> field id map. So each alias will be an different entry in the map with the same id.

AssignFieldIds assign field ids to fields. This visitor updates the schema instead of building a new schema with field ids. I felt this was OK. Let me know if this is not the case.

TODOs

  1. Support schema insensitivity
  2. Add tests for aliasing
  3. Maybe have a better testing strategy? . (The way I build the mapping seems a hack IMO)

In code, Open questions I've added as TODOs. I'd appreciate to if we can discuss those in the comments

@rdsr
Copy link
Contributor Author

rdsr commented Jun 12, 2019

Thanks @rdblue for the comments. I wanted to get a sense from you for the overall approach here, specifically around the mapping , where the fieldname to id map contains non canonical field names and testing strategy. Do you think the current approach is sound enough?

@rdsr
Copy link
Contributor Author

rdsr commented Jun 23, 2019

Hi @rdblue , please have a look when u get the time. I've addressed your remaining comments. I need your inputs specifically on some of the TODOs I've added to the code.

@rdblue
Copy link
Contributor

rdblue commented Jul 7, 2019

Sorry for not getting back to this very quickly. I'm planning to review it this week.

@rdblue
Copy link
Contributor

rdblue commented Jul 14, 2019

@rdsr, thanks for your patience. This looks like a great start.

I made a few specific comments. In addition, I think we may want to apply the schema mapping differently. Instead of building an Avro schema with IDs, why not add an optional mapping to the conversion from Avro to Iceberg schema? Then that conversion could drop fields that don't have IDs, either embedded in the schema or from a mapping. I think that would fit a bit more cleanly because it would not be creating an intermediate file schema. It would be producing the Iceberg equivalent of the Avro schema instead. What do you think?

@rdsr
Copy link
Contributor Author

rdsr commented Jul 18, 2019

I think the passing an optional mapping to the conversion from Avro to Iceberg schema makes sense. Let me investigate more on this.

Thanks for the comments!

@rdblue
Copy link
Contributor

rdblue commented Jul 24, 2019

@rdsr, I've been working on #41, and I think we may have some restructuring to the mapping, instead of a simple name to ID map. Supporting aliases with a simple map requires enumerating all of the possibilities, and it is difficult to work with. I think a multi-level approach, where each struct has a mapping is going to be better to work with. I'll clean up what I have and post a PR for you to look at.

@rdsr
Copy link
Contributor Author

rdsr commented Jul 24, 2019

Thanks @rdblue !

@rdsr
Copy link
Contributor Author

rdsr commented Sep 9, 2019

@rdblue , picking this up again and going through your changes in #41 .

I think the idea passing an optional mapping to avro to iceberg schema is workable, and I think u mean this api AvroSchemaUtil#buildAvroProjection.

I'm working on this this week and will try to post a patch soon

@rdsr
Copy link
Contributor Author

rdsr commented Sep 10, 2019

@rdblue I've updated the approach here.

I now pass an optional NameMapping to BuildAvroProjection
field ids are now determined from first the field-id property and then from NameMapping .
The logic of how a field is selected in each of the methods (field(), map(), array() etc) is also slightly changed.

The general logic now is

  1. If the field does not have the field id (either as a property or in NameMapping) it is not selected
  2. if it resulting child node is null it is not selected
  3. if its resulting child node is different, a new field is created.

@rdblue
Copy link
Contributor

rdblue commented Sep 14, 2019

Thanks for the update, @rdsr! I'll have a look next week when I'm back.

@rdsr rdsr force-pushed the ext_map branch 6 times, most recently from 189dc0a to fb69c79 Compare October 19, 2019 13:35
…Mapping,

* Change the read schema to rename a field uniquely if that field is not projected by NameMapping
@rdsr
Copy link
Contributor Author

rdsr commented Oct 19, 2019

I've updated the BuildAvroProjection to rename missing optional fields.
Also updated tests to have a complete round trip to data file and back.

Requesting a review @rdblue !

@rdblue
Copy link
Contributor

rdblue commented Oct 22, 2019

Merging this. Nice work, @rdsr! Thanks for taking care of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants