Core: Refactor internal Avro reader to resolve schemas directly #9366

rdblue merged 8 commits into apache:main
Conversation
```java
/**
 * An interface for Avro DatumReaders to support custom record classes.
 */
interface SupportsCustomRecords {
}
```

This is internal for now. I'm not sure that we want to expose it more broadly yet.
```java
public abstract static class PlannedStructReader<S>
    implements ValueReader<S>, SupportsRowPosition {
```

This is the new base class that implements a read plan, which is a list of positions and readers. The read plan is produced by the new logic in GenericAvroReader that handles structs.
```java
static class PositionReader implements ValueReader<Long>, SupportsRowPosition {

  PositionReader(long rowPosition) {
    this.currentPosition = rowPosition - 1;
  }
```

This refactor includes changes to make PositionReader easier to use. Rather than needing to instantiate it inside of setRowPositionSupplier, it now just implements SupportsRowPosition. The logic for creating a position reader is now part of the object model rather than part of StructReader.
```diff
-Assertions.assertThat(projected.get("location_r5"))
+Assertions.assertThat(projected.get("location"))
     .as("Field missing from table mapping is renamed")
     .isNotNull();
```

These test updates fix the odd behavior caused by rewriting the read schema. Name mapping previously needed to produce fields with incorrect names in order to project fields but not have them read (by name) from the data file.
Fokko left a comment
Left a few questions, but looks good! Excited to see the skipping in there as well. When we write the sizes for arrays/maps then this should also speed up reading quite a bit.
```java
  this.nameMapping = MappingUtil.create(schema);
}

DatumReader<D> reader;
```

I always like to make these final so you're sure that it doesn't skip through a branch. Suggested change:

```diff
-DatumReader<D> reader;
+final DatumReader<D> reader;
```
I don't think it is necessary in this case. The compiler will catch if it is unset because no default was provided.
```diff
 ((SupportsRowPosition) reader)
-    .setRowPositionSupplier(() -> AvroIO.findStartingRowPos(file::newStream, start));
+    .setRowPositionSupplier(
+        Suppliers.memoize(() -> AvroIO.findStartingRowPos(file::newStream, start)));
```
Why the memoize? Are we reading the same file multiple times?
Previously, this was being done in the StructValueReader. If the struct reader inserted a PositionReader, it would also rewrite the position supplier.
That was a lot of complication for the value reader and didn't work in all cases (for example, if two structs had position columns) so I moved the memoization here. It's simpler that way and enabled us to add position readers that are constructed in the same place as the other readers, instead of needing to keep track of the position index in a struct and inject when setRowPositionSupplier is called.
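The memoization described above can be sketched without the full reader machinery. This uses only `java.util.function.Supplier` as a stdlib stand-in for Guava's `Suppliers.memoize`; the counter stands in for the (expensive) `AvroIO.findStartingRowPos` scan:

```java
import java.util.function.Supplier;

// Sketch: several readers may ask for the starting row position, but the
// file scan that computes it should run only once.
public class MemoizeSketch {
  static int computeCount = 0;

  static <T> Supplier<T> memoize(Supplier<T> delegate) {
    return new Supplier<T>() {
      private T value;
      private boolean computed = false;

      @Override
      public synchronized T get() {
        if (!computed) {
          value = delegate.get(); // compute once, cache forever
          computed = true;
        }
        return value;
      }
    };
  }

  public static void main(String[] args) {
    Supplier<Long> startPos = memoize(() -> {
      computeCount++; // stands in for the expensive findStartingRowPos scan
      return 42L;
    });
    long a = startPos.get();
    long b = startPos.get();
    System.out.println(a + "," + b + "," + computeCount); // prints "42,42,1"
  }
}
```

This is why a single memoized supplier is simpler than rewriting suppliers inside the struct reader: every position reader can share one supplier and the scan still happens once.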
```java
  }
}

private static class RequiredOptionReader implements ValueReader<Object> {
```
Why do we need this next to the UnionReader?
This is actually unused so we could remove it. The purpose is to be able to replace the union reader with one that checks that the value is non-null for cases where the file has an optional field but the expected schema requires it.
With Iceberg's schema evolution rules, we should never have that case, which is why I didn't end up using this (it was complicated and of little value). But I included the class just in case we want it in the future.
It would be good to hear what you think. Should we keep or remove it?
Since you've approved this and I don't see any other required changes, I'm going to remove this to unblock getting this commit in.
Thanks for the context. I would remove it, the PR looks good, so feel free to merge it once the tests are green 👍
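For reference, the behavior described for the (now removed) class could be sketched as a wrapper that delegates to a union reader and rejects nulls. Types here are simplified stand-ins, not the actual Iceberg classes:

```java
// Hedged sketch: wrap an underlying reader and fail if a field that the
// expected schema requires comes back null (e.g. the file wrote it as an
// optional union).
public class RequiredOptionSketch {
  interface ValueReader<T> {
    T read(Object datum);
  }

  static class RequiredOptionReader implements ValueReader<Object> {
    private final ValueReader<?> unionReader;

    RequiredOptionReader(ValueReader<?> unionReader) {
      this.unionReader = unionReader;
    }

    @Override
    public Object read(Object datum) {
      Object value = unionReader.read(datum);
      if (value == null) {
        // the expected schema requires this field; a null from the file is invalid
        throw new IllegalStateException("Missing required value");
      }
      return value;
    }
  }

  public static void main(String[] args) {
    RequiredOptionReader reader = new RequiredOptionReader(datum -> datum);
    System.out.println(reader.read("x")); // prints "x"
    try {
      reader.read(null);
    } catch (IllegalStateException e) {
      System.out.println("rejected null");
    }
  }
}
```

As noted in the thread, Iceberg's schema evolution rules should prevent an optional-in-file / required-in-read mismatch, which is why the check was dropped.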
```java
Types.NestedField field = expected.field(fieldId);
if (constant != null) {
  readPlan.add(Pair.of(pos, ValueReaders.constant(constant)));
} else if (fieldId == MetadataColumns.IS_DELETED.fieldId()) {
```

Do we need to codify these cases? They should just follow the Iceberg spec like any other Avro file.
Yes, I think this is better than how we did it before.
Previously, we would inject these fields in the StructReader, but there are improvements that we can make to that approach:
- It isn't the struct reader's responsibility to skip or change the readers that are passed to it. Read "planning" should be done here, where the read and file schemas are both present.
- It wasn't clear what readers should be passed in or produced here, given that readers might be replaced.
- This unifies how new optional fields are handled with how row position and metadata fields are handled. It also sets up future default value handling using a constant reader, which is one of the reasons for making these changes.
Thanks for reviewing, @Fokko!
```java
Object constant = idToConstant.get(fieldId);
Types.NestedField field = expected.field(fieldId);
if (constant != null) {
  readPlan.add(Pair.of(pos, ValueReaders.constant(constant)));
```

I think here we need something along the lines of GenericAvroConstantReader that returns the constant in the Avro GenericData.Record format. Right now the ConstantReader class returns the constant object as is. Most of the time this constant is an Iceberg data constant, but what we need here is an Avro GenericData.Record. We can extend ConstantReader<T> here, but it is a private class. Can we promote it to public?
This refactors the Avro generic reader so that it resolves schemas directly (like PyIceberg) rather than creating an Avro schema to trick Avro's ResolvingDecoder into projecting columns correctly. This makes the read path easier to maintain because there is no need to hijack and rewrite schemas in ProjectionDatumReader using BuildAvroProjection. This should make it much easier to add default value support.
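The "resolve schemas directly" idea in the description can be sketched as a walk over the expected (read) schema: each field is either matched against the file schema and read, or projected as null/default when absent. This is a simplified, name-based illustration; Iceberg's actual resolution is id-based:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hedged sketch of direct schema resolution: no rewritten Avro schema, no
// ResolvingDecoder trick; just a plan computed from the two schemas.
public class ResolveSketch {
  static Map<String, String> resolve(
      Map<String, String> expectedSchema, Map<String, String> fileSchema) {
    Map<String, String> resolution = new LinkedHashMap<>();
    for (String field : expectedSchema.keySet()) {
      if (fileSchema.containsKey(field)) {
        resolution.put(field, "read"); // present in the file: decode it
      } else {
        resolution.put(field, "null"); // absent: project null (or a default)
      }
    }
    return resolution;
  }

  public static void main(String[] args) {
    Map<String, String> expected = new LinkedHashMap<>();
    expected.put("id", "long");
    expected.put("data", "string");
    Map<String, String> file = Map.of("id", "long");
    System.out.println(resolve(expected, file)); // prints "{id=read, data=null}"
  }
}
```

Because the projection decision is explicit in the plan, adding default values later means changing only the "absent" branch, which is the maintainability win the description points at.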