Add support for run-end encoded (REE) arrays in arrow-avro #8584

jecsand838 · 2025-10-10T06:55:13Z

Which issue does this PR close?

Part of Add Avro Support #4886

Rationale for this change

Arrow has a first‑class Run‑End Encoded (REE) data type that efficiently represents consecutive repeated values by storing run ends (indices) alongside the values array. Adding REE support to arrow-avro lets users read/write Arrow REE arrays to Avro without inflating them, preserving size and performance characteristics across serialization boundaries.
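To make the layout concrete, here is a minimal std-only sketch of run-end encoding as described above. It is not the arrow-rs API: `run_end_encode`, `run_end_decode`, and `value_at` are hypothetical helpers, and 32-bit run ends are assumed purely for illustration.

```rust
/// Run-end encode a slice into `(run_ends, values)`.
/// `run_ends[i]` is the exclusive logical end index of run `i`.
/// Hypothetical helper for illustration; not part of arrow-rs.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        // Panics rather than truncating if the input exceeds the 32-bit index range.
        let end = i32::try_from(i + 1).expect("input too long for 32-bit run ends");
        if values.last() == Some(item) {
            // Same value as the current run: move that run's end forward.
            *run_ends.last_mut().unwrap() = end;
        } else {
            // New value: start a new run.
            values.push(item.clone());
            run_ends.push(end);
        }
    }
    (run_ends, values)
}

/// Expand `(run_ends, values)` back into the flat logical sequence.
fn run_end_decode<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0i32;
    for (&end, value) in run_ends.iter().zip(values) {
        out.extend(std::iter::repeat(value.clone()).take((end - start) as usize));
        start = end;
    }
    out
}

/// Look up the value at a logical index by binary-searching the run ends.
fn value_at<'a, T>(run_ends: &[i32], values: &'a [T], index: i32) -> Option<&'a T> {
    if index < 0 {
        return None;
    }
    // Index of the first run whose end is strictly greater than `index`.
    let run = run_ends.partition_point(|&end| end <= index);
    values.get(run)
}
```

For example, encoding `["a", "a", "a", "b", "c", "c"]` gives run ends `[3, 4, 6]` and values `["a", "b", "c"]`: six logical elements are stored as three runs, which is the saving that a format boundary would lose if it expanded the array.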

What changes are included in this PR?

New Avro codec for REE: Introduces Codec::RunEndEncoded(Arc<AvroDataType>, u8) and maps it to Arrow’s DataType::RunEndEncoded. The run‑end index bit width must be one of 16/32/64, and the generated Arrow fields use the standard child names run_ends (non‑nullable) and values.
Schema parsing & validation: Recognizes the Avro logical type arrow.run-end-encoded and requires the attribute arrow.runEndIndexBits (one of 16, 32, 64). Missing or invalid values yield clear parse errors.
Nullability propagation: When REE appears inside nullable unions, nullability is “bubbled” into the values branch so Avro JSON generation models nullability correctly.
Union integration: From<&Codec> for UnionFieldKind is updated so REE defers to the inner value codec, ensuring unions of REE types resolve as expected.
Feature wiring / dependency: Enables REE handling behind the existing avro_custom_types feature and adds an optional dependency on arrow-select (feature now includes "arrow-select").
Reader/Writer updates: Enhances the encoder/reader paths to round‑trip REE arrays end‑to‑end.
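The validation rule in the schema parsing item above can be sketched as follows. This is a std-only illustration under stated assumptions: the attribute is assumed to have already been extracted from the schema JSON as an optional integer, `RunEndWidth` and `parse_run_end_index_bits` are hypothetical names rather than arrow-avro identifiers, and errors are plain strings rather than the crate's error type.

```rust
/// Run-end index widths allowed by the PR description (16, 32, or 64 bits).
/// Hypothetical type for illustration; not an arrow-avro identifier.
#[derive(Debug, PartialEq)]
enum RunEndWidth {
    Int16,
    Int32,
    Int64,
}

/// Validate the `arrow.runEndIndexBits` attribute of a schema whose
/// logical type is `arrow.run-end-encoded`.
/// `attr` is `None` when the attribute is absent from the schema.
fn parse_run_end_index_bits(attr: Option<i64>) -> Result<RunEndWidth, String> {
    match attr {
        Some(16) => Ok(RunEndWidth::Int16),
        Some(32) => Ok(RunEndWidth::Int32),
        Some(64) => Ok(RunEndWidth::Int64),
        // Any other width is rejected with a message naming the bad value.
        Some(other) => Err(format!(
            "invalid arrow.runEndIndexBits: {}; expected 16, 32, or 64",
            other
        )),
        // The attribute is required for this logical type.
        None => Err("arrow.run-end-encoded requires arrow.runEndIndexBits".to_string()),
    }
}
```

With this sketch, `parse_run_end_index_bits(Some(32))` yields `Ok(RunEndWidth::Int32)`, while `Some(8)` and `None` both produce an error message that names the attribute.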

Are these changes tested?

Yes. This commit adds end‑to‑end tests that round‑trip REE arrays with run‑end index types Int16, Int32, and Int64 through the Avro reader/writer to validate schema, encoding, and decoding.

Are there any user-facing changes?

N/A since arrow-avro is not public yet.

jecsand838 · 2025-10-10T07:06:31Z

@alamb @mbrobbel I figured I'd try to get one more in before v57.0.0. This one should help with the DataFusion Avro datasource.

mbrobbel · 2025-10-13T09:50:14Z

Thanks @jecsand838

…arrow-avro`, (#8595)

# Which issue does this PR close?

- Closes #4886
- Stacked on #8584

# Rationale for this change

This PR brings Arrow-Avro round‑trip coverage up to date with modern Arrow types and the latest Avro logical types. In particular, Avro 1.12 adds `timestamp-nanos` and `local-timestamp-nanos`. Enabling these logical types and filling in missing Avro writer encoders for Arrow’s newer *view* and list families allows lossless read/write and simpler pipelines. It also hardens timestamp/time scaling in the writer to avoid silent overflow when converting seconds to milliseconds, surfacing a clear error instead.

# What changes are included in this PR?

* **Nanosecond timestamps**: Introduces a `TimestampNanos(bool)` codec in `arrow-avro` that maps Avro `timestamp-nanos` / `local-timestamp-nanos` to Arrow `Timestamp(Nanosecond, tz)`. The reader/decoder, union field kinds, and Arrow `DataType` mapping are all extended accordingly. Logical type detection is wired through both `logicalType` and the `arrowTimeUnit="nanosecond"` attribute.
* **UUID logical type round‑trip fix**: When reading Avro `logicalType="uuid"` fields, preserve that logical type in Arrow field metadata so writers can round‑trip it back to Avro.
* **Avro writer encoders**: Add the missing array encoders and coverage for Arrow’s `ListView`, `LargeListView`, and `FixedSizeList`, and extend array encoder support to `BinaryView` and `Utf8View`. (See large additions in `writer/encoder.rs`.)
* **Safer time/timestamp scaling**: Guard second to millisecond conversions in `Time32`/`Timestamp` encoders to prevent overflow; encoding now returns a clear `InvalidArgument` error in those cases.
* **Schema utilities**: Add `AvroSchemaOptions` with `null_order` and `strip_metadata` flags so Avro JSON can be built while optionally omitting internal Arrow keys during round‑trip schema generation.
* **Tests & round‑trip coverage**: Add unit tests for nanosecond timestamp decoding (UTC, local, and with nulls) and additional end‑to‑end/round‑trip tests for the updated writer paths.

# Are these changes tested?

Yes.

* New decoder tests validate `Timestamp(Nanosecond, tz)` behavior for UTC and local timestamps and for nullable unions.
* Writer tests validate the nanosecond encoder and exercise an overflow path for second→millisecond conversion that now returns an error.
* Additional round‑trip tests were added alongside the new encoders.

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
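The overflow guard mentioned for second to millisecond conversions in #8595 can be sketched with checked arithmetic. This is a std-only illustration, not the actual encoder code: `timestamp_seconds_to_millis` and `time32_seconds_to_millis` are hypothetical names, and a `String` stands in for the `InvalidArgument` error the commit message refers to.

```rust
/// Scale a timestamp in seconds (i64) to milliseconds, failing instead of wrapping.
/// Hypothetical helper; the error type is a stand-in for the crate's own error.
fn timestamp_seconds_to_millis(seconds: i64) -> Result<i64, String> {
    seconds
        .checked_mul(1_000)
        .ok_or_else(|| format!("timestamp {} s overflows i64 when scaled to ms", seconds))
}

/// Same guard for 32-bit time-of-day values in seconds.
fn time32_seconds_to_millis(seconds: i32) -> Result<i32, String> {
    seconds
        .checked_mul(1_000)
        .ok_or_else(|| format!("time {} s overflows i32 when scaled to ms", seconds))
}
```

A plain `seconds * 1_000` would panic in debug builds and wrap silently in release builds; `checked_mul` returns `None` on overflow, which lets the caller surface an error instead.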

github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Oct 10, 2025

Add support for run-end encoded (REE) arrays in arrow-avro

e371251

jecsand838 force-pushed the avro-ree-support branch from 12c3757 to e371251 Compare October 10, 2025 07:30

mbrobbel added this to the 57.0.0 milestone Oct 10, 2025

jecsand838 added 3 commits October 10, 2025 11:01

Merge branch 'main' into avro-ree-support

856fb89

Merge branch 'main' into avro-ree-support

a820dd2

Merge branch 'main' into avro-ree-support

cd4a6c2

mbrobbel approved these changes Oct 13, 2025

arrow-avro/src/reader/record.rs (outdated, resolved)

arrow-avro/src/reader/record.rs (outdated, resolved)

arrow-avro/src/codec.rs (outdated, resolved)

arrow-avro/Cargo.toml (outdated, resolved)

jecsand838 mentioned this pull request Oct 13, 2025

Add ArrowError::AvroError, remaining types and roundtrip tests to arrow-avro, #8595

Merged

jecsand838 and others added 4 commits October 13, 2025 03:05

Update arrow-avro/src/reader/record.rs

57329cc

Co-authored-by: Matthijs Brobbel <[email protected]>

Update arrow-avro/src/reader/record.rs

ec4eb36

Co-authored-by: Matthijs Brobbel <[email protected]>

Update arrow-avro/Cargo.toml

343813f

Co-authored-by: Matthijs Brobbel <[email protected]>

Address PR Comments

ade92b8

mbrobbel merged commit 74c0386 into apache:main Oct 13, 2025
24 checks passed

jecsand838 deleted the avro-ree-support branch October 24, 2025 02:58
