@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

Arrow has a first‑class Run‑End Encoded (REE) data type that efficiently represents consecutive repeated values by storing run ends (indices) alongside the values array. Adding REE support to arrow-avro lets users read/write Arrow REE arrays to Avro without inflating them, preserving size and performance characteristics across serialization boundaries.
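To make the layout concrete, here is a minimal sketch of run-end encoding over plain Rust vectors. It does not use the Arrow `RunArray` API; the function names `run_end_encode` and `run_end_decode` are hypothetical, and `i32` run ends are assumed purely for illustration.

```rust
/// Run-end encode a slice into `(run_ends, values)`.
/// `run_ends[i]` is the exclusive end index of run `i` in the logical array.
/// Hypothetical helper for illustration only, not part of arrow-avro.
fn run_end_encode<T: PartialEq + Clone>(input: &[T]) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        // Does this element continue the current run?
        let same_run = values.last().map_or(false, |last| last == item);
        if same_run {
            // Extend the current run by moving its end forward.
            *run_ends.last_mut().unwrap() = (i + 1) as i32;
        } else {
            // Start a new run with a single element.
            values.push(item.clone());
            run_ends.push((i + 1) as i32);
        }
    }
    (run_ends, values)
}

/// Expand `(run_ends, values)` back into the logical array.
fn run_end_decode<T: Clone>(run_ends: &[i32], values: &[T]) -> Vec<T> {
    let mut out = Vec::new();
    let mut start = 0usize;
    for (end, value) in run_ends.iter().zip(values) {
        let end = *end as usize;
        out.extend(std::iter::repeat(value.clone()).take(end - start));
        start = end;
    }
    out
}
```

For example, encoding `["a", "a", "b", "c", "c", "c"]` stores three values and the run ends `[2, 3, 6]` instead of six values.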

What changes are included in this PR?

  • New Avro codec for REE: Introduces Codec::RunEndEncoded(Arc<AvroDataType>, u8) and maps it to Arrow’s DataType::RunEndEncoded. The run‑end index bit width must be one of 16/32/64, and the generated Arrow fields use the standard child names run_ends (non‑nullable) and values.
  • Schema parsing & validation: Recognizes the Avro logical type arrow.run-end-encoded and requires the attribute arrow.runEndIndexBits (one of 16, 32, 64). Missing or invalid values yield clear parse errors.
  • Nullability propagation: When REE appears inside nullable unions, nullability is “bubbled” into the values branch so Avro JSON generation models nullability correctly.
  • Union integration: From<&Codec> for UnionFieldKind is updated so REE defers to the inner value codec, ensuring unions of REE types resolve as expected.
  • Feature wiring / dependency: Enables REE handling behind the existing avro_custom_types feature and adds an optional dependency on arrow-select (feature now includes "arrow-select").
  • Reader/Writer updates: Enhances the encoder/reader paths to round‑trip REE arrays end‑to‑end.
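The bit-width rule for `arrow.runEndIndexBits` can be sketched as below. This is an illustration under stated assumptions, not the code in this PR: the `RunEndWidth` enum, the `parse_run_end_index_bits` function, and the error strings are all hypothetical, and the attribute is assumed to have already been extracted from the schema JSON as an optional integer.

```rust
/// Hypothetical stand-in for the three supported run-end index widths.
#[derive(Debug, PartialEq, Clone, Copy)]
enum RunEndWidth {
    Int16,
    Int32,
    Int64,
}

/// Validate the `arrow.runEndIndexBits` attribute (assumed to be already
/// parsed from the schema as an optional integer).
/// A missing or unsupported value is reported as an error string.
fn parse_run_end_index_bits(attr: Option<i64>) -> Result<RunEndWidth, String> {
    match attr {
        None => Err(
            "logical type arrow.run-end-encoded requires arrow.runEndIndexBits".to_string(),
        ),
        Some(16) => Ok(RunEndWidth::Int16),
        Some(32) => Ok(RunEndWidth::Int32),
        Some(64) => Ok(RunEndWidth::Int64),
        Some(other) => Err(format!(
            "arrow.runEndIndexBits must be 16, 32, or 64, got {other}"
        )),
    }
}
```

Rejecting anything outside 16/32/64 at parse time keeps the later mapping to Arrow's `DataType::RunEndEncoded` total over the accepted widths.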

Are these changes tested?

Yes. This commit adds end‑to‑end tests that round‑trip REE arrays with run‑end index types Int16, Int32, and Int64 through the Avro reader/writer to validate schema, encoding, and decoding.

Are there any user-facing changes?

N/A since arrow-avro is not public yet.

@github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels on Oct 10, 2025
@jecsand838
Contributor Author

@alamb @mbrobbel I figured I'd try to get one more in before v57.0.0. This one should help with the DataFusion Avro datasource.

@mbrobbel added this to the 57.0.0 milestone on Oct 10, 2025
@mbrobbel merged commit 74c0386 into apache:main on Oct 13, 2025
24 checks passed
@mbrobbel
Member

Thanks @jecsand838

mbrobbel pushed a commit that referenced this pull request Oct 15, 2025
…arrow-avro`, (#8595)

# Which issue does this PR close?

- Closes #4886 
- Stacked on #8584 

# Rationale for this change

This PR brings Arrow-Avro round‑trip coverage up to date with modern
Arrow types and the latest Avro logical types. In particular, Avro 1.12
adds `timestamp-nanos` and `local-timestamp-nanos`. Enabling these
logical types and filling in missing Avro writer encoders for Arrow’s
newer *view* and list families allows lossless read/write and simpler
pipelines.

It also hardens timestamp/time scaling in the writer to avoid silent
overflow when converting seconds to milliseconds, surfacing a clear
error instead.

# What changes are included in this PR?

* **Nanosecond timestamps**: Introduces a `TimestampNanos(bool)` codec
in `arrow-avro` that maps Avro `timestamp-nanos` /
`local-timestamp-nanos` to Arrow `Timestamp(Nanosecond, tz)`. The
reader/decoder, union field kinds, and Arrow `DataType` mapping are all
extended accordingly. Logical type detection is wired through both
`logicalType` and the `arrowTimeUnit="nanosecond"` attribute.
* **UUID logical type round‑trip fix**: When reading Avro
`logicalType="uuid"` fields, preserve that logical type in Arrow field
metadata so writers can round‑trip it back to Avro.
* **Avro writer encoders**: Add the missing array encoders and coverage
for Arrow’s `ListView`, `LargeListView`, and `FixedSizeList`, and extend
array encoder support to `BinaryView` and `Utf8View`. (See large
additions in `writer/encoder.rs`.)
* **Safer time/timestamp scaling**: Guard second-to-millisecond
conversions in `Time32`/`Timestamp` encoders to prevent overflow;
encoding now returns a clear `InvalidArgument` error in those cases.
* **Schema utilities**: Add `AvroSchemaOptions` with `null_order` and
`strip_metadata` flags so Avro JSON can be built while optionally
omitting internal Arrow keys during round‑trip schema generation.
* **Tests & round‑trip coverage**: Add unit tests for nanosecond
timestamp decoding (UTC, local, and with nulls) and additional
end‑to‑end/round‑trip tests for the updated writer paths.
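
The guarded second-to-millisecond scaling described above can be sketched with checked arithmetic. This is a minimal illustration, not the encoder code itself: `ScaleError`, `timestamp_seconds_to_millis`, and `time32_seconds_to_millis` are hypothetical names, and `ScaleError::InvalidArgument` stands in for the `InvalidArgument` error the encoders return.

```rust
/// Hypothetical stand-in for the error returned by the encoders.
#[derive(Debug, PartialEq)]
enum ScaleError {
    InvalidArgument(String),
}

/// Scale a timestamp in seconds (i64) to milliseconds,
/// returning an error instead of wrapping on overflow.
fn timestamp_seconds_to_millis(secs: i64) -> Result<i64, ScaleError> {
    secs.checked_mul(1_000).ok_or_else(|| {
        ScaleError::InvalidArgument(format!(
            "timestamp of {secs} s overflows i64 when scaled to milliseconds"
        ))
    })
}

/// Scale a time-of-day in seconds (i32) to milliseconds that must still fit in i32.
fn time32_seconds_to_millis(secs: i32) -> Result<i32, ScaleError> {
    secs.checked_mul(1_000).ok_or_else(|| {
        ScaleError::InvalidArgument(format!(
            "time of {secs} s overflows i32 when scaled to milliseconds"
        ))
    })
}
```

`checked_mul` returns `None` on overflow, so the failure surfaces as an explicit error rather than a silently wrapped value.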

# Are these changes tested?

Yes.

* New decoder tests validate `Timestamp(Nanosecond, tz)` behavior for
UTC and local timestamps and for nullable unions.
* Writer tests validate the nanosecond encoder and exercise an overflow
path for second→millisecond conversion that now returns an error.
* Additional round‑trip tests were added alongside the new encoders. 

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
@jecsand838 deleted the avro-ree-support branch on October 24, 2025 02:58