Add support for run-end encoded (REE) arrays in arrow-avro #8584
Merged
12c3757 to
e371251
mbrobbel approved these changes on Oct 13, 2025
Co-authored-by: Matthijs Brobbel <[email protected]>
Thanks @jecsand838
mbrobbel pushed a commit that referenced this pull request on Oct 15, 2025
…arrow-avro`, (#8595)

# Which issue does this PR close?

- Closes #4886
- Stacked on #8584

# Rationale for this change

This PR brings Arrow-Avro round‑trip coverage up to date with modern Arrow types and the latest Avro logical types. In particular, Avro 1.12 adds `timestamp-nanos` and `local-timestamp-nanos`. Enabling these logical types and filling in missing Avro writer encoders for Arrow’s newer *view* and list families allows lossless read/write and simpler pipelines. It also hardens timestamp/time scaling in the writer to avoid silent overflow when converting seconds to milliseconds, surfacing a clear error instead.

# What changes are included in this PR?

* **Nanosecond timestamps**: Introduces a `TimestampNanos(bool)` codec in `arrow-avro` that maps Avro `timestamp-nanos` / `local-timestamp-nanos` to Arrow `Timestamp(Nanosecond, tz)`. The reader/decoder, union field kinds, and Arrow `DataType` mapping are all extended accordingly. Logical type detection is wired through both `logicalType` and the `arrowTimeUnit="nanosecond"` attribute.
* **UUID logical type round‑trip fix**: When reading Avro `logicalType="uuid"` fields, preserve that logical type in Arrow field metadata so writers can round‑trip it back to Avro.
* **Avro writer encoders**: Add the missing array encoders and coverage for Arrow’s `ListView`, `LargeListView`, and `FixedSizeList`, and extend array encoder support to `BinaryView` and `Utf8View`. (See large additions in `writer/encoder.rs`.)
* **Safer time/timestamp scaling**: Guard second to millisecond conversions in `Time32`/`Timestamp` encoders to prevent overflow; encoding now returns a clear `InvalidArgument` error in those cases.
* **Schema utilities**: Add `AvroSchemaOptions` with `null_order` and `strip_metadata` flags so Avro JSON can be built while optionally omitting internal Arrow keys during round‑trip schema generation.
* **Tests & round‑trip coverage**: Add unit tests for nanosecond timestamp decoding (UTC, local, and with nulls) and additional end‑to‑end/round‑trip tests for the updated writer paths.

# Are these changes tested?

Yes.

* New decoder tests validate `Timestamp(Nanosecond, tz)` behavior for UTC and local timestamps and for nullable unions.
* Writer tests validate the nanosecond encoder and exercise an overflow path for second→millisecond conversion that now returns an error.
* Additional round‑trip tests were added alongside the new encoders.

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
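The overflow guard described in the commit message above can be illustrated with checked arithmetic. This is only a minimal sketch, not the code from `writer/encoder.rs`: the `InvalidArgument` struct and the functions `seconds_to_millis` and `time32_seconds_to_millis` are hypothetical stand-ins, and the assumption is that a failed scaling should surface as an error rather than wrap silently.

```rust
/// Hypothetical stand-in for the writer's `InvalidArgument` error.
#[derive(Debug, PartialEq)]
struct InvalidArgument(String);

/// Scale a 64-bit seconds value to milliseconds.
/// `checked_mul` returns `None` on overflow instead of wrapping.
fn seconds_to_millis(secs: i64) -> Result<i64, InvalidArgument> {
    secs.checked_mul(1_000).ok_or_else(|| {
        InvalidArgument(format!("{} s does not fit in i64 milliseconds", secs))
    })
}

/// Scale a 32-bit seconds value (as stored by Arrow `Time32(Second)`)
/// to milliseconds that must still fit in 32 bits.
fn time32_seconds_to_millis(secs: i32) -> Result<i32, InvalidArgument> {
    secs.checked_mul(1_000).ok_or_else(|| {
        InvalidArgument(format!("{} s does not fit in i32 milliseconds", secs))
    })
}
```

Calling `seconds_to_millis(i64::MAX)` yields an `Err` in this sketch, whereas a plain `secs * 1_000` would panic in debug builds and wrap in release builds.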
Which issue does this PR close?
Rationale for this change
Arrow has a first‑class Run‑End Encoded (REE) data type that efficiently represents consecutive repeated values by storing run ends (indices) alongside the values array. Adding REE support to `arrow-avro` lets users read and write Arrow REE arrays to Avro without inflating them, preserving size and performance characteristics across serialization boundaries.
What changes are included in this PR?
* Adds `Codec::RunEndEncoded(Arc<AvroDataType>, u8)` and maps it to Arrow’s `DataType::RunEndEncoded`. The run‑end index bit width must be one of 16/32/64, and the generated Arrow fields use the standard child names `run_ends` (non‑nullable) and `values`.
* Recognizes `arrow.run-end-encoded` and requires the attribute `arrow.runEndIndexBits` (one of 16, 32, 64). Missing or invalid values yield clear parse errors.
* … `values` branch so Avro JSON generation models nullability correctly.
* `From<&Codec> for UnionFieldKind` is updated so REE defers to the inner value codec, ensuring unions of REE types resolve as expected.
* … `avro_custom_types` feature and adds an optional dependency on `arrow-select` (feature now includes `"arrow-select"`).
Are these changes tested?
Yes. This commit adds end‑to‑end tests that round‑trip REE arrays with run‑end index types Int16, Int32, and Int64 through the Avro reader/writer to validate schema, encoding, and decoding.
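To make the round trip concrete, here is a minimal, standard-library-only sketch of run-end encoding and decoding, generic over the run-end integer type so it can be exercised with `i16`, `i32`, and `i64`. It does not use `arrow-avro` or Arrow's array types; `ree_encode` and `ree_decode` are hypothetical helper names, and the only assumption taken from the description above is that each run end is the exclusive logical index at which its run stops.

```rust
use std::convert::TryFrom;

/// Run-end encode `input` into `(run_ends, values)`.
/// Returns `None` if a run end does not fit in the index type `R`.
fn ree_encode<R, T>(input: &[T]) -> Option<(Vec<R>, Vec<T>)>
where
    R: TryFrom<usize>,
    T: Clone + PartialEq,
{
    let mut run_ends: Vec<R> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    for (i, item) in input.iter().enumerate() {
        let end = R::try_from(i + 1).ok()?;
        if values.last() == Some(item) {
            // Same value as the current run: move that run's end forward.
            *run_ends.last_mut()? = end;
        } else {
            // New value: start a new run.
            run_ends.push(end);
            values.push(item.clone());
        }
    }
    Some((run_ends, values))
}

/// Expand `(run_ends, values)` back into the logical sequence.
/// Returns `None` if the inputs are inconsistent.
fn ree_decode<R, T>(run_ends: &[R], values: &[T]) -> Option<Vec<T>>
where
    R: Copy,
    usize: TryFrom<R>,
    T: Clone,
{
    if run_ends.len() != values.len() {
        return None;
    }
    let mut out: Vec<T> = Vec::new();
    for (&end, value) in run_ends.iter().zip(values) {
        let end = usize::try_from(end).ok()?;
        // Run ends must be strictly increasing, so every run is non-empty.
        if end <= out.len() {
            return None;
        }
        out.resize(end, value.clone());
    }
    Some(out)
}
```

With `i16` run ends, the sequence `a, a, b, c, c, c` encodes to run ends `[2, 3, 6]` and values `[a, b, c]`. An input with more than 32,767 logical elements makes `ree_encode::<i16, _>` return `None`, which is one reason the wider index types exist.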
Are there any user-facing changes?
N/A since `arrow-avro` is not public yet.
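The requirement that `arrow.runEndIndexBits` be one of 16, 32, or 64, with missing or invalid values reported as parse errors, could be checked along the following lines. This is a hedged sketch rather than the parser in `arrow-avro`: the `RunEndIndex` enum, the `run_end_index_from_bits` function, the `String` error type, and the error wording are all assumptions made for illustration.

```rust
/// Hypothetical enum naming the three permitted run-end index widths.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunEndIndex {
    Int16,
    Int32,
    Int64,
}

/// Validate the value of the `arrow.runEndIndexBits` attribute.
/// `None` models an attribute that is absent from the schema.
fn run_end_index_from_bits(bits: Option<u64>) -> Result<RunEndIndex, String> {
    match bits {
        Some(16) => Ok(RunEndIndex::Int16),
        Some(32) => Ok(RunEndIndex::Int32),
        Some(64) => Ok(RunEndIndex::Int64),
        Some(other) => Err(format!(
            "invalid arrow.runEndIndexBits {}: expected 16, 32, or 64",
            other
        )),
        None => Err("missing arrow.runEndIndexBits attribute".to_string()),
    }
}
```

Mapping the attribute to a closed enum up front means later code can match exhaustively on the width instead of re-validating a raw integer.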