Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
arrow-avro currently cannot encode/decode a number of Arrow DataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).
The goal is:
- No more `ArrowError::NotYetImplemented` (or similar) when writing/reading an Arrow `RecordBatch` containing supported Arrow types, excluding Sparse Unions (which will be handled separately).
- When compiled with `feature = "avro_custom_types"`: Arrow to Avro to Arrow should round-trip the Arrow `DataType` (including width/signedness/time units and relevant metadata), using Arrow-specific custom logical types that follow the established `arrow.*` pattern.
- When compiled without `avro_custom_types`: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.
Arrow DataType vs arrow-avro code paths
1) Types mapped in schema.rs but not actually writable (Writer errors)
arrow-avro/src/schema.rs already generates Avro schema for several Arrow types, but arrow-avro/src/writer/encoder.rs doesn’t implement the corresponding encoders, so writing fails.
Examples:
- `DataType::Int8`, `Int16`
- `DataType::UInt8`, `UInt16`, `UInt32`, `UInt64`
- `DataType::Float16`
- `DataType::Date64` (note: `schema.rs` maps this to Avro `long` with the standard logical type `local-timestamp-millis`)
- `DataType::Time64(TimeUnit::Nanosecond)`
Concrete failures observed in code:
- `encoder.rs` falls back to `Err(ArrowError::NotYetImplemented(format!("Avro scalar type not yet supported: {other:?}")))` for many of the above (e.g. `Int8`/`Int16`/`UInt*`/`Float16`).
- `Date64` is explicitly rejected in the writer with: `NotYetImplemented("Avro logical type 'date' is days since epoch (int). Arrow Date64 (ms) has no direct Avro logical type; cast to Date32 or to a Timestamp.")` This is inconsistent with `schema.rs`, which currently emits `local-timestamp-millis` for `Date64`.
- `Time64(Nanosecond)` is explicitly rejected with: `NotYetImplemented("Avro writer does not support time-nanos; cast to Time64(Microsecond).")`
This shows up immediately in common scenarios (e.g. encoding a RecordBatch with an Int16 column).
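As background for the fixes proposed below, the Avro wire format makes the narrower integer types straightforward to support: Avro `int`/`long` values are zig-zag encoded and written as variable-length varints, so an `Int16` (or `Int8`/`UInt16`) encoder can simply widen the value and reuse one routine. A minimal standalone sketch (hypothetical helper names, not arrow-avro internals):

```rust
/// Zig-zag encode a value and write it as an Avro varint.
/// Zig-zag interleaves negative and positive values so small
/// magnitudes stay small on the wire.
fn zigzag_varint(value: i64, out: &mut Vec<u8>) {
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    loop {
        let byte = (z & 0x7f) as u8;
        z >>= 7;
        if z == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

/// An Int16 column value widens losslessly into Avro `int`.
fn encode_int16_as_avro_int(v: i16, out: &mut Vec<u8>) {
    zigzag_varint(v as i64, out);
}

fn main() {
    let mut buf = Vec::new();
    encode_int16_as_avro_int(-1, &mut buf);
    assert_eq!(buf, vec![0x01]); // zig-zag(-1) = 1
    buf.clear();
    encode_int16_as_avro_int(300, &mut buf);
    assert_eq!(buf, vec![0xd8, 0x04]); // zig-zag(300) = 600
}
```

The same widening path would serve `Int8`, `UInt8`, and `UInt16`; only the cast at the call site differs.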
2) Schema/encoding mismatches (Writer produces Avro that doesn’t match its own schema)
There are also cases where the schema generated doesn’t match the encoding behavior:
- Interval(YearMonth) / Interval(DayTime)
  - `schema.rs` maps these to `long` with `arrowIntervalUnit=...`
  - `encoder.rs` encodes all Arrow `Interval` variants using an Avro `duration` `fixed(12)` writer (`DurationEncoder`)
  - This is not just a metadata mismatch: Avro `long` values are zig-zag varint encoded, while Avro `fixed(12)` is always 12 raw bytes. Writing `fixed(12)` bytes under a `long` schema can produce invalid / unreadable Avro and corrupt the stream.
- Time32(Second) and Timestamp(Second)
  - `schema.rs` writes these as `int`/`long` with `arrowTimeUnit="second"` (no standard Avro `logicalType`)
  - `encoder.rs` converts seconds to milliseconds on write (`Time32SecondsToMillisEncoder`, `TimestampSecondsToMillisEncoder`)
  - That yields data in millis while the schema/metadata indicate seconds

These mismatches are worse than a mere failure to round-trip: they can lead to incorrect interpretation of the data by readers (including arrow-avro itself).
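The severity of the Interval mismatch can be demonstrated in isolation. The sketch below (hypothetical helpers, not arrow-avro code) builds a standard Avro `duration` `fixed(12)` payload and then reads it the way a decoder honoring the declared `long` schema would:

```rust
/// Standard Avro `duration`: a fixed(12) of three little-endian u32s
/// (months, days, millis).
fn duration_fixed12(months: u32, days: u32, millis: u32) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    out
}

/// Decode a zig-zag varint `long`, returning (value, bytes consumed),
/// as a reader trusting the declared `long` schema would.
fn read_long(buf: &[u8]) -> (i64, usize) {
    let (mut z, mut shift, mut n) = (0u64, 0u32, 0usize);
    for &b in buf {
        z |= ((b & 0x7f) as u64) << shift;
        n += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (((z >> 1) as i64) ^ -((z & 1) as i64), n)
}

fn main() {
    let fixed = duration_fixed12(1, 2, 3);
    // The reader consumes only part of the 12 bytes and yields a
    // meaningless value; the leftover bytes then corrupt the stream.
    let (value, consumed) = read_long(&fixed);
    assert_eq!(value, -1);
    assert_eq!(consumed, 1);
}
```

Every record after the first misread field is then decoded from a shifted offset, which is why this is stream corruption rather than a single bad value.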
3) Reader does not currently round-trip most Arrow-specific types (even when metadata exists)
The reader currently only maps a small set of Arrow-specific types via `avro_custom_types` (notably `Duration*` and `RunEndEncoded`). Many other Arrow-specific or Arrow-width-specific types decode to the nearest Avro primitive:
- Avro `int` always becomes Arrow `Int32` (no path back to `Int8`/`Int16`/`UInt8`/`UInt16`)
- Avro `long` always becomes Arrow `Int64` or one of the timestamp/time logical types (no path back to `UInt32`/`UInt64`)
- There is no `Float16` decode path
Additionally, `codec.rs` contains a special case that treats `Int64` + `arrowTimeUnit == "nanosecond"` as `TimestampNanos(false)`. This conflicts with `schema.rs`'s current representation for `Time64(Nanosecond)` (which also uses `arrowTimeUnit="nanosecond"`): even if writer support were added, the reader-side mapping would likely misinterpret the value as a timestamp unless this is corrected.
Describe the solution you'd like
Implement missing Arrow to Avro support in a way that is consistent across:
- schema generation (`arrow-avro/src/schema.rs`)
- encoding (`arrow-avro/src/writer/encoder.rs`)
- schema to codec mapping (`arrow-avro/src/codec.rs`)
- decoding/builders (`arrow-avro/src/reader/record.rs`)

...and that is explicitly aligned with `avro_custom_types`.
A. Add Writer/Reader support for remaining Arrow primitive width/signedness types
At minimum, remove “type-level” NotYetImplemented errors by adding Writer encoders and Reader support for:
`Int8`, `Int16`, `UInt8`, `UInt16`, `UInt32`, `UInt64`, `Float16`
Proposed mapping behavior:
| Arrow DataType | `avro_custom_types` OFF (interop-first) | `avro_custom_types` ON (round-trip) |
|---|---|---|
| `Int8` / `Int16` | encode as Avro `int` (widen) | Avro `int` with `logicalType: "arrow.int8"` / `"arrow.int16"` |
| `UInt8` / `UInt16` | encode as Avro `int` (as `i32`) | Avro `int` with `logicalType: "arrow.uint8"` / `"arrow.uint16"` |
| `UInt32` | encode as Avro `long` (as `i64`) | Avro `long` with `logicalType: "arrow.uint32"` |
| `UInt64` | encode as Avro `long` when the value fits in `i64`, otherwise error (or documented fallback) | encode with a round-trippable representation (e.g. Avro `fixed(8)` or `bytes`) + `logicalType: "arrow.uint64"` |
| `Float16` | encode as Avro `float` (`f32`) | round-trip via `logicalType: "arrow.float16"` + a representation (e.g. store the IEEE-754 `f16` bits in `fixed(2)` or `int`) |
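To illustrate the `UInt64` row, here is a hedged sketch of both columns: the lossy `long` widening for the OFF case, and one possible round-trippable `fixed(8)` representation for the ON case. The big-endian byte layout is an assumption for illustration, not a decided format:

```rust
/// Hypothetical `arrow.uint64` ON-path: store the raw u64 as 8
/// big-endian bytes in an Avro fixed(8). Round-trips the full range
/// with no i64 overflow concern. (Illustrative layout only.)
fn encode_uint64_fixed8(v: u64) -> [u8; 8] {
    v.to_be_bytes()
}

fn decode_uint64_fixed8(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}

/// OFF-path: widen to Avro `long` only when the value fits in i64,
/// otherwise surface a clear error rather than silently wrapping.
fn encode_uint64_as_long(v: u64) -> Result<i64, String> {
    i64::try_from(v).map_err(|_| format!("{v} does not fit in Avro long"))
}

fn main() {
    // ON: full-range round-trip.
    assert_eq!(decode_uint64_fixed8(encode_uint64_fixed8(u64::MAX)), u64::MAX);
    // OFF: values above i64::MAX must error (or use a documented fallback).
    assert!(encode_uint64_as_long(u64::MAX).is_err());
    assert_eq!(encode_uint64_as_long(42).unwrap(), 42);
}
```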
B. Complete date/time/timestamp support (and align schema + encoding)
Implement:
- `Date64` (currently `schema.rs` emits `long` + `local-timestamp-millis`, but the writer errors)
- `Time64(Nanosecond)` (schema exists but the writer errors; the reader must not confuse it with timestamps)
- Align `Time32(Second)` / `Timestamp(Second)` schema + encoding
Proposed behavior:
| Arrow DataType | `avro_custom_types` OFF | `avro_custom_types` ON |
|---|---|---|
| `Date64` | encode as `long` + `logicalType: "local-timestamp-millis"` (this is what `schema.rs` already emits today; implement the writer accordingly) | `long` + `logicalType: "arrow.date64"` (ms since epoch, round-trips to `Date64`) |
| `Time32(Second)` | encode as `time-millis` (`int`) by scaling seconds to millis | `int` + `logicalType: "arrow.time32-second"`, storing raw seconds |
| `Time64(Nanosecond)` | encode as `time-micros` (`long`), scaling nanos to micros (document/truncate, or require divisibility by 1000) | `long` + `logicalType: "arrow.time64-nanosecond"`, storing raw nanos |
| `Timestamp(Second, tz?)` | encode as `timestamp-millis` / `local-timestamp-millis`, scaling seconds to millis | `long` + `logicalType: "arrow.timestamp-second"` (preserving tz/local-ness as needed) |
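The seconds-to-millis scaling in the OFF column can be sketched as follows. The overflow check is this sketch's addition, not necessarily what the existing `Time32SecondsToMillisEncoder` / `TimestampSecondsToMillisEncoder` do today:

```rust
/// Scale a second-resolution timestamp to milliseconds, failing
/// loudly near the i64 range edges instead of wrapping.
fn seconds_to_millis(secs: i64) -> Result<i64, String> {
    secs.checked_mul(1_000)
        .ok_or_else(|| format!("timestamp {secs}s overflows i64 milliseconds"))
}

fn main() {
    assert_eq!(seconds_to_millis(12).unwrap(), 12_000);
    // Extreme values cannot be represented in millis without overflow.
    assert!(seconds_to_millis(i64::MAX).is_err());
}
```

The key point of section 2 stands regardless of the check: whichever scaling is chosen, the emitted schema metadata must describe the scaled unit, not the original one.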
Notes:
- Arrow's `Date64` spec treats values as "days in milliseconds" (evenly divisible by `86_400_000`). `arrow-rs` does not enforce this at runtime, so the writer should either document its behavior for non-conforming values or optionally validate when writing.
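The optional validation mentioned in the note could look like this (a sketch of the suggestion, not existing arrow-avro behavior):

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Reject Date64 values that are not a whole number of days expressed
/// in milliseconds, per the Arrow columnar format spec.
fn validate_date64(ms: i64) -> Result<i64, String> {
    if ms % MILLIS_PER_DAY == 0 {
        Ok(ms)
    } else {
        Err(format!("Date64 value {ms} is not a whole number of days in ms"))
    }
}

fn main() {
    assert!(validate_date64(2 * MILLIS_PER_DAY).is_ok());
    assert!(validate_date64(123).is_err());
}
```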
C. Implement Interval handling and define custom logical types for round-trip
Current Interval behavior is not consistent between schema and encoding for YearMonth/DayTime, and Avro’s standard duration can’t represent Arrow’s full interval semantics (negative values; MonthDayNano nanos).
Proposed approach:
- With `avro_custom_types` ON: implement custom logical types for each Arrow interval unit that can round-trip (including negative values and nanos):
  - `arrow.interval-year-month`
  - `arrow.interval-day-time`
  - `arrow.interval-month-day-nano` (should preserve `i32` months, `i32` days, `i64` nanos)
- With `avro_custom_types` OFF: encode to the closest standard Avro representation:
  - Prefer Avro `duration` where feasible (and document the constraints: unsigned months/days/millis; millisecond precision)
  - For values not representable (negative, sub-millisecond nanos), return a clear error suggesting enabling `avro_custom_types` or casting
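One possible byte layout for the proposed `arrow.interval-month-day-nano` logical type is a `fixed(16)` of little-endian `i32` months, `i32` days, `i64` nanos. The layout below is illustrative only; the actual representation would be decided during implementation:

```rust
/// Hypothetical fixed(16) layout: i32 months | i32 days | i64 nanos,
/// all little-endian. Unlike the standard Avro `duration` (unsigned
/// months/days/millis), this preserves sign and nanosecond precision.
fn encode_month_day_nano(months: i32, days: i32, nanos: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..16].copy_from_slice(&nanos.to_le_bytes());
    out
}

fn decode_month_day_nano(b: [u8; 16]) -> (i32, i32, i64) {
    (
        i32::from_le_bytes(b[0..4].try_into().unwrap()),
        i32::from_le_bytes(b[4..8].try_into().unwrap()),
        i64::from_le_bytes(b[8..16].try_into().unwrap()),
    )
}

fn main() {
    // Negative components and sub-millisecond nanos round-trip, which
    // standard Avro `duration` cannot represent.
    let enc = encode_month_day_nano(-1, 15, 1_500);
    assert_eq!(decode_month_day_nano(enc), (-1, 15, 1_500));
}
```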
E. Add coverage tests (type-level and round-trip)
Add tests ensuring:
- Writer can encode a `RecordBatch` for each Arrow type above (no `NotYetImplemented`)
- With `avro_custom_types` ON: Arrow to Avro to Arrow schema equality for these types (excluding Sparse union)
- With `avro_custom_types` OFF: Arrow to Avro does not error; Arrow type coercions match the documented "closest Avro type/logical type" rules
- Schema generation and encoding are consistent (especially for Interval and second-based time/timestamp)
Describe alternatives you've considered
- Require callers to manually cast unsupported Arrow types (e.g. `Int16` to `Int32`, `Float16` to `Float32`, `Time64(nanos)` to `Time64(micros)`) prior to writing Avro.
- Always enable `avro_custom_types` and rely on Arrow-specific annotations everywhere (simpler for Arrow-only pipelines, less interoperable).
- For unsigned types, encode as strings or decimals to avoid range issues (more interoperable but heavier, and not a strict round-trip without custom annotations).
Additional context
Relevant code locations (current behavior):
- `arrow-avro/src/schema.rs` (already maps many of the missing types; also stores Arrow metadata like `arrowTimeUnit`, `arrowIntervalUnit`, `arrowUnionTypeIds`)
- `arrow-avro/src/writer/encoder.rs` (missing encoders for `Int8`/`Int16`/`UInt*`/`Float16`/`Date64`/`Time64(nanos)`; has seconds-to-millis conversion encoders that currently don't align with the schema; interval encoding can currently violate the declared schema for YearMonth/DayTime)