
[arrow-avro] Add missing Arrow DataType support with avro_custom_types round-trip + non-custom fallbacks #9290

@jecsand838

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

arrow-avro currently cannot encode/decode a number of Arrow DataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).

The goal is:

  • No more ArrowError::NotYetImplemented (or similar) when writing/reading an Arrow RecordBatch containing supported Arrow types, excluding Sparse Unions (will be handled separately).
  • When compiled with feature = "avro_custom_types": Arrow to Avro to Arrow should round-trip the Arrow DataType (including width/signedness/time units and relevant metadata) using Arrow-specific custom logical types following the established arrow.* pattern.
  • When compiled without avro_custom_types: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.

Arrow DataType vs arrow-avro code paths

1) Types mapped in schema.rs but not actually writable (Writer errors)

arrow-avro/src/schema.rs already generates Avro schema for several Arrow types, but arrow-avro/src/writer/encoder.rs doesn’t implement the corresponding encoders, so writing fails.

Examples:

  • DataType::Int8, Int16
  • DataType::UInt8, UInt16, UInt32, UInt64
  • DataType::Float16
  • DataType::Date64 (note: schema.rs maps this to Avro long with the standard logical type local-timestamp-millis)
  • DataType::Time64(TimeUnit::Nanosecond)

Concrete failures observed in code:

  • encoder.rs falls back to:
    • Err(ArrowError::NotYetImplemented(format!("Avro scalar type not yet supported: {other:?}")))
      for many of the above (e.g. Int8/Int16/UInt*/Float16).
  • Date64 is explicitly rejected in the writer with:
    • NotYetImplemented("Avro logical type 'date' is days since epoch (int). Arrow Date64 (ms) has no direct Avro logical type; cast to Date32 or to a Timestamp.")
      This is inconsistent with schema.rs, which currently emits local-timestamp-millis for Date64.
  • Time64(Nanosecond) is explicitly rejected with:
    • NotYetImplemented("Avro writer does not support time-nanos; cast to Time64(Microsecond).")

This shows up immediately in common scenarios (e.g. encoding a RecordBatch with an Int16 column).

2) Schema/encoding mismatches (Writer produces Avro that doesn’t match its own schema)

There are also cases where the schema generated doesn’t match the encoding behavior:

  • Interval(YearMonth) / Interval(DayTime)

    • schema.rs maps these to long with arrowIntervalUnit=...
    • encoder.rs encodes all Arrow Interval variants using an Avro duration fixed(12) writer (DurationEncoder)
    • This is not just a metadata mismatch: Avro long values are zig-zag varint encoded, while Avro fixed(12) is always 12 raw bytes. Writing fixed(12) bytes under a long schema can produce invalid / unreadable Avro and corrupt the stream.
  • Time32(Second) and Timestamp(Second)

    • schema.rs writes these as int/long with arrowTimeUnit="second" (no standard Avro logicalType)
    • encoder.rs converts seconds to milliseconds on write (Time32SecondsToMillisEncoder, TimestampSecondsToMillisEncoder)
    • That yields data in millis while the schema/metadata indicate seconds

These mismatches go beyond a mere failure to round-trip: they can lead to incorrect interpretation of the data by readers (including arrow-avro itself).
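The `long`-vs-`fixed(12)` wire-format mismatch described above can be made concrete with a small sketch (plain Rust, not arrow-avro code; `encode_avro_long` is an illustrative helper, not part of the crate):

```rust
/// Encode an i64 the way Avro encodes `long`: zig-zag, then varint.
fn encode_avro_long(value: i64) -> Vec<u8> {
    let mut z = ((value << 1) ^ (value >> 63)) as u64; // zig-zag mapping
    let mut out = Vec::new();
    loop {
        let byte = (z & 0x7f) as u8;
        z >>= 7;
        if z == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit
    }
    out
}

fn main() {
    // A small interval value occupies a single byte as an Avro `long`...
    assert_eq!(encode_avro_long(3).len(), 1);
    // ...while an Avro `duration` is a fixed(12): always 12 raw bytes.
    let duration_fixed: [u8; 12] = [0; 12];
    assert_ne!(encode_avro_long(3).len(), duration_fixed.len());
    // A reader expecting `long` that receives 12 raw bytes desynchronizes.
}
```

Because the two encodings disagree on length for almost every value, a single Interval column written as `fixed(12)` under a `long` schema can corrupt everything that follows in the block.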

3) Reader does not currently round-trip most Arrow-specific types (even when metadata exists)

The reader currently only maps a small set of Arrow-specific types via avro_custom_types (notably Duration* and RunEndEncoded). Many other Arrow-specific or Arrow-width-specific types decode to the nearest Avro primitive:

  • Avro int always becomes Arrow Int32 (no path back to Int8/Int16/UInt8/UInt16)
  • Avro long always becomes Arrow Int64 or one of the timestamp/time logical types (no path back to UInt32/UInt64)
  • There is no Float16 decode path

Additionally, codec.rs contains a special-case that treats Int64 + arrowTimeUnit == "nanosecond" as TimestampNanos(false). This would conflict with schema.rs’s current representation for Time64(Nanosecond) (which also uses arrowTimeUnit="nanosecond"), i.e. even if writer support were added, the reader-side mapping would likely misinterpret it as a timestamp unless this is corrected.

Describe the solution you'd like

Implement missing Arrow to Avro support in a way that is consistent across:

  • schema generation (arrow-avro/src/schema.rs)
  • encoding (arrow-avro/src/writer/encoder.rs)
  • schema to codec mapping (arrow-avro/src/codec.rs)
  • decoding/builders (arrow-avro/src/reader/record.rs)

...and that is explicitly aligned with avro_custom_types.

A. Add Writer/Reader support for remaining Arrow primitive width/signedness types

At minimum, remove “type-level” NotYetImplemented errors by adding Writer encoders and Reader support for:

  • Int8, Int16
  • UInt8, UInt16, UInt32, UInt64
  • Float16

Proposed mapping behavior:

| Arrow DataType | avro_custom_types OFF (interop-first) | avro_custom_types ON (round-trip) |
|---|---|---|
| Int8 / Int16 | encode as Avro int (widen) | Avro int with logicalType: "arrow.int8" / "arrow.int16" |
| UInt8 / UInt16 | encode as Avro int (as i32) | Avro int with logicalType: "arrow.uint8" / "arrow.uint16" |
| UInt32 | encode as Avro long (as i64) | Avro long with logicalType: "arrow.uint32" |
| UInt64 | encode as Avro long when the value fits in i64, otherwise error (or a documented fallback) | encode with a round-trippable representation (e.g. Avro fixed(8) or bytes) + logicalType: "arrow.uint64" |
| Float16 | encode as Avro float (f32) | round-trip via logicalType: "arrow.float16" + a representation (e.g. store the IEEE-754 f16 bits in fixed(2) or int) |
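Two rows of the table can be sketched in plain Rust (hypothetical helpers for illustration, not arrow-avro API): the "UInt64 as long when it fits" fallback, and the "Float16 bits in fixed(2)" round-trip idea.

```rust
/// avro_custom_types OFF: widen u64 to Avro `long` only when it fits in i64.
fn encode_u64_as_avro_long(value: u64) -> Result<i64, String> {
    i64::try_from(value).map_err(|_| {
        format!("UInt64 value {value} exceeds i64::MAX; enable avro_custom_types or cast")
    })
}

fn main() {
    assert_eq!(encode_u64_as_avro_long(42), Ok(42));
    assert!(encode_u64_as_avro_long(u64::MAX).is_err());

    // avro_custom_types ON idea for Float16: persist the raw 16-bit
    // pattern in a fixed(2), which round-trips losslessly.
    let f16_bits: u16 = 0x3C00; // 1.0 in IEEE-754 binary16
    let fixed2 = f16_bits.to_be_bytes();
    assert_eq!(u16::from_be_bytes(fixed2), f16_bits);
}
```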

B. Complete date/time/timestamp support (and align schema + encoding)

Implement:

  • Date64 (currently schema.rs emits long + local-timestamp-millis, but the writer errors)
  • Time64(Nanosecond) (schema exists but writer errors; reader must not confuse with timestamps)
  • Align Time32(Second) / Timestamp(Second) schema + encoding

Proposed behavior:

| Arrow DataType | avro_custom_types OFF | avro_custom_types ON |
|---|---|---|
| Date64 | encode as long + logicalType: "local-timestamp-millis" (this is what schema.rs already emits today; implement the writer accordingly) | long + logicalType: "arrow.date64" (ms since epoch, round-trips to Date64) |
| Time32(Second) | encode as time-millis (int), scaling seconds to millis | int + logicalType: "arrow.time32-second", storing raw seconds |
| Time64(Nanosecond) | encode as time-micros (long), scaling nanos to micros (document/truncate or require divisibility by 1000) | long + logicalType: "arrow.time64-nanosecond", storing raw nanos |
| Timestamp(Second, tz?) | encode as timestamp-millis / local-timestamp-millis, scaling seconds to millis | long + logicalType: "arrow.timestamp-second" (preserving tz/local-ness as needed) |
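The unit conversions in the OFF column could look like the following sketch (hypothetical helpers, not arrow-avro API): seconds widen to millis with overflow checking, and nanos narrow to micros only when exactly divisible.

```rust
/// Scale seconds to milliseconds, erroring on i64 overflow.
fn timestamp_seconds_to_millis(secs: i64) -> Result<i64, String> {
    secs.checked_mul(1_000)
        .ok_or_else(|| format!("timestamp {secs}s overflows i64 milliseconds"))
}

/// Narrow nanos to micros; this variant rejects sub-microsecond values
/// rather than truncating (the alternative discussed in the table).
fn time_nanos_to_micros(nanos: i64) -> Result<i64, String> {
    if nanos % 1_000 != 0 {
        return Err(format!("{nanos}ns is not representable as whole micros"));
    }
    Ok(nanos / 1_000)
}

fn main() {
    assert_eq!(timestamp_seconds_to_millis(90), Ok(90_000));
    assert!(timestamp_seconds_to_millis(i64::MAX).is_err());
    assert_eq!(time_nanos_to_micros(1_500_000), Ok(1_500));
    assert!(time_nanos_to_micros(1_500).is_err());
}
```

Whether Time64(Nanosecond) should truncate or reject non-divisible values is exactly the behavior the proposal asks to document.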

Notes:

  • Arrow’s Date64 spec treats values as “days in milliseconds” (evenly divisible by 86_400_000). arrow-rs does not enforce this at runtime, so the writer should either document behavior for non-conforming values or optionally validate when writing.
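The optional validation mentioned in the note could be a one-line divisibility check (a hypothetical helper, shown only to pin down the rule):

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Validate that a Date64 value lands exactly on a day boundary,
/// returning the equivalent days-since-epoch (what Avro `date` stores).
fn validate_date64(ms: i64) -> Result<i64, String> {
    if ms % MILLIS_PER_DAY != 0 {
        return Err(format!("Date64 value {ms} is not a whole number of days"));
    }
    Ok(ms / MILLIS_PER_DAY)
}

fn main() {
    assert_eq!(validate_date64(86_400_000), Ok(1));
    assert!(validate_date64(86_400_001).is_err());
}
```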

C. Implement Interval handling and define custom logical types for round-trip

Current Interval behavior is not consistent between schema and encoding for YearMonth/DayTime, and Avro’s standard duration can’t represent Arrow’s full interval semantics (negative values; MonthDayNano nanos).

Proposed approach:

  • With avro_custom_types ON: implement custom logical types for each Arrow interval unit that can round-trip (including negative and nanos):

    • arrow.interval-year-month
    • arrow.interval-day-time
    • arrow.interval-month-day-nano (should preserve i32 months, i32 days, i64 nanos)
  • With avro_custom_types OFF: encode to the closest standard Avro representation:

    • Prefer Avro duration where feasible (and document constraints: unsigned months/days/millis; millis precision)
    • For values not representable (negative, sub-millis nanos), return a clear error suggesting enabling avro_custom_types or casting
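The OFF-path constraints fall out of Avro's `duration` layout: a fixed(12) holding three little-endian unsigned 32-bit integers (months, days, milliseconds). A sketch of the fallback with the proposed error cases (`interval_to_avro_duration` is a hypothetical helper, not arrow-avro API):

```rust
/// Encode an Arrow MonthDayNano interval as an Avro `duration` fixed(12),
/// rejecting values the standard logical type cannot represent.
fn interval_to_avro_duration(
    months: i32,
    days: i32,
    nanos: i64,
) -> Result<[u8; 12], String> {
    if months < 0 || days < 0 || nanos < 0 {
        return Err("negative interval: enable avro_custom_types or cast".into());
    }
    if nanos % 1_000_000 != 0 {
        return Err("sub-millisecond nanos not representable in Avro duration".into());
    }
    let millis = u32::try_from(nanos / 1_000_000)
        .map_err(|_| "interval millis overflow u32".to_string())?;
    let mut fixed = [0u8; 12];
    fixed[0..4].copy_from_slice(&(months as u32).to_le_bytes());
    fixed[4..8].copy_from_slice(&(days as u32).to_le_bytes());
    fixed[8..12].copy_from_slice(&millis.to_le_bytes());
    Ok(fixed)
}

fn main() {
    let d = interval_to_avro_duration(1, 2, 3_000_000).unwrap();
    assert_eq!(&d[0..4], &1u32.to_le_bytes());
    assert!(interval_to_avro_duration(-1, 0, 0).is_err());
}
```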

D. Add coverage tests (type-level and round-trip)

Add tests ensuring:

  1. Writer can encode RecordBatch for each Arrow type above (no NotYetImplemented)
  2. With avro_custom_types ON: Arrow to Avro to Arrow schema equality for these types (excluding Sparse union)
  3. With avro_custom_types OFF: Arrow to Avro does not error; Arrow type coercions match documented “closest Avro type/logical type” rules
  4. Ensure schema generation and encoding are consistent (especially for Interval and second-based time/timestamp)

Describe alternatives you've considered

  • Require callers to manually cast unsupported Arrow types (e.g. Int16 to Int32, Float16 to Float32, Time64(nanos) to Time64(micros)) prior to writing Avro.
  • Always enable avro_custom_types and rely on Arrow-specific annotations everywhere (simpler for Arrow-only pipelines, less interoperable).
  • For unsigned types, encode as strings or decimals to avoid range issues (more interoperable but heavier, and not a strict round-trip without custom annotations).

Additional context

Relevant code locations (current behavior):

  • arrow-avro/src/schema.rs (already maps many of the missing types; also stores Arrow metadata like arrowTimeUnit, arrowIntervalUnit, arrowUnionTypeIds)
  • arrow-avro/src/writer/encoder.rs (missing encoders for Int8/Int16/UInt*/Float16/Date64/Time64(nanos); has conversion encoders for seconds to millis that currently don’t align with schema; interval encoding currently can violate the declared schema for YearMonth/DayTime)
