Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
arrow-avro currently cannot encode/decode a number of Arrow DataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).
The goal is:
- No more `ArrowError::NotYetImplemented` (or similar) when writing/reading an Arrow `RecordBatch` containing supported Arrow types, excluding Sparse Unions (which will be handled separately).
- When compiled with `feature = "avro_custom_types"`: Arrow to Avro to Arrow should round-trip the Arrow `DataType` (including width/signedness/time units and relevant metadata), using Arrow-specific custom logical types that follow the established `arrow.*` pattern.
- When compiled without `avro_custom_types`: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.
Arrow DataType vs arrow-avro code paths
1) Types mapped in schema.rs but not actually writable (Writer errors)
arrow-avro/src/schema.rs already generates Avro schema for several Arrow types, but arrow-avro/src/writer/encoder.rs doesn’t implement the corresponding encoders, so writing fails.
Examples:
- `DataType::Int8`, `Int16`
- `DataType::UInt8`, `UInt16`, `UInt32`, `UInt64`
- `DataType::Float16`
- `DataType::Date64` (note: `schema.rs` maps this to Avro `long` with the standard logical type `local-timestamp-millis`)
- `DataType::Time64(TimeUnit::Nanosecond)`
Concrete failures observed in code:
- `encoder.rs` falls back to `Err(ArrowError::NotYetImplemented(format!("Avro scalar type not yet supported: {other:?}")))` for many of the above (e.g. `Int8`/`Int16`/`UInt*`/`Float16`).
- `Date64` is explicitly rejected in the writer with: `NotYetImplemented("Avro logical type 'date' is days since epoch (int). Arrow Date64 (ms) has no direct Avro logical type; cast to Date32 or to a Timestamp.")` This is inconsistent with `schema.rs`, which currently emits `local-timestamp-millis` for `Date64`.
- `Time64(Nanosecond)` is explicitly rejected with: `NotYetImplemented("Avro writer does not support time-nanos; cast to Time64(Microsecond).")`
This shows up immediately in common scenarios (e.g. encoding a RecordBatch with an Int16 column).
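As background for the fixes proposed below, the Avro wire format makes the narrower integer types straightforward to support: Avro `int`/`long` values are zig-zag encoded and written as variable-length varints, so an `Int16` (or `Int8`/`UInt16`) encoder can simply widen the value and reuse one routine. A minimal standalone sketch (hypothetical helper names, not arrow-avro internals):

```rust
/// Zig-zag encode a value and write it as an Avro varint.
/// Zig-zag interleaves negative and positive values so small
/// magnitudes stay small on the wire.
fn zigzag_varint(value: i64, out: &mut Vec<u8>) {
    let mut z = ((value << 1) ^ (value >> 63)) as u64;
    loop {
        let byte = (z & 0x7f) as u8;
        z >>= 7;
        if z == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

/// An Int16 column value widens losslessly into Avro `int`.
fn encode_int16_as_avro_int(v: i16, out: &mut Vec<u8>) {
    zigzag_varint(v as i64, out);
}

fn main() {
    let mut buf = Vec::new();
    encode_int16_as_avro_int(-1, &mut buf);
    assert_eq!(buf, vec![0x01]); // zig-zag(-1) = 1
    buf.clear();
    encode_int16_as_avro_int(300, &mut buf);
    assert_eq!(buf, vec![0xd8, 0x04]); // zig-zag(300) = 600
}
```

The same widening path would serve `Int8`, `UInt8`, and `UInt16`; only the cast at the call site differs.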
2) Schema/encoding mismatches (Writer produces Avro that doesn’t match its own schema)
There are also cases where the schema generated doesn’t match the encoding behavior:
- Interval(YearMonth) / Interval(DayTime)
  - `schema.rs` maps these to `long` with `arrowIntervalUnit=...`
  - `encoder.rs` encodes all Arrow `Interval` variants using an Avro `duration` `fixed(12)` writer (`DurationEncoder`)
  - This is not just a metadata mismatch: Avro `long` values are zig-zag varint encoded, while Avro `fixed(12)` is always 12 raw bytes. Writing `fixed(12)` bytes under a `long` schema can produce invalid / unreadable Avro and corrupt the stream.
- Time32(Second) and Timestamp(Second)
  - `schema.rs` writes these as `int`/`long` with `arrowTimeUnit="second"` (no standard Avro `logicalType`)
  - `encoder.rs` converts seconds to milliseconds on write (`Time32SecondsToMillisEncoder`, `TimestampSecondsToMillisEncoder`)
  - That yields data in millis while the schema/metadata indicate seconds

These mismatches are worse than a mere failure to round-trip: they can lead to incorrect interpretation of the data by readers (including arrow-avro itself).
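The severity of the Interval mismatch can be demonstrated in isolation. The sketch below (hypothetical helpers, not arrow-avro code) builds a standard Avro `duration` `fixed(12)` payload and then reads it the way a decoder honoring the declared `long` schema would:

```rust
/// Standard Avro `duration`: a fixed(12) of three little-endian u32s
/// (months, days, millis).
fn duration_fixed12(months: u32, days: u32, millis: u32) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    out
}

/// Decode a zig-zag varint `long`, returning (value, bytes consumed),
/// as a reader trusting the declared `long` schema would.
fn read_long(buf: &[u8]) -> (i64, usize) {
    let (mut z, mut shift, mut n) = (0u64, 0u32, 0usize);
    for &b in buf {
        z |= ((b & 0x7f) as u64) << shift;
        n += 1;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (((z >> 1) as i64) ^ -((z & 1) as i64), n)
}

fn main() {
    let fixed = duration_fixed12(1, 2, 3);
    // The reader consumes only part of the 12 bytes and yields a
    // meaningless value; the leftover bytes then corrupt the stream.
    let (value, consumed) = read_long(&fixed);
    assert_eq!(value, -1);
    assert_eq!(consumed, 1);
}
```

Every record after the first misread field is then decoded from a shifted offset, which is why this is stream corruption rather than a single bad value.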
3) Reader does not currently round-trip most Arrow-specific types (even when metadata exists)
The reader currently only maps a small set of Arrow-specific types via `avro_custom_types` (notably `Duration*` and `RunEndEncoded`). Many other Arrow-specific or Arrow-width-specific types decode to the nearest Avro primitive:
- Avro `int` always becomes Arrow `Int32` (no path back to `Int8`/`Int16`/`UInt8`/`UInt16`)
- Avro `long` always becomes Arrow `Int64` or one of the timestamp/time logical types (no path back to `UInt32`/`UInt64`)
- There is no `Float16` decode path
Additionally, `codec.rs` contains a special case that treats `Int64` + `arrowTimeUnit == "nanosecond"` as `TimestampNanos(false)`. This conflicts with `schema.rs`'s current representation for `Time64(Nanosecond)` (which also uses `arrowTimeUnit="nanosecond"`): even if writer support were added, the reader-side mapping would likely misinterpret the value as a timestamp unless this is corrected.
Describe the solution you'd like
Implement missing Arrow to Avro support in a way that is consistent across:
- schema generation (`arrow-avro/src/schema.rs`)
- encoding (`arrow-avro/src/writer/encoder.rs`)
- schema to codec mapping (`arrow-avro/src/codec.rs`)
- decoding/builders (`arrow-avro/src/reader/record.rs`)

...and that is explicitly aligned with `avro_custom_types`.
A. Add Writer/Reader support for remaining Arrow primitive width/signedness types
At minimum, remove “type-level” NotYetImplemented errors by adding Writer encoders and Reader support for:
`Int8`, `Int16`, `UInt8`, `UInt16`, `UInt32`, `UInt64`, `Float16`
Proposed mapping behavior:
| Arrow DataType | `avro_custom_types` OFF (interop-first) | `avro_custom_types` ON (round-trip) |
|---|---|---|
| `Int8` / `Int16` | encode as Avro `int` (widen) | Avro `int` with `logicalType: "arrow.int8"` / `"arrow.int16"` |
| `UInt8` / `UInt16` | encode as Avro `int` (as `i32`) | Avro `int` with `logicalType: "arrow.uint8"` / `"arrow.uint16"` |
| `UInt32` | encode as Avro `long` (as `i64`) | Avro `long` with `logicalType: "arrow.uint32"` |
| `UInt64` | encode as Avro `long` when the value fits in `i64`, otherwise error (or documented fallback) | encode with a round-trippable representation (e.g. Avro `fixed(8)` or `bytes`) + `logicalType: "arrow.uint64"` |
| `Float16` | encode as Avro `float` (`f32`) | round-trip via `logicalType: "arrow.float16"` + a representation (e.g. store the IEEE-754 `f16` bits in `fixed(2)` or `int`) |
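To illustrate the `UInt64` row, here is a hedged sketch of both columns: the lossy `long` widening for the OFF case, and one possible round-trippable `fixed(8)` representation for the ON case. The big-endian byte layout is an assumption for illustration, not a decided format:

```rust
/// Hypothetical `arrow.uint64` ON-path: store the raw u64 as 8
/// big-endian bytes in an Avro fixed(8). Round-trips the full range
/// with no i64 overflow concern. (Illustrative layout only.)
fn encode_uint64_fixed8(v: u64) -> [u8; 8] {
    v.to_be_bytes()
}

fn decode_uint64_fixed8(bytes: [u8; 8]) -> u64 {
    u64::from_be_bytes(bytes)
}

/// OFF-path: widen to Avro `long` only when the value fits in i64,
/// otherwise surface a clear error rather than silently wrapping.
fn encode_uint64_as_long(v: u64) -> Result<i64, String> {
    i64::try_from(v).map_err(|_| format!("{v} does not fit in Avro long"))
}

fn main() {
    // ON: full-range round-trip.
    assert_eq!(decode_uint64_fixed8(encode_uint64_fixed8(u64::MAX)), u64::MAX);
    // OFF: values above i64::MAX must error (or use a documented fallback).
    assert!(encode_uint64_as_long(u64::MAX).is_err());
    assert_eq!(encode_uint64_as_long(42).unwrap(), 42);
}
```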
B. Complete date/time/timestamp support (and align schema + encoding)
Implement:
- `Date64` (currently `schema.rs` emits `long` + `local-timestamp-millis`, but the writer errors)
- `Time64(Nanosecond)` (schema exists but the writer errors; the reader must not confuse it with timestamps)
- Align `Time32(Second)` / `Timestamp(Second)` schema + encoding
Proposed behavior:
| Arrow DataType | `avro_custom_types` OFF | `avro_custom_types` ON |
|---|---|---|
| `Date64` | encode as `long` + `logicalType: "local-timestamp-millis"` (this is what `schema.rs` already emits today; implement the writer accordingly) | `long` + `logicalType: "arrow.date64"` (ms since epoch, round-trips to `Date64`) |
| `Time32(Second)` | encode as `time-millis` (`int`) by scaling seconds to millis | `int` + `logicalType: "arrow.time32-second"`, storing raw seconds |
| `Time64(Nanosecond)` | encode as `time-micros` (`long`), scaling nanos to micros (document/truncate, or require divisibility by 1000) | `long` + `logicalType: "arrow.time64-nanosecond"`, storing raw nanos |
| `Timestamp(Second, tz?)` | encode as `timestamp-millis` / `local-timestamp-millis`, scaling seconds to millis | `long` + `logicalType: "arrow.timestamp-second"` (preserving tz/local-ness as needed) |
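The seconds-to-millis scaling in the OFF column can be sketched as follows. The overflow check is this sketch's addition, not necessarily what the existing `Time32SecondsToMillisEncoder` / `TimestampSecondsToMillisEncoder` do today:

```rust
/// Scale a second-resolution timestamp to milliseconds, failing
/// loudly near the i64 range edges instead of wrapping.
fn seconds_to_millis(secs: i64) -> Result<i64, String> {
    secs.checked_mul(1_000)
        .ok_or_else(|| format!("timestamp {secs}s overflows i64 milliseconds"))
}

fn main() {
    assert_eq!(seconds_to_millis(12).unwrap(), 12_000);
    // Extreme values cannot be represented in millis without overflow.
    assert!(seconds_to_millis(i64::MAX).is_err());
}
```

The key point of section 2 stands regardless of the check: whichever scaling is chosen, the emitted schema metadata must describe the scaled unit, not the original one.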
Notes:
- Arrow's `Date64` spec treats values as "days in milliseconds" (evenly divisible by `86_400_000`). `arrow-rs` does not enforce this at runtime, so the writer should either document its behavior for non-conforming values or optionally validate when writing.
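The optional validation mentioned in the note could look like this (a sketch of the suggestion, not existing arrow-avro behavior):

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

/// Reject Date64 values that are not a whole number of days expressed
/// in milliseconds, per the Arrow columnar format spec.
fn validate_date64(ms: i64) -> Result<i64, String> {
    if ms % MILLIS_PER_DAY == 0 {
        Ok(ms)
    } else {
        Err(format!("Date64 value {ms} is not a whole number of days in ms"))
    }
}

fn main() {
    assert!(validate_date64(2 * MILLIS_PER_DAY).is_ok());
    assert!(validate_date64(123).is_err());
}
```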
C. Implement Interval handling and define custom logical types for round-trip
Current Interval behavior is not consistent between schema and encoding for YearMonth/DayTime, and Avro’s standard duration can’t represent Arrow’s full interval semantics (negative values; MonthDayNano nanos).
Proposed approach:
- With `avro_custom_types` ON: implement custom logical types for each Arrow interval unit that can round-trip (including negative values and nanos):
  - `arrow.interval-year-month`
  - `arrow.interval-day-time`
  - `arrow.interval-month-day-nano` (should preserve `i32` months, `i32` days, `i64` nanos)
- With `avro_custom_types` OFF: encode to the closest standard Avro representation:
  - Prefer Avro `duration` where feasible (and document the constraints: unsigned months/days/millis; millisecond precision)
  - For values not representable (negative, sub-millisecond nanos), return a clear error suggesting enabling `avro_custom_types` or casting
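One possible byte layout for the proposed `arrow.interval-month-day-nano` logical type is a `fixed(16)` of little-endian `i32` months, `i32` days, `i64` nanos. The layout below is illustrative only; the actual representation would be decided during implementation:

```rust
/// Hypothetical fixed(16) layout: i32 months | i32 days | i64 nanos,
/// all little-endian. Unlike the standard Avro `duration` (unsigned
/// months/days/millis), this preserves sign and nanosecond precision.
fn encode_month_day_nano(months: i32, days: i32, nanos: i64) -> [u8; 16] {
    let mut out = [0u8; 16];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..16].copy_from_slice(&nanos.to_le_bytes());
    out
}

fn decode_month_day_nano(b: [u8; 16]) -> (i32, i32, i64) {
    (
        i32::from_le_bytes(b[0..4].try_into().unwrap()),
        i32::from_le_bytes(b[4..8].try_into().unwrap()),
        i64::from_le_bytes(b[8..16].try_into().unwrap()),
    )
}

fn main() {
    // Negative components and sub-millisecond nanos round-trip, which
    // standard Avro `duration` cannot represent.
    let enc = encode_month_day_nano(-1, 15, 1_500);
    assert_eq!(decode_month_day_nano(enc), (-1, 15, 1_500));
}
```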
E. Add coverage tests (type-level and round-trip)
Add tests ensuring:
- Writer can encode a `RecordBatch` for each Arrow type above (no `NotYetImplemented`)
- With `avro_custom_types` ON: Arrow to Avro to Arrow schema equality for these types (excluding Sparse union)
- With `avro_custom_types` OFF: Arrow to Avro does not error; Arrow type coercions match the documented "closest Avro type/logical type" rules
- Schema generation and encoding are consistent (especially for Interval and second-based time/timestamp)
Describe alternatives you've considered
- Require callers to manually cast unsupported Arrow types (e.g. `Int16` to `Int32`, `Float16` to `Float32`, `Time64(nanos)` to `Time64(micros)`) prior to writing Avro.
- Always enable `avro_custom_types` and rely on Arrow-specific annotations everywhere (simpler for Arrow-only pipelines, less interoperable).
- For unsigned types, encode as strings or decimals to avoid range issues (more interoperable but heavier, and not a strict round-trip without custom annotations).
Additional context
Relevant code locations (current behavior):
- `arrow-avro/src/schema.rs` (already maps many of the missing types; also stores Arrow metadata like `arrowTimeUnit`, `arrowIntervalUnit`, `arrowUnionTypeIds`)
- `arrow-avro/src/writer/encoder.rs` (missing encoders for `Int8`/`Int16`/`UInt*`/`Float16`/`Date64`/`Time64(nanos)`; has seconds-to-millis conversion encoders that currently don't align with the schema; interval encoding can currently violate the declared schema for YearMonth/DayTime)