Strongly Typed ArrayData #1799

**TLDR**

Make the `ArrayData` layout explicit so that we can eventually push offsets down into the underlying buffers and bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error-prone.

This is also the enabling feature that will support easy, zero-cost interoperability between arrow-rs and arrow2; see https://github.com/jorgecarleitao/arrow2/issues/1429

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

Currently `ArrayData` is defined as follows.

```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,

    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}
```

This is simple, but has several drawbacks:

* It isn't clear which buffers and children are present for a given layout type
* There is no clear path to storing `BooleanArray` values as a `Bitmap` rather than a `Buffer`, which would allow removing `offset`
* It requires `Vec` allocations even for one or two elements (the C++ implementation inlines these)
* There is potential for accidentally interpreting a buffer incorrectly

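To make the last point concrete, here is a minimal sketch of how positional buffers can be misread. It does not use the arrow-rs API: `RawBuffer`, `UntypedData`, and the helper functions are hypothetical stand-ins that only mimic the "list of untyped buffers" shape described above.

```rust
/// Hypothetical stand-in for an untyped byte buffer (not the arrow-rs `Buffer`).
#[derive(Debug, Clone)]
pub struct RawBuffer(pub Vec<u8>);

impl RawBuffer {
    /// Pack `i32` values as little-endian bytes.
    pub fn from_i32(values: &[i32]) -> Self {
        RawBuffer(values.iter().flat_map(|v| v.to_le_bytes()).collect())
    }

    /// Reinterpret the bytes as little-endian `i32` values.
    pub fn as_i32(&self) -> Vec<i32> {
        self.0
            .chunks_exact(4)
            .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect()
    }
}

/// Untyped layout: what each buffer means depends only on its position
/// and on a data type that is tracked elsewhere.
pub struct UntypedData {
    pub buffers: Vec<RawBuffer>,
}

/// Written with primitive arrays in mind, but nothing in the types stops
/// a caller from passing string data.
pub fn first_buffer_as_i32(data: &UntypedData) -> Vec<i32> {
    data.buffers[0].as_i32()
}

/// Primitive data: buffers[0] holds the values.
pub fn int_data() -> UntypedData {
    UntypedData { buffers: vec![RawBuffer::from_i32(&[10, 20, 30])] }
}

/// String data for ["ab", "cde"]: buffers[0] holds offsets, buffers[1] the bytes.
pub fn string_data() -> UntypedData {
    UntypedData {
        buffers: vec![RawBuffer::from_i32(&[0, 2, 5]), RawBuffer(b"abcde".to_vec())],
    }
}
```

Calling `first_buffer_as_i32(&string_data())` compiles and silently returns the offsets `[0, 2, 5]` as if they were values, which is the kind of mistake an explicit layout would turn into a compile-time or match-time error.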
**Describe the solution you'd like**

Introduce a new `ArrayDataLayout` enumeration, and replace the `buffers` and `child_data` fields of `ArrayData` with a single `layout` field:

```
pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    // Single children are boxed because `ArrayData` itself contains an
    // `ArrayDataLayout`; an unboxed field would make the type infinitely sized.
    Dictionary { keys: Buffer, values: Box<ArrayData> },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}
```

```
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout
}
```
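As a rough illustration of what the explicit layout buys, the sketch below walks a nested array by matching on the layout instead of indexing into `buffers` and `child_data`. The `Buffer` and `Bitmap` types here are simplified local stand-ins rather than the arrow-rs types, only a subset of the variants is shown, the `Box` around the list child is an assumption needed to make the recursive type compile, and `children` and `buffer_bytes` are hypothetical helpers.

```rust
/// Simplified stand-ins for the real buffer and bitmap types.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);
#[derive(Debug, Clone, PartialEq)]
pub struct Bitmap(pub Vec<u8>);

/// A subset of the proposed layout variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct ArrayData {
    pub len: usize,
    pub layout: ArrayDataLayout,
}

/// Child arrays of a node; the compiler forces every variant to be handled.
pub fn children(data: &ArrayData) -> Vec<&ArrayData> {
    match &data.layout {
        ArrayDataLayout::Boolean { .. }
        | ArrayDataLayout::Primitive { .. }
        | ArrayDataLayout::Offsets { .. } => Vec::new(),
        ArrayDataLayout::List { elements, .. } => vec![&**elements],
        ArrayDataLayout::Struct { children } => children.iter().collect(),
    }
}

/// Total bytes held in buffers and bitmaps, including nested children.
pub fn buffer_bytes(data: &ArrayData) -> usize {
    let own = match &data.layout {
        ArrayDataLayout::Boolean { values } => values.0.len(),
        ArrayDataLayout::Primitive { values } => values.0.len(),
        ArrayDataLayout::Offsets { offsets, values } => offsets.0.len() + values.0.len(),
        ArrayDataLayout::List { offsets, .. } => offsets.0.len(),
        ArrayDataLayout::Struct { .. } => 0,
    };
    own + children(data).into_iter().map(buffer_bytes).sum::<usize>()
}
```

Because each variant names its fields, there is no way to ask a primitive array for its offsets buffer, and adding a new layout variant would make every non-exhaustive `match` fail to compile.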

We could then progressively deprecate the methods that refer to buffers and children by index.
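One possible shape for that migration, sketched under assumptions: the positional accessor is kept but marked `#[deprecated]` and rebuilt from the layout, while a typed accessor is offered alongside it. The method names `offsets` and `buffers`, the helper `legacy_first_buffer`, and the two-variant enum are illustrative only and are not taken from arrow-rs.

```rust
/// Simplified stand-in for the real buffer type.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

/// Two variants are enough to show the idea.
pub enum ArrayDataLayout {
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
}

pub struct ArrayData {
    pub layout: ArrayDataLayout,
}

impl ArrayData {
    /// Typed accessor: `None` when the layout has no offsets buffer.
    pub fn offsets(&self) -> Option<&Buffer> {
        match &self.layout {
            ArrayDataLayout::Offsets { offsets, .. } => Some(offsets),
            ArrayDataLayout::Primitive { .. } => None,
        }
    }

    /// Legacy positional view, rebuilt from the layout so existing callers
    /// keep working while they migrate.
    #[deprecated(note = "match on the layout instead of indexing buffers")]
    pub fn buffers(&self) -> Vec<&Buffer> {
        match &self.layout {
            ArrayDataLayout::Primitive { values } => vec![values],
            ArrayDataLayout::Offsets { offsets, values } => vec![offsets, values],
        }
    }
}

/// Example of old-style calling code that has not migrated yet.
#[allow(deprecated)]
pub fn legacy_first_buffer(data: &ArrayData) -> &Buffer {
    data.buffers()[0]
}
```

Callers of `buffers()` would see a deprecation warning pointing them at the typed accessors, and the positional method could be removed once downstream code no longer depends on it.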

**Describe alternatives you've considered**

We could leave `ArrayData` as it is and not do this.

**Additional context**

This could be seen as an evolution of @HaoYang670's proposal in https://github.com/apache/arrow-rs/issues/1640

It also relates to @jhorstmann's proposal in https://github.com/apache/arrow-rs/pull/1499#issuecomment-1096878229

It could also be seen as an interpretation of the arrow2 physical vs logical type separation.
