Strongly Typed ArrayData #1799

@tustvold

TLDR

Make the ArrayData layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error-prone.

This is also the enabling feature for easy, zero-cost interoperability between arrow-rs and arrow2; see jorgecarleitao/arrow2#1429
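As a rough illustration of what "pushing offsets down" could mean, here is a minimal sketch in which a bitmap carries its own bit offset and length, so slicing never needs a top-level `offset` field. The `Bitmap` type below is a hypothetical stand-in written for this example, not the arrow-rs type, and its method names are assumptions.

```rust
use std::sync::Arc;

/// Hypothetical bitmap that tracks its own offset and length (both in bits).
#[derive(Debug, Clone)]
pub struct Bitmap {
    bytes: Arc<Vec<u8>>,
    offset: usize,
    len: usize,
}

impl Bitmap {
    /// Pack booleans least-significant-bit first.
    pub fn from_bools(bits: &[bool]) -> Self {
        let mut bytes = vec![0u8; (bits.len() + 7) / 8];
        for (i, &bit) in bits.iter().enumerate() {
            if bit {
                bytes[i / 8] |= 1 << (i % 8);
            }
        }
        Bitmap { bytes: Arc::new(bytes), offset: 0, len: bits.len() }
    }

    /// Number of bits visible through this (possibly sliced) bitmap.
    pub fn len(&self) -> usize {
        self.len
    }

    /// Read bit `i`, relative to this bitmap's own offset.
    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        let bit = self.offset + i;
        (self.bytes[bit / 8] & (1 << (bit % 8))) != 0
    }

    /// Zero-copy slice: shares the bytes and only adjusts offset and length.
    pub fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Bitmap { bytes: Arc::clone(&self.bytes), offset: self.offset + offset, len }
    }
}

fn main() {
    let bitmap = Bitmap::from_bools(&[true, false, true, true, false]);
    let tail = bitmap.slice(2, 3);
    // Callers index from zero; the bitmap applies its own offset internally.
    println!("{} {} {}", tail.get(0), tail.get(1), tail.get(2));
}
```

Because the offset lives inside the bitmap, a consumer cannot forget to apply it, which is the class of error the TLDR refers to.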

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently ArrayData is defined as follows.

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,

    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}

This is simple, but has several drawbacks:

  • It isn't clear which buffers and children are present for a specific layout type
  • There is no clear path to storing BooleanArray values as a Bitmap instead of a Buffer, which would allow removing offset
  • A Vec is allocated to hold only one or two elements (the C++ implementation inlines these)
  • A buffer can accidentally be interpreted incorrectly
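To make the last point concrete, here is a minimal sketch of how index-based access invites misinterpretation. `Buffer`, `from_i32s`, `as_i32s`, and `string_array_buffers` are hypothetical stand-ins invented for this example; they are not the arrow-rs API. The sketch assumes a string-like array whose `buffers[0]` holds `i32` offsets and whose `buffers[1]` holds UTF-8 bytes.

```rust
/// Hypothetical stand-in for an untyped buffer: just raw bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

impl Buffer {
    /// Build a buffer from little-endian i32 values.
    pub fn from_i32s(values: &[i32]) -> Self {
        Buffer(values.iter().flat_map(|v| v.to_le_bytes()).collect())
    }

    /// Reinterpret the bytes as little-endian i32 values.
    /// Nothing checks that the buffer was ever meant to hold i32s.
    pub fn as_i32s(&self) -> Vec<i32> {
        self.0
            .chunks_exact(4)
            .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect()
    }
}

/// Buffers for the strings "hi" and "you": [offsets, UTF-8 values].
pub fn string_array_buffers() -> Vec<Buffer> {
    vec![Buffer::from_i32s(&[0, 2, 5]), Buffer(b"hiyou".to_vec())]
}

fn main() {
    let buffers = string_array_buffers();

    // Correct: index 0 holds the offsets.
    println!("offsets: {:?}", buffers[0].as_i32s());

    // Off by one: index 1 holds string bytes, yet this compiles and runs,
    // silently producing a meaningless "offset".
    println!("garbage: {:?}", buffers[1].as_i32s());
}
```

The type system offers no help here: both positions have the same type, so the mistake only surfaces as wrong data at runtime.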

Describe the solution you'd like

Introduce a new ArrayDataLayout enumeration:

pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    // Single children are boxed: a direct `ArrayData` field would make
    // this mutually recursive type infinitely sized.
    Dictionary { keys: Buffer, values: Box<ArrayData> },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}

pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,

    /// The number of elements in this array data
    len: usize,

    /// The number of null elements in this array data
    null_count: usize,

    /// The offset into this array data, in number of items
    offset: usize,

    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,

    /// The array data layout
    layout: ArrayDataLayout,
}
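A sketch of how consumers might use such an enum: matching on the layout names every buffer and makes the compiler check that all variants are handled. This is a simplified illustration, not the proposed implementation. `Buffer` and `Bitmap` are stand-ins, `ArrayData` is trimmed to two fields, `buffer_bytes` and `example_list` are hypothetical helpers, and single child arrays are boxed here so the recursive type has a known size.

```rust
#[derive(Debug, Clone)]
pub struct Buffer(pub Vec<u8>);

#[derive(Debug, Clone)]
pub struct Bitmap(pub Vec<u8>);

#[derive(Debug, Clone)]
pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    Dictionary { keys: Buffer, values: Box<ArrayData> },
    List { offsets: Buffer, elements: Box<ArrayData> },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}

/// Trimmed-down ArrayData: only the fields this sketch needs.
#[derive(Debug, Clone)]
pub struct ArrayData {
    pub len: usize,
    pub layout: ArrayDataLayout,
}

/// Total bytes held by buffers and bitmaps, including all children.
pub fn buffer_bytes(data: &ArrayData) -> usize {
    match &data.layout {
        ArrayDataLayout::Boolean { values } => values.0.len(),
        ArrayDataLayout::Primitive { values } => values.0.len(),
        ArrayDataLayout::Offsets { offsets, values } => offsets.0.len() + values.0.len(),
        ArrayDataLayout::Dictionary { keys, values } => keys.0.len() + buffer_bytes(values),
        ArrayDataLayout::List { offsets, elements } => offsets.0.len() + buffer_bytes(elements),
        ArrayDataLayout::Struct { children } => {
            children.iter().map(buffer_bytes).sum::<usize>()
        }
        ArrayDataLayout::Union { offsets, types, children } => {
            offsets.as_ref().map_or(0, |b| b.0.len())
                + types.0.len()
                + children.iter().map(buffer_bytes).sum::<usize>()
        }
    }
}

/// A list of two lists over three i32-sized elements (contents zeroed).
pub fn example_list() -> ArrayData {
    let elements = ArrayData {
        len: 3,
        layout: ArrayDataLayout::Primitive { values: Buffer(vec![0u8; 12]) },
    };
    ArrayData {
        len: 2,
        layout: ArrayDataLayout::List {
            offsets: Buffer(vec![0u8; 12]),
            elements: Box::new(elements),
        },
    }
}

fn main() {
    println!("{} bytes", buffer_bytes(&example_list()));
}
```

If a new layout variant were added later, every `match` like the one in `buffer_bytes` would fail to compile until it handled the new case, which is harder to guarantee with positional `Vec<Buffer>` access.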

We could then progressively deprecate the methods that refer to buffers by index, along with other APIs that depend on the positional layout.
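One way such a migration might look, sketched with stand-in types: the index-based accessor stays available but is marked `#[deprecated]` and is derived from the typed layout. The method names `layout()` and `buffers()`, the deprecation note, and the helper `legacy_buffer_count` are assumptions for illustration only, and the enum is cut down to two variants.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

pub enum ArrayDataLayout {
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
}

pub struct ArrayData {
    layout: ArrayDataLayout,
}

impl ArrayData {
    pub fn new(layout: ArrayDataLayout) -> Self {
        ArrayData { layout }
    }

    /// Typed access: callers match on the variant they expect.
    pub fn layout(&self) -> &ArrayDataLayout {
        &self.layout
    }

    /// Legacy positional access, kept only for backwards compatibility.
    #[deprecated(note = "match on `layout()` instead of indexing buffers")]
    pub fn buffers(&self) -> Vec<&Buffer> {
        match &self.layout {
            ArrayDataLayout::Primitive { values } => vec![values],
            ArrayDataLayout::Offsets { offsets, values } => vec![offsets, values],
        }
    }
}

/// Existing callers keep compiling, with a warning unless they opt out.
#[allow(deprecated)]
pub fn legacy_buffer_count(data: &ArrayData) -> usize {
    data.buffers().len()
}

fn main() {
    let data = ArrayData::new(ArrayDataLayout::Offsets {
        offsets: Buffer(vec![0; 8]),
        values: Buffer(b"hi".to_vec()),
    });

    // New style: the variant names say what each buffer means.
    if let ArrayDataLayout::Offsets { offsets, values } = data.layout() {
        println!("{} offset bytes, {} value bytes", offsets.0.len(), values.0.len());
    }

    // Old style still works during the deprecation window.
    println!("{} buffers", legacy_buffer_count(&data));
}
```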

Describe alternatives you've considered

We could choose not to do this.

Additional context

This could be seen as an evolution of @HaoYang670's proposal in #1640.

It also relates to @jhorstmann's proposal on #1499 (comment).

It could also be seen as an interpretation of the arrow2 physical vs logical type separation.

Labels: api-change (changes to the arrow API), enhancement (any new improvement worthy of an entry in the changelog), question (further information is requested)