Description
TLDR
Make the ArrayData layout explicit so that we can eventually push offsets down into the underlying buffers/bitmaps, instead of tracking them as a top-level concept, which has proven to be rather error-prone.
This is also the enabling feature that will support easy and zero-cost interoperability between arrow-rs and arrow2 -- see jorgecarleitao/arrow2#1429
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently ArrayData is defined as follows.
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,
    /// The number of elements in this array data
    len: usize,
    /// The number of null elements in this array data
    null_count: usize,
    /// The offset into this array data, in number of items
    offset: usize,
    /// The buffers for this array data. Note that depending on the array types, this
    /// could hold different kinds of buffers (e.g., value buffer, value offset buffer)
    /// at different positions.
    buffers: Vec<Buffer>,
    /// The child(ren) of this array. Only non-empty for nested types, currently
    /// `ListArray` and `StructArray`.
    child_data: Vec<ArrayData>,
    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
}
This is simple, but has a couple of caveats:
- It isn't clear what is present for specific layout types
- There is no clear path to storing BooleanArray as Bitmap vs Buffer, which would allow removing offset
- Vec allocations for one or two elements (the C++ implementation inlines these)
- There is potential for accidentally interpreting a buffer incorrectly
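To make the last point concrete, here is a minimal sketch of how index-based buffer access can silently misread data. `Buffer`, `UntypedData`, `read_i32` and `string_like` are simplified stand-ins invented for this illustration; they are not the arrow-rs types or API.

```rust
// Stand-in for arrow's Buffer: just raw bytes (an assumption for illustration).
#[derive(Debug, Clone)]
struct Buffer(Vec<u8>);

// Hypothetical, stripped-down version of the index-based layout.
struct UntypedData {
    buffers: Vec<Buffer>,
}

// Interprets buffer `idx` as little-endian i32 values.
// Nothing here checks that the buffer at `idx` actually holds i32s.
fn read_i32(data: &UntypedData, idx: usize) -> Vec<i32> {
    data.buffers[idx]
        .0
        .chunks_exact(4)
        .map(|c| i32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

// Builds a string-like array: buffers[0] = i32 offsets, buffers[1] = UTF-8 bytes.
fn string_like() -> UntypedData {
    let offsets: Vec<u8> = [0i32, 2, 4]
        .iter()
        .flat_map(|o| o.to_le_bytes())
        .collect();
    UntypedData {
        buffers: vec![Buffer(offsets), Buffer(b"abcd".to_vec())],
    }
}
```

Calling `read_i32(&string_like(), 0)` yields the offsets, but `read_i32(&string_like(), 1)` compiles and runs just as happily, reinterpreting the string bytes as a single integer. The type system gives no hint that index 1 is the wrong buffer.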
Describe the solution you'd like
Introduce a new ArrayDataLayout enumeration:
pub enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
    Dictionary { keys: Buffer, values: ArrayData },
    List { offsets: Buffer, elements: ArrayData },
    Struct { children: Vec<ArrayData> },
    Union { offsets: Option<Buffer>, types: Buffer, children: Vec<ArrayData> },
}
pub struct ArrayData {
    /// The data type for this array data
    data_type: DataType,
    /// The number of elements in this array data
    len: usize,
    /// The number of null elements in this array data
    null_count: usize,
    /// The offset into this array data, in number of items
    offset: usize,
    /// The null bitmap. A `None` value for this indicates all values are non-null in
    /// this array.
    null_bitmap: Option<Bitmap>,
    /// The array data layout
    layout: ArrayDataLayout,
}
We could then progressively deprecate the methods that explicitly refer to buffers by index, etc...
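As a rough sketch of what a named accessor might look like in place of buffer-by-index access, consider the following. It uses only a flat subset of the proposed variants, `Buffer` and `Bitmap` are byte-vector stand-ins, and the `offsets` method is hypothetical, not an existing or planned arrow-rs API.

```rust
// Stand-ins for the arrow types (assumptions for illustration only).
#[derive(Debug, Clone, PartialEq)]
struct Buffer(Vec<u8>);
#[derive(Debug, Clone, PartialEq)]
struct Bitmap(Vec<u8>);

// Subset of the proposed enum; nested variants are omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
enum ArrayDataLayout {
    Boolean { values: Bitmap },
    Primitive { values: Buffer },
    Offsets { offsets: Buffer, values: Buffer },
}

impl ArrayDataLayout {
    // Hypothetical accessor: returns the offsets buffer only for layouts
    // that actually have one, instead of trusting a positional index.
    fn offsets(&self) -> Option<&Buffer> {
        match self {
            ArrayDataLayout::Offsets { offsets, .. } => Some(offsets),
            ArrayDataLayout::Boolean { .. } | ArrayDataLayout::Primitive { .. } => None,
        }
    }
}
```

Because the match is exhaustive, adding a new variant later forces every such accessor to be revisited at compile time, which is the kind of check the index-based `buffers: Vec<Buffer>` cannot provide.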
Describe alternatives you've considered
We could not do this
Additional context
This could be seen as an evolution of @HaoYang670's proposal in #1640
It also relates to @jhorstmann's proposal on #1499 (comment)
It could also be seen as an interpretation of the arrow2 physical vs logical type separation.