Skip to content

Add methods pub fn offsets_buffer, pub fn types_ids_bufferand pub fn data_buffer for ArrayDataBuilder #1640

@HaoYang670

Description

@HaoYang670

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, when users what to create an array by using ArrayDataBuilder, they always write code like:

let data_buffer = ...;
let offsets_buffer = ...;
let validity_buffer = ...;
let array_data = ArrayData::builder(datatype)
    .len(num_of_element)
    .null_bit_buffer(validity_buffer)
    .add_buffer(offsets_buffer)
    .add_buffer(data_buffer)
    .offset(number)
    .build()
    .unwrap();

The code is not user-friendly and easy to cause some runtime errors. Because:

  1. Users have to remember the order and type of each buffer: https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
  2. The array can have at most 3 buffers, but by using the method add_buffer or buffers, users can create arrayData with > 3 buffers, which is harmful.
  3. Users may get a wrong answer if they confuse offset buffer and data buffer. For example:
// build a list array [[0, 1], [2], [4]]
let offsets_buffer = [0, 2, 3, 4];
let data_buffer = [0, 1, 2, 4];

// currect way
let array_data = ArrayData::builder(ListArray)
    .len(3)
    .buffers(vec![offsets_buffer, data_buffer])
    ...
    .build();

// what if we reverse the order of buffers?
// wrong way
let array_data = ArrayData::builder(ListArray)
    .len(3)
    .buffers(vec![data_buffer, offsets_buffer])
    ...
    .build();

// This will get an array: [[0], [2], [3, 4]]

ArrayData::try_new can not detect such errors. So users have to debug to find it.

Describe the solution you'd like

  1. delete the methods add_buffer and buffers
  2. add methods offsets_buffer, type_ids_buffer and data_buffer
  3. rename null_bit_buffer and validity_buffer (because it is the name in Apache Arrow Format)
  4. Update the struct of ArrayDataBuilder like:
pub struct ArrayDataBuilder {
    data_type: DataType,
    len: usize,
    null_count: Option<usize>,
    validity_buffer: Option<Buffer>,
    offsets_buffer: Option<Buffer>,
    type_ids_buffer: Option<Buffer>,
    data_buffer: Option<Buffer>,
    offset: usize,
    child_data: Vec<ArrayData>,
}

we can add some cheap checking in the build method. For example

match (data_type, validity_buffer, type_ids_buffer, offsets_buffer, data_buffer) {
    (Binary, Some(buf0), None, Some(buf1), Some(buf2)) => ArrayData::try_new(...),
    (DenseUnion, None, Some(buf0), Some(buf1), None) => ArrayData::try_new(...),
    ...,
    _ => Err("wrong layout"),
}

This will introduce some public API changes. So I want more code maintainers to review this issue. And I will implement this if we could reach consensus.

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions