-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Closed
Labels
arrowChanges to the arrow crateChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelogAny new improvement worthy of a entry in the changelog
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, when users what to create an array by using ArrayDataBuilder, they always write code like:
let data_buffer = ...;
let offsets_buffer = ...;
let validity_buffer = ...;
let array_data = ArrayData::builder(datatype)
.len(num_of_element)
.null_bit_buffer(validity_buffer)
.add_buffer(offsets_buffer)
.add_buffer(data_buffer)
.offset(number)
.build()
.unwrap();The code is not user-friendly and easy to cause some runtime errors. Because:
- Users have to remember the order and type of each buffer: https://arrow.apache.org/docs/format/Columnar.html#buffer-listing-for-each-layout
- The array can have at most 3 buffers, but by using the method
add_bufferorbuffers, users can create arrayData with > 3 buffers, which is harmful. - Users may get a wrong answer if they confuse offset buffer and data buffer. For example:
// build a list array [[0, 1], [2], [4]]
let offsets_buffer = [0, 2, 3, 4];
let data_buffer = [0, 1, 2, 4];
// currect way
let array_data = ArrayData::builder(ListArray)
.len(3)
.buffers(vec![offsets_buffer, data_buffer])
...
.build();
// what if we reverse the order of buffers?
// wrong way
let array_data = ArrayData::builder(ListArray)
.len(3)
.buffers(vec![data_buffer, offsets_buffer])
...
.build();
// This will get an array: [[0], [2], [3, 4]]ArrayData::try_new can not detect such errors. So users have to debug to find it.
Describe the solution you'd like
- delete the methods
add_bufferandbuffers - add methods
offsets_buffer,type_ids_bufferanddata_buffer - rename
null_bit_bufferandvalidity_buffer(because it is the name in Apache Arrow Format) - Update the struct of
ArrayDataBuilderlike:
pub struct ArrayDataBuilder {
data_type: DataType,
len: usize,
null_count: Option<usize>,
validity_buffer: Option<Buffer>,
offsets_buffer: Option<Buffer>,
type_ids_buffer: Option<Buffer>,
data_buffer: Option<Buffer>,
offset: usize,
child_data: Vec<ArrayData>,
}we can add some cheap checking in the build method. For example
match (data_type, validity_buffer, type_ids_buffer, offsets_buffer, data_buffer) {
(Binary, Some(buf0), None, Some(buf1), Some(buf2)) => ArrayData::try_new(...),
(DenseUnion, None, Some(buf0), Some(buf1), None) => ArrayData::try_new(...),
...,
_ => Err("wrong layout"),
}This will introduce some public API changes. So I want more code maintainers to review this issue. And I will implement this if we could reach consensus.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
arrowChanges to the arrow crateChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelogAny new improvement worthy of a entry in the changelog