-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
It is farily common to have data with no null values (all valid). However, the current builders such as StringBuilder always use a BooleanBufferBuilder (source) which allocate and manage a BooleanBuffer even if there are no actual nulls
This is an important enough special case that I think @JasonLi-cn added a special StringArrayBuilder in DataFusion which has a special case for no nulls
This is bad both due to the maintenance overhead of a second near identical copy with a bunch of unsafe (leading to potential security issues, such as this and this)
The only difference is that it has an optimization for building a boolean buffer
Describe the solution you'd like
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.
I think doing this optimization in the arrow-rs level would benefit many usecases without nulls
Describe alternatives you've considered
@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx. The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed
In the internal implementation, this looks like:
pub fn push(&mut self, bit: Value) {
// Only NULL values require modifying the bitmap.
if bit == Value::Null {
self.append_null();
}
// Advance the bitmap length.
self.len += 1;
}
/// Consume this [`NullMaskBuilder`] and return the constructed null mask
/// bitmap.
///
/// NOTE: this may have length of 0 if only non-null values were pushed.
pub(super) fn finish(self) -> Vec<u8> {
self.bitmap
}So if only nulls are pushed, the output vector is entirely empty (and thus no allocations occured)
While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:
Basically we could update the append (and related functions) to only advance the buffer when a null was appended and avoid the potential resize here if all the data was only null:
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#88
We would then have to change how append and other functions worked
Additional context