Skip to content

Optimize BooleanBufferBuilder for non nullable columns #6973

@alamb

Description

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is farily common to have data with no null values (all valid). However, the current builders such as StringBuilder always use a BooleanBufferBuilder (source) which allocate and manage a BooleanBuffer even if there are no actual nulls

This is an important enough special case that I think @JasonLi-cn added a special StringArrayBuilder in DataFusion which has a special case for no nulls

https://github.com/apache/datafusion/blob/63b94c8f9e128b938e81b7e867ce6256a94d67e6/datafusion/functions/src/strings.rs#L62-L68

This is bad both due to the maintenance overhead of a second near identical copy with a bunch of unsafe (leading to potential security issues, such as this and this)

The only difference is that it has an optimization for building a boolean buffer

Describe the solution you'd like
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.

I think doing this optimization in the arrow-rs level would benefit many usecases without nulls

Describe alternatives you've considered

@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx. The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed

In the internal implementation, this looks like:

    pub fn push(&mut self, bit: Value) {
        // Only NULL values require modifying the bitmap.
        if bit == Value::Null {
            self.append_null();
        }

        // Advance the bitmap length.
        self.len += 1;
    }

    /// Consume this [`NullMaskBuilder`] and return the constructed null mask
    /// bitmap.
    ///
    /// NOTE: this may have length of 0 if only non-null values were pushed.
    pub(super) fn finish(self) -> Vec<u8> {
        self.bitmap
    }

So if only nulls are pushed, the output vector is entirely empty (and thus no allocations occured)

While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:

Basically we could update the append (and related functions) to only advance the buffer when a null was appended and avoid the potential resize here if all the data was only null:
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#88

We would then have to change how append and other functions worked

https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#138

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    arrowChanges to the arrow crateenhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions