Optimize `BooleanBufferBuilder` for non nullable columns

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

It is farily common to have data with no null values (all valid). However, the current builders such as [`StringBuilder`](https://docs.rs/arrow/latest/arrow/array/type.StringBuilder.html) always use a [BooleanBufferBuilder](https://docs.rs/arrow/latest/arrow/array/struct.BooleanBufferBuilder.html) ([source](https://docs.rs/arrow-array/54.0.0/src/arrow_array/builder/generic_bytes_builder.rs.html#34)) which allocate and manage a `BooleanBuffer` even if there are no actual nulls

This is an important enough special case that I think @JasonLi-cn  added a special `StringArrayBuilder` in DataFusion which has a special case for no nulls

https://github.com/apache/datafusion/blob/63b94c8f9e128b938e81b7e867ce6256a94d67e6/datafusion/functions/src/strings.rs#L62-L68

This is bad both due to the maintenance overhead of a second near identical copy with a bunch of  `unsafe` (leading to potential security issues, such as [this](https://github.com/apache/datafusion/pull/13802) and [this](https://github.com/apache/datafusion/pull/13758))

The only difference is that it has an optimization for building a boolean buffer


**Describe the solution you'd like**
I would like to avoid the code duplication in DataFusion by avoiding the null buffer creation for data without nulls.

I think doing this optimization in the arrow-rs level would benefit many usecases without nulls

**Describe alternatives you've considered**

@domodwyer implemented a strategy for this idea in InfluxDB IOx in a clever and elegant way in InfluxDB IOx.  The high level idea is to always update length but only allocate/resize buffers when a non null (valid) value pushed

In the internal implementation, this looks like:


```rust
    pub fn push(&mut self, bit: Value) {
        // Only NULL values require modifying the bitmap.
        if bit == Value::Null {
            self.append_null();
        }

        // Advance the bitmap length.
        self.len += 1;
    }

    /// Consume this [`NullMaskBuilder`] and return the constructed null mask
    /// bitmap.
    ///
    /// NOTE: this may have length of 0 if only non-null values were pushed.
    pub(super) fn finish(self) -> Vec<u8> {
        self.bitmap
    }
```

So if only nulls are pushed,  the output vector is entirely empty (and thus no allocations occured)

While arrow uses a validation mask (not a null mask) so we would have to invert the logic, I think the same basic idea would apply:

Basically we could update the append (and related functions) to only advance the buffer when a null was appended and avoid the potential resize  here if all the data was only null:
https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#88

We would then have to change how append and other functions worked

https://docs.rs/arrow-buffer/54.0.0/src/arrow_buffer/builder/boolean.rs.html#138


**Additional context**

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize `BooleanBufferBuilder` for non nullable columns #6973

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimize BooleanBufferBuilder for non nullable columns #6973

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Optimize `BooleanBufferBuilder` for non nullable columns #6973