Improve arrow-row --> StringView/BinaryView memory usage 

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Part of https://github.com/apache/arrow-rs/issues/5374

@XiangpengHao implemented an optimized row format --> ByteView (StringView / BinaryView) encoding/decoding in https://github.com/apache/arrow-rs/issues/5945 / https://github.com/apache/arrow-rs/pull/6044

That PR also adds benchmarks, so we can test changes 🎉

However, as mentioned in https://github.com/apache/arrow-rs/pull/6044/files#r1676803119, the output array produced by https://github.com/apache/arrow-rs/pull/6044 contains both short and long strings in its data buffer, even though only the long strings are referenced by the views (the short strings are included to allow fast UTF-8 validation).

This results in more memory being used for the output array than necessary.
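
As a rough illustration of the overhead, the sketch below compares the data buffer size when every value is copied with the size when only values too long to be inlined are copied. It is not code from arrow-rs: the function names are hypothetical, and the only assumption is the 12-byte inline limit of the view layout.

```rust
/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE_LEN: usize = 12;

/// Data buffer size when every value, short or long, is copied into it.
fn buffer_bytes_all(lens: &[usize]) -> usize {
    lens.iter().sum()
}

/// Data buffer size when only values too long to inline are copied.
fn buffer_bytes_long_only(lens: &[usize]) -> usize {
    lens.iter().filter(|&&len| len > MAX_INLINE_LEN).sum()
}
```

For value lengths `[3, 8, 12, 13, 40]` the first function returns 76 and the second 53, so in this example 23 bytes of the retained buffer are never referenced by any view.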

**Describe the solution you'd like**

Reduce the memory required by the output array.


**Describe alternatives you've considered**
One idea is to use a separate UTF-8 validation buffer for short strings, similar to what the Parquet byte view array reader does:

https://github.com/apache/arrow-rs/blob/0002b4ded7cfffbf46c85e2fac0b4f9a545d0f55/parquet/src/arrow/array_reader/byte_view_array.rs#L623-L668

**Additional context**

	let read = if !self.validate_utf8 {
	    self.decoder.read(len, |bytes| {
	        let offset = array_buffer.len();
	        let view = make_view(bytes, buffer_id, offset as u32);
	        if bytes.len() > 12 {
	            // only copy the data to buffer if the string can not be inlined.
	            array_buffer.extend_from_slice(bytes);
	        }

	        // # Safety
	        // The buffer_id is the last buffer in the output buffers
	        // The offset is calculated from the buffer, so it is valid
	        unsafe {
	            output.append_raw_view_unchecked(&view);
	        }
	        Ok(())
	    })?
	} else {
	    // utf8 validation buffer has only short strings. These short
	    // strings are inlined into the views but we copy them into a
	    // contiguous buffer to accelerate validation.
	    let mut utf8_validation_buffer = Vec::with_capacity(4096);

	    let v = self.decoder.read(len, |bytes| {
	        let offset = array_buffer.len();
	        let view = make_view(bytes, buffer_id, offset as u32);
	        if bytes.len() > 12 {
	            // only copy the data to buffer if the string can not be inlined.
	            array_buffer.extend_from_slice(bytes);
	        } else {
	            utf8_validation_buffer.extend_from_slice(bytes);
	        }

	        // # Safety
	        // The buffer_id is the last buffer in the output buffers
	        // The offset is calculated from the buffer, so it is valid
	        // Utf-8 validation is done later
	        unsafe {
	            output.append_raw_view_unchecked(&view);
	        }
	        Ok(())
	    })?;
	    check_valid_utf8(&array_buffer)?;
	    check_valid_utf8(&utf8_validation_buffer)?;
	    v
	};
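
The same idea can be sketched without any arrow-rs types. This is a minimal, std-only illustration under stated assumptions, not the actual arrow-row implementation: `Decoded`, `decode_views` and the `(length, offset)` stand-in for a view are hypothetical names. Long values go to the retained data buffer, short values go to a scratch buffer that is dropped once validation is done. Because the values are validated as concatenated runs, the sketch also rejects any value that starts with a UTF-8 continuation byte, so a multi-byte character split across two values cannot pass.

```rust
/// Values of at most this many bytes are stored inline in the view itself.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical output of the sketch.
struct Decoded {
    /// Stand-in for real views: (length, offset into `data` for long values).
    views: Vec<(u32, Option<u32>)>,
    /// Retained data buffer; holds long values only.
    data: Vec<u8>,
}

fn decode_views(values: &[&[u8]]) -> Result<Decoded, String> {
    let mut data = Vec::new();
    // Short values are collected here only so they can be validated in one
    // pass; this buffer is dropped when the function returns.
    let mut scratch = Vec::new();
    let mut views = Vec::with_capacity(values.len());

    for value in values {
        // A value must not begin in the middle of a multi-byte character.
        if let Some(&first) = value.first() {
            if first & 0xC0 == 0x80 {
                return Err("value starts with a UTF-8 continuation byte".to_string());
            }
        }
        if value.len() > MAX_INLINE_LEN {
            views.push((value.len() as u32, Some(data.len() as u32)));
            data.extend_from_slice(value);
        } else {
            views.push((value.len() as u32, None));
            scratch.extend_from_slice(value);
        }
    }

    // Validate each contiguous run once instead of once per value.
    std::str::from_utf8(&data).map_err(|e| e.to_string())?;
    std::str::from_utf8(&scratch).map_err(|e| e.to_string())?;

    Ok(Decoded { views, data })
}
```

With this shape, the memory kept alive by the output is proportional to the long values only, while short values still get bulk validation at the cost of a temporary allocation.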
