Allow constructing ByteViewArray from existing blocks #5796
tustvold merged 4 commits into apache:master
| /// Try to append a view of the given `block`, `offset` and `length` | ||
| /// | ||
| /// See [`Self::append_block`] | ||
| pub fn try_append_view(&mut self, block: u32, offset: u32, len: u32) -> Result<(), ArrowError> { |
I don't know what performance impact the validation logic here will have, but we can always add an unchecked version down the line should it become a problem.
It seems we have a filter benchmark but no benchmark for raw array creation speed:
arrow-rs/arrow/src/util/bench_util.rs
Line 141 in 9828bf0
I agree. Let's start like this and then add benchmarks (such as reading from parquet); if they show slowdowns, we can add unchecked versions.
alamb left a comment
This looks like a good API to me.
cc @ariesdevil
| /// | ||
| /// # Append Values | ||
| /// | ||
| /// To avoid bump allocating this builder allocates data in fixed size blocks, configurable |
| /// To avoid bump allocating this builder allocates data in fixed size blocks, configurable | |
| /// To avoid bump allocating, this builder allocates data in fixed size blocks, configurable |
| let mut v = StringViewBuilder::new(); | ||
| assert_eq!(v.append_block(b1), 0); | ||
| v.append_value("This is a very long string that exceeds the inline length"); |
These values are appended to the current block (0), right?
| ] | ||
| ); | ||
| let err = v.try_append_view(0, u32::MAX, 1).unwrap_err(); |
Can you please also add an error test for an invalid block ID? (aka "No block found with index {block}")
Which issue does this PR close?
Relates to #5736
Relates to #5530
Relates to #5735
Rationale for this change
Whilst working on #5736 I struggled to devise a coherent interface for constructing byte views, because views can't really be constructed independently of the data buffers. In particular, small strings need to be inlined in the view, but longer strings need to be added to a data buffer. As a result, any interface that exposes the view abstraction is naturally leaky and quite fiddly to use correctly.
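To make the inlining point concrete, here is a minimal sketch of how a 16-byte view might be packed, assuming the layout from the Arrow columnar format: a 4-byte length, followed either by up to 12 inlined bytes or by a 4-byte prefix, a 4-byte buffer index and a 4-byte offset. The names `make_view`, `is_inline` and `MAX_INLINE_LEN` are hypothetical and are not the arrow-rs API.

```rust
/// Maximum number of bytes that fit inline in a 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Pack a view for `data` (hypothetical helper, not part of arrow-rs).
///
/// Short values are stored inline. Longer values are assumed to already
/// live at `offset` inside data buffer number `block`; only a 4-byte
/// prefix plus that location is stored in the view. The length is
/// assumed to fit in a `u32`.
fn make_view(data: &[u8], block: u32, offset: u32) -> u128 {
    let len = data.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if data.len() <= MAX_INLINE_LEN {
        // Inline case: the value itself follows the length, zero padded.
        bytes[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Out-of-line case: prefix, then buffer index, then offset.
        bytes[4..8].copy_from_slice(&data[..4]);
        bytes[8..12].copy_from_slice(&block.to_le_bytes());
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    u128::from_le_bytes(bytes)
}

/// A view is inline when its length (the low 32 bits) is at most 12.
fn is_inline(view: u128) -> bool {
    (view as u32) as usize <= MAX_INLINE_LEN
}
```

Note that `block` and `offset` are ignored for short values, which is exactly why a view cannot be built correctly without knowing how the data buffers are laid out.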
Fortunately, we already have a builder that abstracts away the view shenanigans, and with some minor tweaks we can extend it to allow using existing buffers. I think this provides a much more coherent abstraction for constructing byte view arrays.
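The shape of that extension can be sketched with a toy builder that uses only the standard library. `ToyViewBuilder` and its `value` method are hypothetical stand-ins. The method names `append_block` and `try_append_view` and the "No block found with index" message mirror the ones discussed in this PR, but the bodies, the `String` error type and the out-of-bounds message are assumptions for illustration rather than the arrow-rs implementation.

```rust
/// Toy stand-in for a view builder (not the arrow-rs implementation).
#[derive(Default)]
struct ToyViewBuilder {
    /// Existing data buffers registered with the builder.
    blocks: Vec<Vec<u8>>,
    /// Recorded views as (block, offset, len) triples.
    views: Vec<(u32, u32, u32)>,
}

impl ToyViewBuilder {
    /// Register an existing buffer and return its block index.
    fn append_block(&mut self, block: Vec<u8>) -> u32 {
        self.blocks.push(block);
        (self.blocks.len() - 1) as u32
    }

    /// Append a view into a registered block, validating it first.
    fn try_append_view(&mut self, block: u32, offset: u32, len: u32) -> Result<(), String> {
        let data = self
            .blocks
            .get(block as usize)
            .ok_or_else(|| format!("No block found with index {}", block))?;
        // Widen to u64 so that offset + len cannot overflow.
        let end = offset as u64 + len as u64;
        if end > data.len() as u64 {
            return Err(format!(
                "Range {}..{} out of bounds for block of length {}",
                offset,
                end,
                data.len()
            ));
        }
        self.views.push((block, offset, len));
        Ok(())
    }

    /// Resolve the `i`-th view back to its bytes.
    fn value(&self, i: usize) -> &[u8] {
        let (block, offset, len) = self.views[i];
        let start = offset as usize;
        &self.blocks[block as usize][start..start + len as usize]
    }
}
```

In this sketch the caller never touches the view encoding: it hands over a buffer, receives a block index, and then refers to ranges within that block. Both checks happen on every call, which is the validation cost raised in the review.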
I think we should still proceed with making the buffer views typed, i.e. #5736, but this change lets that work be a simpler, read-focused abstraction.
What changes are included in this PR?
Are there any user-facing changes?