Add StringViewArray and BinaryViewArray (#4253) by tustvold · Pull Request #4585 · apache/arrow-rs

tustvold · 2023-07-29T21:07:32Z

Draft as not yet standardised and needs a LOT more testing

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

alamb · 2023-07-30T10:06:24Z

Related mailing list discussion: https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt

alamb · 2023-07-30T10:08:46Z

arrow-data/src/view.rs

The "small string optimization" in rust !!! -- I had wondered what this would look like

bkietz

This looks good to me, I have just a few suggestions

arrow-array/src/array/byte_view_array.rs

bkietz · 2023-09-06T17:45:15Z

arrow-data/src/transform/mod.rs

+    /// View data buffers
+    /// This is not stored in `MutableArrayData` because these values constant and only needed
+    /// at the end, when freezing [_MutableArrayData].
+    view_buffers: Vec<Buffer>,


I'd recommend renaming this to avoid the suggestion that these buffers hold views

Suggested change

-    /// View data buffers
-    /// This is not stored in `MutableArrayData` because these values constant and only needed
-    /// at the end, when freezing [_MutableArrayData].
-    view_buffers: Vec<Buffer>,
+    /// Variadic data buffers referenced by views
+    /// This is not stored in `MutableArrayData` because these values constant and only needed
+    /// at the end, when freezing [_MutableArrayData].
+    variadic_character_buffers: Vec<Buffer>,

bkietz · 2023-09-06T18:07:21Z

arrow-data/src/equal/binary_view.rs

+    let lhs_b = lhs.buffers();
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_b = rhs.buffers();
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len = *l as u32;
+        let r_len = *r as u32;
+        if l_len != r_len {
+            return false;
+        } else if l_len <= 12 {
+            // Inline storage
+            if l != r {
+                return false;
+            }
+        } else {
+            let l_view = View::from(*l);
+            let r_view = View::from(*r);
+            let l_b = &lhs_b[(l_view.buffer_index as usize) + 1];
+            let r_b = &rhs_b[(r_view.buffer_index as usize) + 1];
+
+            let l_o = l_view.offset as usize;
+            let r_o = r_view.offset as usize;
+            let len = l_len as usize;
+            if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
+                return false;
+            }
+        }
+    }
+    true


For equality comparison, we can compare the length and prefix of the views simultaneously:

Suggested change

-    let lhs_b = lhs.buffers();
-    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
-    let rhs_b = rhs.buffers();
-
-    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
-        // Only checking one null mask here because by the time the control flow reaches
-        // this point, the equality of the two masks would have already been verified.
-        if lhs.is_null(idx) {
-            continue;
-        }
-
-        let l_len = *l as u32;
-        let r_len = *r as u32;
-        if l_len != r_len {
-            return false;
-        } else if l_len <= 12 {
-            // Inline storage
-            if l != r {
-                return false;
-            }
-        } else {
-            let l_view = View::from(*l);
-            let r_view = View::from(*r);
-            let l_b = &lhs_b[(l_view.buffer_index as usize) + 1];
-            let r_b = &rhs_b[(r_view.buffer_index as usize) + 1];
-
-            let l_o = l_view.offset as usize;
-            let r_o = r_view.offset as usize;
-            let len = l_len as usize;
-            if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
-                return false;
-            }
-        }
-    }
-    true
+    let lhs_b = &lhs.buffers()[1..];
+    let rhs_views = &rhs.buffer::<u128>(0)[rhs_start..rhs_start + len];
+    let rhs_b = &rhs.buffers()[1..];
+
+    for (idx, (l, r)) in lhs_views.iter().zip(rhs_views).enumerate() {
+        // Only checking one null mask here because by the time the control flow reaches
+        // this point, the equality of the two masks would have already been verified.
+        if lhs.is_null(idx) {
+            continue;
+        }
+
+        let l_len_prefix = *l as u64;
+        let r_len_prefix = *r as u64;
+        if l_len_prefix != r_len_prefix {
+            return false;
+        }
+
+        let len = *l as u32;
+        if len <= 12 {
+            if (*l >> 64) as u64 != (*r >> 64) as u64 {
+                return false;
+            }
+            continue;
+        }
+
+        let l_view = View::from(*l);
+        let r_view = View::from(*r);
+        let l_b = &lhs_b[l_view.buffer_index as usize];
+        let r_b = &rhs_b[r_view.buffer_index as usize];
+
+        // prefixes are already known to be equal; skip checking them
+        let len = len as usize - 4;
+        let l_o = l_view.offset as usize + 4;
+        let r_o = r_view.offset as usize + 4;
+        if l_b[l_o..l_o + len] != r_b[r_o..r_o + len] {
+            return false;
+        }
+    }
+    true

I'd have hoped LLVM is smart enough to work this out but I can double-check
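
For readers following the equality discussion: each view is a little-endian `u128` whose low 64 bits hold the 4-byte length followed by the first 4 bytes of data, so one `u64` comparison can check length and prefix together. Below is a minimal standalone sketch of that idea. It assumes plain `Vec<u8>` data buffers instead of arrow `Buffer`s, and `views_equal` is a hypothetical helper, not a function from this PR.

```rust
/// Compares the byte strings described by two views.
///
/// `l` and `r` are views read as little-endian u128 values. `lhs_buffers` and
/// `rhs_buffers` are the data buffers the views index into (assumption: plain
/// byte vectors standing in for arrow `Buffer`s).
fn views_equal(l: u128, r: u128, lhs_buffers: &[Vec<u8>], rhs_buffers: &[Vec<u8>]) -> bool {
    // Low 64 bits = 4-byte length + first 4 bytes of data (inline or prefix).
    if l as u64 != r as u64 {
        return false;
    }

    let len = l as u32 as usize;
    if len <= 12 {
        // Inline storage: the remaining 8 data bytes live in the high half.
        // This assumes unused inline bytes are zero-padded.
        return (l >> 64) as u64 == (r >> 64) as u64;
    }

    // Long values: bits 64..96 hold the buffer index, bits 96..128 the offset.
    let (l_idx, l_off) = ((l >> 64) as u32 as usize, (l >> 96) as u32 as usize);
    let (r_idx, r_off) = ((r >> 64) as u32 as usize, (r >> 96) as u32 as usize);

    // The 4-byte prefixes are already known to be equal, so skip them.
    lhs_buffers[l_idx][l_off + 4..l_off + len] == rhs_buffers[r_idx][r_off + 4..r_off + len]
}
```

Whether this beats the original two-step comparison in practice depends on what the optimizer already does, so it would need benchmarking.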

alamb · 2024-02-12T21:26:04Z

See #5374 for implementation discussions

doki23 · 2024-02-18T01:20:38Z

arrow-array/src/array/byte_view_array.rs

+
+/// An array of variable length byte view arrays
+pub struct GenericByteViewArray<T: ByteViewType> {
+    data_type: DataType,


Why do we need this? Can we just use T::DATA_TYPE?

doki23 · 2024-02-18T02:15:28Z

arrow-array/src/builder/byte_view_builder.rs

+        let v: &[u8] = value.as_ref().as_ref();
+        let length: u32 = v.len().try_into().unwrap();
+        if length <= 12 {
+            let mut offset = [0; 16];


I'd recommend renaming offset to view_buffer.
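
To illustrate why `view_buffer` is the more accurate name: the 16-byte array is the whole view, not an offset. A rough sketch of how such a view could be assembled, assuming the layout discussed in this thread (4-byte length, then either up to 12 inline bytes, or a 4-byte prefix plus a buffer index and an offset). `make_view` and its parameters are hypothetical and not taken from the builder in this PR.

```rust
/// Builds a 16-byte view for `data` and returns it as a little-endian u128.
///
/// `buffer_index` and `offset` are only used for values longer than 12 bytes;
/// they say where the full value lives in the variadic data buffers.
fn make_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    assert!(data.len() <= u32::MAX as usize, "value too long for a view");
    let length = data.len() as u32;

    let mut view_buffer = [0u8; 16];
    view_buffer[0..4].copy_from_slice(&length.to_le_bytes());

    if length <= 12 {
        // Inline storage: the bytes follow the length, zero-padded to 16 bytes.
        view_buffer[4..4 + data.len()].copy_from_slice(data);
    } else {
        // Out-of-line storage: keep a 4-byte prefix plus the value's location.
        view_buffer[4..8].copy_from_slice(&data[0..4]);
        view_buffer[8..12].copy_from_slice(&buffer_index.to_le_bytes());
        view_buffer[12..16].copy_from_slice(&offset.to_le_bytes());
    }

    u128::from_le_bytes(view_buffer)
}
```

For example, `make_view(b"hello", 0, 0)` stores all five bytes inline, while a 33-byte value keeps only its first four bytes in the view and points at the data buffer for the rest.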

doki23 · 2024-02-18T02:16:00Z

arrow-array/src/array/byte_view_array.rs

+            self.len()
+        );
+
+        assert!(i < self.views.len());


This assertion duplicates the previous one.

sundy-li · 2024-02-22T03:28:52Z

arrow-data/src/equal/binary_view.rs

+            }
+        } else {
+            let l_view = View::from(*l);
+            let r_view = View::from(*r);


We can compare the prefix to short-circuit.

As described in the paper:

> In case of long strings, the remaining four bytes of the header are used to store the first four characters of the string, allowing Umbra to short-circuit some comparisons

Oh, I just noticed this has already been commented on. Ignore this.

sundy-li · 2024-02-22T03:31:17Z

arrow-data/src/view.rs

+/// The element layout of a view buffer
+///
+/// See [`DataType::Utf8View`](arrow_schema::DataType)
+pub struct View {


Do we need to use #[repr(C)]?
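
Some background on the question: Rust's default representation does not guarantee field order, so any code that reinterprets raw buffer bytes as a struct would need `#[repr(C)]` to pin the layout. Converting explicitly through `u128` avoids depending on the layout at all. The sketch below shows both. The field set (`length`, `prefix`, `buffer_index`, `offset`) is an assumption based on how views are used elsewhere in this thread, not a copy of the struct in this PR.

```rust
/// Hypothetical long-form view. With `#[repr(C)]` the four u32 fields are laid
/// out in declaration order with no padding, giving a 16-byte struct.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct View {
    pub length: u32,
    pub prefix: u32,
    pub buffer_index: u32,
    pub offset: u32,
}

impl From<u128> for View {
    /// Decodes a view stored as a little-endian u128 without relying on the
    /// in-memory layout of `View`.
    fn from(v: u128) -> Self {
        Self {
            length: v as u32,
            prefix: (v >> 32) as u32,
            buffer_index: (v >> 64) as u32,
            offset: (v >> 96) as u32,
        }
    }
}

impl From<View> for u128 {
    /// Packs the fields back into the u128 form.
    fn from(v: View) -> Self {
        (v.length as u128)
            | ((v.prefix as u128) << 32)
            | ((v.buffer_index as u128) << 64)
            | ((v.offset as u128) << 96)
    }
}
```

If the struct is only ever built through conversions like these, `#[repr(C)]` is not strictly required, but it makes the intended layout explicit.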

sundy-li · 2024-02-22T03:34:18Z

arrow-data/src/view.rs

+    buffers: &[Buffer],
+) -> Result<(), ArrowError> {
+    validate_view_impl(views, buffers, |idx, b| {
+        std::str::from_utf8(b).map_err(|e| {


The simdutf8 crate may be a better choice than std::str::from_utf8 here.
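
For context, here is a small sketch of per-value UTF-8 validation using only the standard library. `validate_utf8_values` and its error format are hypothetical and are not the `validate_view_impl` machinery from this PR. The simdutf8 crate documents a `simdutf8::basic::from_utf8` function that takes a byte slice like `std::str::from_utf8` does, so it could presumably be swapped into the closure, but that swap is untested here.

```rust
/// Checks that every value is valid UTF-8 and reports the index of the first
/// value that is not.
///
/// Assumption: values arrive as plain byte slices; in arrow they would be
/// resolved from views and data buffers first.
fn validate_utf8_values<'a>(
    values: impl IntoIterator<Item = &'a [u8]>,
) -> Result<(), String> {
    for (idx, bytes) in values.into_iter().enumerate() {
        // `std::str::from_utf8` validates without copying the input.
        std::str::from_utf8(bytes)
            .map_err(|e| format!("non-UTF-8 data at index {}: {}", idx, e))?;
    }
    Ok(())
}
```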

alamb · 2024-02-26T18:36:39Z

Thanks for the comments @sundy-li . Is there any chance you or someone at DataBend are interested in / planning on working on this feature?

sundy-li · 2024-02-27T06:53:00Z

> Thanks for the comments @sundy-li . Is there any chance you or someone at DataBend are interested in / planning on working on this feature?

Yes, we are working on it at databendlabs/databend#14662

In Databend, our column memory model is still based on arrow2, but reading and writing use arrow-rs. We are planning to work on it after databendlabs/databend#14662 is finished.

@ariesdevil

ariesdevil · 2024-02-27T06:57:30Z

Hi @alamb, I'm willing to work on this feature after databendlabs/databend#14662 is finished.

alamb · 2024-02-28T11:22:45Z

> Hi @alamb, I'm willing to work on this feature after datafuselabs/databend#14662 is finished.

Thank you @sundy-li and @ariesdevil

I left a comment on #5374 about how we can potentially work on this together

tustvold · 2024-03-06T03:26:57Z

I'm going to close this PR as I believe others are picking this up

alamb · 2024-03-06T07:56:58Z

Add StringViewArray implementation and layout and basic construction + tests apache/arrow#5469

Indeed -- we are tracking the work in #5374

github-actions bot added the arrow Changes to the arrow crate label Jul 29, 2023

tustvold mentioned this pull request Jul 30, 2023

GH-35627: [C++][Format][Integration] Add string view to the arrow format apache/arrow#35628

Closed

alamb reviewed Jul 30, 2023

tustvold force-pushed the array-view branch from f90dfb0 to 51da8c2 Compare August 30, 2023 14:32

github-actions bot added the parquet Changes to the parquet crate label Aug 30, 2023

Add StringViewArray and BinaryViewArray (apache#4253)

0411e3e

tustvold force-pushed the array-view branch from 51da8c2 to 0411e3e Compare August 30, 2023 14:37

tustvold added 2 commits August 30, 2023 16:15

Add DataTypeLayout::variadic

29a91fb

Stricter view verification

a73ef7c

bkietz requested changes Sep 6, 2023

alamb assigned tustvold Sep 11, 2023

bkietz mentioned this pull request Sep 12, 2023

GH-35627: [Format][Integration] Add string-view to arrow format apache/arrow#37526

Merged

This was referenced Feb 12, 2024

[EPIC] Implement StringViewArray and BinaryViewArray #5374

Closed

Prototype ArrayView Types #4253

Closed

doki23 reviewed Feb 18, 2024

sundy-li reviewed Feb 22, 2024

This was referenced Mar 4, 2024

Add DataType::Utf8View and DataType::BinaryView #5468

Closed

Add StringViewArray implementation and layout and basic construction + tests #5469

Closed

tustvold closed this Mar 6, 2024

ariesdevil mentioned this pull request Mar 7, 2024

feat: initial support string_view and binary_view, supports layout and basic construction + tests #5481

Merged

alamb mentioned this pull request Apr 9, 2024

Encapsulate View logic for GenericByteViewArray #5619

Closed
