Add substring support for FixedSizeBinaryArray#1633
Conversation
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
| let new_len = match length { | ||
| Some(len) => len.min(old_len - new_start), | ||
| None => old_len - new_start, | ||
| }; |
There was a problem hiding this comment.
Is it possible that if length is negative? DataType::FixedSizeBinary(negative) seems invalid.
There was a problem hiding this comment.
I think the answer is no for several reasons:
- the definition of the public function makes sure that
lengthis always >=0
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {- the definition of the data type of
FixedSizeBinaryseems to allow negative value, although it is somewhat weird:
/// Opaque binary data of fixed size.
/// Enum parameter specifies the number of bytes per value.
FixedSizeBinary(i32),Although it may be more reasonable to use unsigned int to type the length, in Apache Arrow specification, the length must be i32. https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout (For nested arrays, we always use signed integer to represent the length or offsets.)
A fixed size list type is specified like FixedSizeList<T>[N], where T is any type (primitive or nested) and N is a 32-bit signed integer representing the length of the lists.
What would happen if we give a negative length or negative offsets buffer?
This is a fun game!
There was a problem hiding this comment.
Is there any discussion or explanation of why we use signed integer to represent length?
There was a problem hiding this comment.
Ah, that's right, I didn't notice that length is u64.
There was a problem hiding this comment.
Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.
This is some wordings I read in the spec. It is not talking about array length, but I think it might be somehow related to the question.
| // all-nulls array is always identical | ||
| (vec![None, None, None], -1, Some(1), vec![None, None, None]), |
| array.data_ref().null_buffer().cloned(), | ||
| 0, |
There was a problem hiding this comment.
0 offset works for the new array value new_values. But null_buffer is cloned, so maybe it has different offset?
There was a problem hiding this comment.
Thank you very much for finding the bug. I have fixed it by getting the bit_slice of null_buffer. Also a corresponding test is added.
There was a problem hiding this comment.
As for other arrays' implementations we have the same bug, I will file a follow-up issue to fix it.
Signed-off-by: remzi <[email protected]>
Signed-off-by: remzi <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #1633 +/- ##
==========================================
+ Coverage 83.02% 83.08% +0.05%
==========================================
Files 193 193
Lines 55577 55772 +195
==========================================
+ Hits 46145 46338 +193
- Misses 9432 9434 +2
Continue to review full report at Codecov.
|
alamb
left a comment
There was a problem hiding this comment.
I don't have any more comments other than thank you @HaoYang670 and @viirya 👍
|
Thank you @HaoYang670 @alamb ! 😄 |
Which issue does this PR close?
Closes #1618.
What changes are included in this PR?
substringkernel supportFixedSizeBinaryArrayFixedSizeBinaryArrayAre there any user-facing changes?
Rename some benchmarks.
Bench Result