Add dictionary array support for substring function #1665
viirya merged 4 commits into apache:master
| /// let error = substring(&array, 0, Some(5)).unwrap_err().to_string(); | ||
| /// assert!(error.contains("invalid utf-8 boundary")); | ||
| /// ``` | ||
| pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> { |
I moved this func to the beginning of the file, before all other non-public ones, for better readability.
| DataType::Dictionary(kt, _) => { | ||
| substring_dict!( | ||
| kt, | ||
| Int8: Int8Type, |
We may make this shorter via concat_idents (e.g., concat_idents($t, Type)) but it's only available in nightly.
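A minimal sketch of why the macro is passed explicit `Variant: Type` pairs on stable Rust: `macro_rules!` cannot paste `Int8` and `Type` into a new identifier, so each pair has to be spelled out. All names here (`KeyKind`, `KeyType`, `key_width!`, `width_of`) are hypothetical stand-ins for illustration and are not the arrow-rs definitions.

```rust
// Hypothetical stand-in for the dictionary key data types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum KeyKind {
    Int8,
    Int16,
    Int32,
}

// Hypothetical marker types, one per key kind.
trait KeyType {
    const BYTES: usize;
}
struct Int8Type;
struct Int16Type;
struct Int32Type;
impl KeyType for Int8Type {
    const BYTES: usize = 1;
}
impl KeyType for Int16Type {
    const BYTES: usize = 2;
}
impl KeyType for Int32Type {
    const BYTES: usize = 4;
}

// Each match arm is generated from an explicit `Variant: Type` pair,
// because stable macros cannot build `Int8Type` from `Int8`.
macro_rules! key_width {
    ($kind:expr, $($variant:ident: $ty:ident),+ $(,)?) => {
        match $kind {
            $(KeyKind::$variant => <$ty as KeyType>::BYTES,)+
        }
    };
}

fn width_of(kind: KeyKind) -> usize {
    key_width!(kind, Int8: Int8Type, Int16: Int16Type, Int32: Int32Type)
}
```

The pair list is repetitive, but it keeps the dispatch on stable Rust and makes a missing key type a compile error in the generated `match`.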
| } | ||
| #[test] | ||
| fn dictionary() -> Result<()> { |
| fn dictionary() -> Result<()> { | |
| fn test_substring_dictionary() -> Result<()> { |
I don't think it's necessary to add a test_ prefix to Rust tests since they are already under the tests module. The substring here also seems redundant since the full test name compute::kernels::substring::tests::dictionary already contains it.
viirya left a comment
Looks good to me. A few minor comments.
Codecov Report
@@ Coverage Diff @@
## master #1665 +/- ##
==========================================
+ Coverage 83.10% 83.16% +0.05%
==========================================
Files 193 193
Lines 55864 56039 +175
==========================================
+ Hits 46425 46603 +178
+ Misses 9439 9436 -3
| /// let error = substring(&array, 0, Some(5)).unwrap_err().to_string(); | ||
| /// assert!(error.contains("invalid utf-8 boundary")); | ||
| /// ``` | ||
| pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> { |
Just a nit: maybe we could let length be Option<u32>, because the longest length will not exceed 1<<31 - 1 (for LargeBinaryArray and LargeStringArray).
Hmm I think this is not quite related to this PR. I can open another one for the change.
| /// ``` | ||
| /// | ||
| /// # Error | ||
| /// - The function errors when the passed array is not a \[Large\]String array or \[Large\]Binary array. |
We may also update this to mention that Dictionary arrays with [Large]String/[Large]Binary values are accepted as well.
Thank you @sunchao ❤️
Merged, thanks @sunchao @HaoYang670 @alamb
Which issue does this PR close?
Closes #1656.
Rationale for this change
Currently the substring kernel only supports "plain" arrays but not dictionary-encoded ones. With a dictionary array, the compute could be much more efficient since it only needs to be done on the dictionary values.
What changes are included in this PR?
This PR adds dictionary array support to the substring kernel.
Are there any user-facing changes?
No
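
The rationale above (substring only needs to run on the dictionary values) can be illustrated with a small standard-library sketch. This is an assumption-laden model, not the arrow-rs implementation: `DictStrings`, `substring_str`, `substring_dict`, and `decode` are hypothetical names, nulls are ignored, and offsets are treated as byte offsets with a negative `start` counting from the end.

```rust
// Hypothetical, simplified model of a dictionary-encoded string array:
// each key is an index into `values`.
struct DictStrings {
    keys: Vec<usize>,
    values: Vec<String>,
}

// Byte-based substring of one string. Returns an error if the requested
// range does not fall on UTF-8 character boundaries.
fn substring_str(s: &str, start: i64, length: Option<u64>) -> Result<&str, String> {
    let len = s.len() as i64;
    // A negative start counts back from the end of the string.
    let begin = if start >= 0 { start.min(len) } else { (len + start).max(0) };
    let begin = begin as usize;
    let end = match length {
        Some(l) => begin.saturating_add(l as usize).min(s.len()),
        None => s.len(),
    };
    s.get(begin..end)
        .ok_or_else(|| format!("invalid utf-8 boundary in {:?}", s))
}

// Apply the substring once per distinct dictionary value and reuse the keys.
fn substring_dict(
    array: &DictStrings,
    start: i64,
    length: Option<u64>,
) -> Result<DictStrings, String> {
    let values = array
        .values
        .iter()
        .map(|v| substring_str(v, start, length).map(str::to_owned))
        .collect::<Result<Vec<_>, _>>()?;
    Ok(DictStrings {
        keys: array.keys.clone(),
        values,
    })
}

// Expand the dictionary back into one string per logical row.
fn decode(array: &DictStrings) -> Vec<&str> {
    array.keys.iter().map(|&k| array.values[k].as_str()).collect()
}
```

With five rows but only two distinct values, `substring_dict` performs two substring computations instead of five; the keys are copied unchanged.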