[arrow-string] Implement string view support for regexp_match#6849
[arrow-string] Implement string view support for regexp_match#6849alamb merged 4 commits intoapache:mainfrom
regexp_match#6849Conversation
Signed-off-by: Tai Le Manh <[email protected]>
wiedld
left a comment
There was a problem hiding this comment.
I really liked the use of macros to make the test cases very clean. Although I'm unclear of the reason to use macros elsewhere.
| let builder = StringViewBuilder::with_capacity(0); | ||
| let mut list_builder = ListBuilder::new(builder); | ||
|
|
||
| process_regexp_array_match!(array, regex_array, flags_array, list_builder); |
There was a problem hiding this comment.
Confirmed this expands to the original logic, for both StringView (here) and StringArrays (above), to match on an array of regex. Why the use of macros?
| fn regexp_scalar_match_utf8view( | ||
| array: &StringViewArray, | ||
| regex: &Regex, | ||
| ) -> Result<ArrayRef, ArrowError> { | ||
| let builder = StringViewBuilder::with_capacity(0); | ||
| let mut list_builder = ListBuilder::new(builder); | ||
|
|
||
| process_regexp_match!(array, regex, list_builder); |
There was a problem hiding this comment.
Same pattern, and same question as before.
It's the same logic (for a single regex) into a macro to use for both StringView and StringArray. Why the macro?
There was a problem hiding this comment.
@wiedld Thanks so much for reviewing ❤️
Confirmed this expands to the original logic, for both StringView (here) and StringArrays (above), to match on an array of regex. Why the use of macros?
It's the same logic (for a single regex) into a macro to use for both StringView and StringArray. Why the macro?
I just want to reduce duplicate code here. I remember that I have struggled when I implement generic functions instead of macro. But if it makes code hard to read/debug/maintain I can try to remove the macro and using another solution.
There was a problem hiding this comment.
I played around a bit, and It looks like some of the common array builder methods are not traits on ArrayBuilder. Specifically, I got to here and was blocked by the builder.append_value():
fn process_regexp_match<'a, T: ?Sized, A: ArrayAccessor<Item = &'a str>, O: OffsetSizeTrait, B: ArrayBuilder>(
array: ArrayIter<A>,
regex: &Regex,
list_builder: &mut GenericListBuilder<O, B>,
) -> Result<(), ArrowError> {
...
}
It looks like this method is only implemented on 9 of the 15 current ArrayBuilder implementations. Not sure on whether the method should be pulled into the interface or whether we want additional builder traits/interface definitions (maybe in a separate PR). @alamb ?
Otherwise, yes the macro is the alternative for DRY. But it feels like we want abstraction (a.k.a. a trait).
There was a problem hiding this comment.
A trait sounds like a good idea to me, but unless we have a specific idea/proposal of one, I think macros are the best way to proceed (we can always clean the code up later)
Or perhaps @wiedld if you have time you can try and work out what a trait based solution would look like so we can compare
There was a problem hiding this comment.
I can do that in a follow up PR. Thank you for the input @alamb .
@tlm365 -- if the additional test gets added, then I can approve.
There was a problem hiding this comment.
@alamb @tlm365 -- Here is a proposed trait ValuesBuilder: ArrayBuilder for common interfaces in building a list of values.
Note that there were several other common sets of builder functionality (such as building lists of lists), but I didn't address that due to (a) keep the PR small, and (b) I knew only of this immediate use case (altho there are likely others).
| expected_builder.values().append_value("005"); | ||
| expected_builder.append(true); | ||
| expected_builder.values().append_value("7"); | ||
| expected_builder.append(true); | ||
| expected_builder.append(false); | ||
| expected_builder.append(false); | ||
| expected_builder.append(false); | ||
| expected_builder.values().append_value(""); | ||
| expected_builder.append(true); |
There was a problem hiding this comment.
Much nicer not having this anymore.
| test_match_scalar_pattern!( | ||
| match_scalar_pattern_string_no_flags, | ||
| vec![Some("abc-005-def"), Some("x-7-5"), Some("X545"), None], | ||
| r"x.*-(\d*)-.*", | ||
| None::<&str>, | ||
| StringArray, | ||
| GenericStringBuilder<i32>, | ||
| [None, Some("7"), None, None] | ||
| ); |
There was a problem hiding this comment.
There is no difference in the match results with, or without, flags. Might be nice to have test cases demonstrating this difference. 🙏🏼
Signed-off-by: Tai Le Manh <[email protected]>
wiedld
left a comment
There was a problem hiding this comment.
Thanks for the added test cases.
👍 🚀 |
…he#6849) * [arrow-string] Implement string view suport for regexp match Signed-off-by: Tai Le Manh <[email protected]> * update unit tests * fix clippy warnings * Add test cases Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]>
…he#6849) * [arrow-string] Implement string view suport for regexp match Signed-off-by: Tai Le Manh <[email protected]> * update unit tests * fix clippy warnings * Add test cases Signed-off-by: Tai Le Manh <[email protected]> --------- Signed-off-by: Tai Le Manh <[email protected]>
Which issue does this PR close?
Closes #6717.
Rationale for this change
What changes are included in this PR?
Add support string view (
Utf8View) forregexp_match, unit tests.Are there any user-facing changes?
More type support for
regexp_match.