Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In apache/datafusion#870, @b41sh added support for filtering all values that do/do not match a particular regular expression. However, it uses the regexp_match kernel (the only one available at the time of writing), which returns the actual matches (as a ListArray) rather than just a true/false value (BooleanArray) indicating whether each row matched. This is suboptimal because:
- It is more work to construct a ListArray than a BooleanArray
- There is extra work to then turn the ListArray back into a BooleanArray
Describe the solution you'd like
Add an arrow compute kernel (perhaps in the comparison module) that looks like the signature below. A better name is TBD: regexp_matches_utf8 is similar to like_utf8, but also perhaps too similar to regexp_match.
    pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
        array: &GenericStringArray<OffsetSize>,
        regex_array: &GenericStringArray<OffsetSize>,
        flags_array: Option<&GenericStringArray<OffsetSize>>,
    ) -> Result<BooleanArray>
Where the resulting BooleanArray is
- true if there was 1 or more matches of the regex array/flags
- false if there were 0 matches of the regex array/flags
- NULL if the input or regexp array was null (i.e. the same null semantics as regexp_match and like_utf8)
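The true/false/NULL rules above can be illustrated with a small, dependency-free sketch. It is not the proposed kernel. Nullable columns are modeled as slices of `Option<&str>`, the names `regexp_matches_sketch` and `is_match` are hypothetical, the flags argument is left out, and a plain substring test stands in for a real regex engine so the example compiles with only the standard library.

```rust
/// Hypothetical stand-in for a regex engine: reports whether `pattern`
/// occurs in `value` as a plain substring. A real kernel would compile
/// `pattern` (and any flags) with a regex library instead.
fn is_match(value: &str, pattern: &str) -> bool {
    value.contains(pattern)
}

/// Sketch of the requested semantics over nullable columns, where `None`
/// models a NULL slot:
///   - Some(true)  if the value matches the pattern at least once
///   - Some(false) if the value does not match
///   - None        if either the value or the pattern is NULL
fn regexp_matches_sketch(
    values: &[Option<&str>],
    patterns: &[Option<&str>],
) -> Result<Vec<Option<bool>>, String> {
    // Both columns must have one entry per row.
    if values.len() != patterns.len() {
        return Err(format!(
            "length mismatch: {} values vs {} patterns",
            values.len(),
            patterns.len()
        ));
    }

    Ok(values
        .iter()
        .zip(patterns.iter())
        .map(|(value, pattern)| match (value, pattern) {
            // Only rows where both sides are non-null are evaluated.
            (Some(v), Some(p)) => Some(is_match(v, p)),
            // A NULL on either side yields a NULL result.
            _ => None,
        })
        .collect())
}
```

The result is produced in a single pass as a flat boolean column. No intermediate list of matches is built and then collapsed, which is the extra work this request aims to avoid.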
Describe alternatives you've considered
None yet
Additional context
See use in apache/datafusion#870