Add optimized filter kernel for regular expression matching #697

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In apache/datafusion#870, @b41sh added support for filtering values that do or do not match a particular regular expression. However, it uses the regexp_match kernel (the only one available at the time of writing), which returns the actual matches (as a ListArray) rather than a simple true/false per row (as a BooleanArray) indicating whether the row matched. This is suboptimal because:

  1. It is more work to construct a ListArray than a BooleanArray
  2. There is extra work to then turn the ListArray back into a BooleanArray
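
To make the two costs above concrete, here is a minimal sketch in plain Rust. It does not use the arrow crate: a `Vec<Option<Vec<String>>>` stands in for a ListArray, a `Vec<Option<bool>>` stands in for a BooleanArray, and a substring search stands in for the regex engine. All function names are hypothetical and exist only for illustration; the sketch does not claim to reproduce the exact output of regexp_match.

```rust
/// Hypothetical stand-in for a regex engine: collects every occurrence
/// of `needle` in `haystack` (a real kernel would run a regex here).
fn find_all(haystack: &str, needle: &str) -> Vec<String> {
    haystack.matches(needle).map(str::to_string).collect()
}

/// Two-step approach: first materialize the matches for every row
/// (modelling the ListArray), then collapse each list to a boolean.
fn matched_via_lists(rows: &[Option<&str>], needle: &str) -> Vec<Option<bool>> {
    // Step 1: allocate a list of matches per non-null row.
    let lists: Vec<Option<Vec<String>>> = rows
        .iter()
        .map(|row| row.map(|s| find_all(s, needle)))
        .collect();
    // Step 2: walk the lists again just to ask "was there any match?".
    lists
        .iter()
        .map(|list| list.as_ref().map(|m| !m.is_empty()))
        .collect()
}

/// Direct approach: produce the boolean in a single pass,
/// without allocating any intermediate match lists.
fn matched_directly(rows: &[Option<&str>], needle: &str) -> Vec<Option<bool>> {
    rows.iter()
        .map(|row| row.map(|s| s.contains(needle)))
        .collect()
}
```

Both functions return the same answer for a non-empty needle; the second simply skips the intermediate allocation and the second pass, which is the saving this issue is after.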

Describe the solution you'd like
Add an arrow compute kernel (perhaps in the comparison module) with a signature like the one below.

The name is still TBD: regexp_matches_utf8 parallels like_utf8, but is perhaps too similar to regexp_match.

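// Proposed signature (name TBD); returns one boolean per input row.
pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(
    array: &GenericStringArray<OffsetSize>,               // strings to test
    regex_array: &GenericStringArray<OffsetSize>,         // regular expressions to apply
    flags_array: Option<&GenericStringArray<OffsetSize>>, // optional regex flags
) -> Result<BooleanArray>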

where each element of the resulting BooleanArray is:

  • true if there were one or more matches of the regex array/flags
  • false if there were zero matches of the regex array/flags
  • NULL if the input or the regexp array was null (the same null semantics as regexp_match and like_utf8)
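
The three outcomes listed above can be sketched row by row in plain Rust. This is only an illustration of the proposed null semantics, not the kernel itself: slices of `Option<&str>` stand in for the string arrays, the flags argument is omitted, both inputs are assumed to have the same length, and the `is_match` callback is a hypothetical placeholder for whatever regex engine the kernel would use.

```rust
/// Row-wise sketch of the proposed semantics:
///   Some(true)  -> the pattern matched the value at least once
///   Some(false) -> the pattern did not match
///   None        -> either the value or the pattern was null
///
/// `is_match` is a hypothetical hook for the regex engine.
fn matches_rowwise<F>(
    values: &[Option<&str>],
    patterns: &[Option<&str>],
    is_match: F,
) -> Vec<Option<bool>>
where
    F: Fn(&str, &str) -> bool,
{
    values
        .iter()
        .zip(patterns.iter())
        .map(|(value, pattern)| match (*value, *pattern) {
            // Both sides present: evaluate the match.
            (Some(v), Some(p)) => Some(is_match(v, p)),
            // A null on either side propagates to the output.
            _ => None,
        })
        .collect()
}
```

Calling it with a substring check such as `|v, p| v.contains(p)` in place of a real regex is enough to exercise the null handling; a real implementation would compile each pattern (ideally caching repeated ones) and write directly into a BooleanArray builder.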

Describe alternatives you've considered
None yet

Additional context
See use in apache/datafusion#870

Labels

enhancement: Any new improvement worthy of an entry in the changelog
