Implement `regexp_matches_utf8` by b41sh · Pull Request #706 · apache/arrow-rs

b41sh · 2021-08-22T02:08:22Z

Which issue does this PR close?

Closes #697

Rationale for this change

What changes are included in this PR?

Implements the arrow compute comparison kernels `regexp_matches_utf8`, `regexp_matches_utf8_scalar`, `regexp_not_matches_utf8`, and `regexp_not_matches_utf8_scalar`, along with test cases.

Are there any user-facing changes?

codecov-commenter · 2021-08-22T02:22:52Z

Codecov Report

Merging #706 (b56b6ef) into master (8308615) will increase coverage by 0.02%.
The diff coverage is 91.74%.

@@            Coverage Diff             @@
##           master     #706      +/-   ##
==========================================
+ Coverage   82.46%   82.48%   +0.02%     
==========================================
  Files         168      168              
  Lines       47419    47528     +109     
==========================================
+ Hits        39104    39204     +100     
- Misses       8315     8324       +9

Impacted Files	Coverage Δ
arrow/src/compute/kernels/comparison.rs	`95.08% <91.74%> (-0.76%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb · 2021-08-23T21:04:19Z

Thanks @b41sh -- I plan to review this tomorrow morning

alamb

Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.

I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well

alamb · 2021-08-24T12:09:50Z

arrow/src/compute/kernels/comparison.rs

 }

+/// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>(


As mentioned in #697 (comment), the name regexp_matches_utf8 is pretty similar to regexp_match which I think might cause confusion

It looks like the rust regular expression library uses is_match so I wonder if following that model might be clearer?

So instead of regexp_matches_utf8, maybe call this regexp_is_match_utf8 (and similarly below)?

@seddonm1 / @nevi-me / @jorgecarleitao do you have any thoughts regarding the naming of a function that checks if values match regular expressions?

b41sh · 2021-08-24

I agree with you, regexp_is_match_utf8 is a little bit better.

seddonm1 · 2021-08-24

This looks like the https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-TABLE functionality, which uses an operator (~) instead of a function. I think the kernel names are fine, but it would be nice if we could retain the postgres semantics in order to achieve as much compatibility as possible.

jorgecarleitao · 2021-08-24

imo achieving full compatibility with postgres is beyond the scope of the arrow-rs crate. IMO the least surprising option here is an implementation of the regex semantics declared in the regex crate. Users interested in the postgres semantics are a subset of all users; postgres semantics are relatively specific to databases.

Also, afaik achieving postgres semantics is relatively difficult as it requires a regex parser to act as the postgres one (but I may be mistaken here).

A possible approach is to move both of these kernels to datafusion and keep a non-postgres, regex-crate specific one here.

seddonm1 · 2021-08-24

Agreed jorge, sorry, I had missed that this was being added at the Arrow crate level. I had previously implemented the Postgres regex behavior in the Datafusion layer and agree that, due to differences in regex semantics, maybe this does not belong in the arrow-rs crate.

My main point would be from the user experience point of view it would be good to implement the 'abcd' ~ 'bc' style (even if this requires sqlparser updates) rather than implement new non-standard SQL functions.


alamb · 2021-08-24T12:18:12Z

arrow/src/compute/kernels/comparison.rs

+}
+
+/// Perform SQL `array !~ regex_array` operation on [`StringArray`] / [`LargeStringArray`].
+pub fn regexp_not_matches_utf8<OffsetSize: StringOffsetSizeTrait>(


I wonder if we need the negative variant kernels (e.g. regexp_not_matches_utf8)? Perhaps we could simply use not(regexp_matches_utf8(..)).
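
The composition suggested above can be illustrated without the arrow crate. This is only a minimal sketch under stated assumptions, not the code from this PR: `matches_kernel`, `not_kernel`, and `not_matches` are hypothetical names, nullable arrays are modeled as slices of `Option`, and `str::contains` stands in for a real regular-expression match so that the example needs only the standard library.

```rust
/// Hypothetical stand-in for a "matches" kernel: element-wise over two
/// nullable string columns. `str::contains` replaces a real regex match
/// here (an assumption, to keep the sketch std-only).
fn matches_kernel(values: &[Option<&str>], patterns: &[Option<&str>]) -> Vec<Option<bool>> {
    values
        .iter()
        .zip(patterns.iter())
        .map(|(v, p)| match (v, p) {
            (Some(v), Some(p)) => Some(v.contains(*p)),
            _ => None, // a null on either side yields a null result
        })
        .collect()
}

/// Hypothetical boolean `not` kernel: flips valid slots and keeps nulls.
fn not_kernel(input: &[Option<bool>]) -> Vec<Option<bool>> {
    input.iter().map(|slot| slot.map(|b| !b)).collect()
}

/// "Not matches" expressed purely as a composition of the two kernels above.
fn not_matches(values: &[Option<&str>], patterns: &[Option<&str>]) -> Vec<Option<bool>> {
    not_kernel(&matches_kernel(values, patterns))
}
```

Because the negation preserves null slots, the composed form gives the same per-row answer a dedicated negative kernel would, at the cost of one extra pass over the boolean output.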


b41sh · 2021-08-24T16:03:49Z

> Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.
>
> I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well

@alamb I have added some comments and removed the function regexp_not_matches_utf8. PTAL

alamb

I think it is looking great @b41sh -- thank you. I agree with @jorgecarleitao and @seddonm1 that this kernel should follow the model of the rust regex crate (which I think it does)

I'll plan to merge this tomorrow (and include it as part of arrow-rs 5.3.0) unless anyone has additional comments. Just let me know

alamb · 2021-08-25T10:22:12Z

arrow/src/compute/kernels/comparison.rs

+///
+/// `flags_array` are optional [`StringArray`] / [`LargeStringArray`] flag, which allow
+/// special search modes, such as case insensitive and multi-line mode.
+/// See the documentation [here](https://docs.rs/regex/1.5.4/regex/#grouping-and-flags)



impl regexp_matches_utf8

8227498

github-actions bot added the arrow Changes to the arrow crate label Aug 22, 2021

fix clippy

a726656

alamb reviewed Aug 24, 2021


b41sh added 2 commits August 24, 2021 20:24

add bench

2d0c465

optimize

b56b6ef

alamb approved these changes Aug 25, 2021


alamb merged commit 48c7529 into apache:master Aug 26, 2021

alamb pushed a commit that referenced this pull request Aug 26, 2021

Implement regexp_matches_utf8 (#706)

483ca60

* impl regexp_matches_utf8
* fix clippy
* add bench
* optimize

alamb added the cherry-picked label Aug 26, 2021

alamb mentioned this pull request Aug 26, 2021

Cherry pick Implement regexp_matches_utf8 to active_release #717

Merged

alamb added a commit that referenced this pull request Aug 26, 2021

Implement regexp_matches_utf8 (#706) (#717)

446a4b7

* impl regexp_matches_utf8
* fix clippy
* add bench
* optimize

Co-authored-by: baishen <[email protected]>

jhorstmann mentioned this pull request Oct 5, 2021

Investigate usages of ArrayData::new (WIP) #813

Closed

4 tasks

alamb mentioned this pull request Nov 3, 2021

Disable test under MIRI which appears to exceed memory limits intermittently on github CI #910

Closed
