Conversation
Codecov Report
@@ Coverage Diff @@
## master #706 +/- ##
==========================================
+ Coverage 82.46% 82.48% +0.02%
==========================================
Files 168 168
Lines 47419 47528 +109
==========================================
+ Hits 39104 39204 +100
- Misses 8315 8324 +9
Continue to review full report at Codecov.
|
|
Thanks @b41sh -- I plan to review this tomorrow morning |
There was a problem hiding this comment.
Thank you @b41sh -- I think this PR is looking quite good -- the code and changes are easy to understand and well tested.
I do wonder about the name of these kernels as well as the need to have explicit "not matches" kernels. I am hoping for others to give feedback as well
| } | ||
|
|
||
| /// Perform SQL `array ~ regex_array` operation on [`StringArray`] / [`LargeStringArray`]. | ||
| pub fn regexp_matches_utf8<OffsetSize: StringOffsetSizeTrait>( |
There was a problem hiding this comment.
As mentioned in #697 (comment), the name regexp_matches_utf8 is pretty similar to regexp_match which I think might cause confusion
It looks like the rust regular expression library uses is_match so I wonder if following that model might be clearer?
So instead of regexp_matches_utf8 maybe calling this regexp_is_match_utf8 (and similarly below).
@seddonm1 / @nevi-me / @jorgecarleitao do you have any thoughts regarding the naming of a function that checks if values match regular expressions?
There was a problem hiding this comment.
I agree with you, regexp_is_match_utf8 is a little bit better.
There was a problem hiding this comment.
This looks like the https://www.postgresql.org/docs/current/functions-matching.html#FUNCTIONS-POSIX-TABLE functionality which uses an Operator ~ to instead of a function. I think the kernel names are fine but it would be nice if we could retain the postgres semantics in order to achieve as much compatibility as possible.
There was a problem hiding this comment.
imo achieving full compatibility with postgres is beyond the scope of the arrow-rs crate. IMO the least surprising option here is an implementation of the regex semantics declared in the regex crate. Users interested in the postgres semantics are a subset of all users - postgres semantics is something relatively specific to databases.
Also, afaik achieving postgres semantics is relatively difficult as it requires a regex parser to act as the postgres one (but I may be mistaken here).
A possible approach is to move both of these kernels to datafusion and keep a non-postgres, regex-crate specific one here.
There was a problem hiding this comment.
Agreed jorge, sorry I had missed this was being added at the Arrow crate level. I had previously implemented the Postgres regex behavior in the Datafusion layer and agree that due to differences in regex semantics maybe this does not belong in the arrow-rs crate.
My main point would be from the user experience point of view it would be good to implement the 'abcd' ~ 'bc' style (even if this requires sqlparser updates) rather than implement new non-standard SQL functions.
| } | ||
|
|
||
| /// Perform SQL `array !~ regex_array` operation on [`StringArray`] / [`LargeStringArray`]. | ||
| pub fn regexp_not_matches_utf8<OffsetSize: StringOffsetSizeTrait>( |
There was a problem hiding this comment.
I wonder if we need the negative variants (e.g. do we need regexp_not_matches_utf8) kernel? Perhaps we could simply use not(regexp_matches_utf8(..))
@alamb I have added some comment and remove function |
alamb
left a comment
There was a problem hiding this comment.
I think it is looking great @b41sh -- thank you. I agree with @jorgecarleitao and @seddonm1 that this kernel should follow the model of the rust regex crate (which I think it does)
I'll plan to merge this tomorrow (and include it as part of arrow-rs 5.3.0) unless anyone has additional comments. Just let me know
| /// | ||
| /// `flags_array` are optional [`StringArray`] / [`LargeStringArray`] flag, which allow | ||
| /// special search modes, such as case insensitive and multi-line mode. | ||
| /// See the documentation [here](https://docs.rs/regex/1.5.4/regex/#grouping-and-flags) |
* impl regexp_matches_utf8 * fix clippy * add bench * optimize Co-authored-by: baishen <[email protected]>
Which issue does this PR close?
Closes #697
Rationale for this change
What changes are included in this PR?
Implement arrow compute kernel comparison functions
regexp_matches_utf8,regexp_matches_utf8_scalar,regexp_not_matches_utf8,regexp_not_matches_utf8_scalarand test cases.Are there any user-facing changes?