ARROW-12835: [C++][Python][R] Implement case-insensitive match using RE2#10369
ARROW-12835: [C++][Python][R] Implement case-insensitive match using RE2#10369lidavidm wants to merge 1 commit intoapache:masterfrom
Conversation
|
Bikeshedding here, but maybe we should rename the C++ and Python argument to |
|
ignore_case is also what Python's regex engine calls it at least (though not RE2), so I've renamed it. |
pitrou
left a comment
There was a problem hiding this comment.
Could you give an example where re2 doesn't follow proper Unicode semantics?
There was a problem hiding this comment.
I don't understand why two MatchSubstring and MatchSubstringImpl classes. It seems one should be sufficient?
There was a problem hiding this comment.
There's only one of each. I moved them around in this PR, but it's the same as before.
lidavidm
left a comment
There was a problem hiding this comment.
An example of RE2's Unicode handling is here: google/re2#262
That said I don't think it's too big a deal for us.
There was a problem hiding this comment.
There's only one of each. I moved them around in this PR, but it's the same as before.
It depends what you mean. The fact that |
It is a bit of a bummer but I think it's also not 'unexpected' in that other systems (languages, etc) probably make the same tradeoff, hence why I thought it was worth a note in the docs, but wasn't worth implementing a full solution using utf8proc. |
|
Rebased to check for CI. |
This uses RE2 to implement a case-insensitive substring search.
Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)