-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ #8893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ #8893
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8893 +/- ##
=============================================
- Coverage 70.09% 35.44% -34.66%
+ Complexity 4965 184 -4781
=============================================
Files 1831 1831
Lines 96270 96335 +65
Branches 14390 14403 +13
=============================================
- Hits 67483 34148 -33335
- Misses 24135 59422 +35287
+ Partials 4652 2765 -1887
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
| */ | ||
| public static String likeToRegexpLike(String likePattern) { | ||
| return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$"; | ||
| String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest using a StringBuilder for the intermediate operations and checks here and then turn it into a string at the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Ideally we should construct the string once to reduce garbage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about that. This code is not in the hotpath. It just called once when FilterContext is created and that's all. Therefore the amount of memory generated should be negligible.
Anyway, the change is easy to apply, so I'm going to do so.
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Nice simple optimization!
| */ | ||
| public static String likeToRegexpLike(String likePattern) { | ||
| return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$"; | ||
| String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. Ideally we should construct the string once to reduce garbage
| return "^$"; | ||
| case 1: | ||
| if (likePattern.charAt(0) == '%') { | ||
| return "^.*$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be ".*" or just ""?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All three are correct. I guess the later is the most efficient
| } | ||
| break; | ||
| default: | ||
| if (likePattern.charAt(0) == '%') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"%%" becomes "", which should be fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we plan to optimize something similar to
LIKE '%%%%%%%%%%%%%zz'
REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz')
listed in this blog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we plan to optimize something similar to
I don't think this is the place to do that because we don't want to just optimize LIKE '%%%%%%%%%%%%%zz', we also want to optimize REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz').
I mean:
- we transform LIKE expressions into REGEXP_LIKE
- we let users to write their own REGEXP_LIKE expressions
- we know some regex in REGEXP_LIKE are dangerous
We should not focus on making 1. safe, we should focus in making 3. safe. Otherwise an attacker may not be able to use LIKE to create an attack but they could use REGEXP_LIKE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"%%" becomes "", which should be fine?
% means matches any string with zero or more characters so LIKE('%%') will match with any text and therefore it should be equivalent to REGEXP_LIKE('')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the explanation. it just seems ok to recursively prune out leading/trailing % e.g. instead of
start = 1; we can do start equal to first non % char.
683df63 to
4ed2bee
Compare
… in LIKE expressions
…t. Add a unit test
…and .*$ (apache#8893) * Optimize like to regexp conversion to do not include useless ^.* and .*$ * Optimize likeToRegexpLike to reduce allocations * LIKE '%' will be transformed into REGEXP_LIKE('') * Add an optimization in case of leading/trailing consecutive wildcards in LIKE expressions * Fix an error in some specific expressions detected by integration test. Add a unit test
RegexpPatternConverterUtils.likeToRegexpLikeis used to translateLIKEexpressions toregexp_match. The transformation is trivial, but there are some very easy optimizations that can be done.For example: As
regexp_matchhas the Java Matcher.find semantics, it doesn't try to match the expression with the whole string but looks for a substring that matches the expression. Therefore, the prefix^.*and the suffix.*$are useless. This PR removes them.The PR doesn't include benchmarks that proves that the optimization is useful, but that was discovered in #8818. The commit of this PR has been cherry picked from there and the benchmarks included in 8818 indicata 2x performance increase. Actual numbers can be seen here #8818 (comment)