Skip to content

Conversation

@gortiz
Copy link
Contributor

@gortiz gortiz commented Jun 15, 2022

RegexpPatternConverterUtils.likeToRegexpLike is used to translate LIKE expressions to regexp_match. The transformation is trivial, but there are some very easy optimizations that can be done.

For example: As regexp_match has the Java Matcher.find semantics, it doesn't try to match the expression with the whole string but looks for a substring that matches the expression. Therefore, the prefix ^.* and the suffix .*$ are useless. This PR removes them.

The PR doesn't include benchmarks that proves that the optimization is useful, but that was discovered in #8818. The commit of this PR has been cherry picked from there and the benchmarks included in 8818 indicata 2x performance increase. Actual numbers can be seen here #8818 (comment)

@gortiz gortiz changed the title Optimize like to regexp conversion to do not include useless ^.* and .*$ Optimize like to regexp conversion to do not include unnecessay ^.* and .*$ Jun 15, 2022
@gortiz gortiz changed the title Optimize like to regexp conversion to do not include unnecessay ^.* and .*$ Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ Jun 15, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 15, 2022

Codecov Report

Merging #8893 (74fd8d3) into master (8e7ca65) will decrease coverage by 34.65%.
The diff coverage is 48.48%.

❗ Current head 74fd8d3 differs from pull request most recent head 88f2564. Consider uploading reports for the commit 88f2564 to get more accurate results

@@              Coverage Diff              @@
##             master    #8893       +/-   ##
=============================================
- Coverage     70.09%   35.44%   -34.66%     
+ Complexity     4965      184     -4781     
=============================================
  Files          1831     1831               
  Lines         96270    96335       +65     
  Branches      14390    14403       +13     
=============================================
- Hits          67483    34148    -33335     
- Misses        24135    59422    +35287     
+ Partials       4652     2765     -1887     
Flag Coverage Δ
integration1 26.55% <34.84%> (-0.01%) ⬇️
integration2 ?
unittests1 ?
unittests2 15.37% <13.63%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rg/apache/pinot/common/lineage/SegmentLineage.java 77.77% <0.00%> (-17.88%) ⬇️
...spi/utils/builder/ControllerRequestURLBuilder.java 0.00% <0.00%> (ø)
...inot/common/utils/RegexpPatternConverterUtils.java 44.82% <53.48%> (-42.68%) ⬇️
...ler/api/resources/PinotSegmentRestletResource.java 24.90% <61.53%> (-0.10%) ⬇️
...ntroller/helix/core/PinotHelixResourceManager.java 65.90% <100.00%> (-1.22%) ⬇️
.../java/org/apache/pinot/spi/utils/BooleanUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/NoOpRecording.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/table/FSTType.java 0.00% <0.00%> (-100.00%) ⬇️
...ava/org/apache/pinot/spi/config/user/RoleType.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1074 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e7ca65...88f2564. Read the comment docs.

*/
public static String likeToRegexpLike(String likePattern) {
return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using a StringBuilder for the intermediate operations and checks here and then turn it into a string at the end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Ideally we should construct the string once to reduce garbage

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about that. This code is not in the hotpath. It just called once when FilterContext is created and that's all. Therefore the amount of memory generated should be negligible.

Anyway, the change is easy to apply, so I'm going to do so.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nice simple optimization!

*/
public static String likeToRegexpLike(String likePattern) {
return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. Ideally we should construct the string once to reduce garbage

return "^$";
case 1:
if (likePattern.charAt(0) == '%') {
return "^.*$";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be ".*" or just ""?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three are correct. I guess the later is the most efficient

}
break;
default:
if (likePattern.charAt(0) == '%') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"%%" becomes "", which should be fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we plan to optimize something similar to

LIKE '%%%%%%%%%%%%%zz'
REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz')

listed in this blog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we plan to optimize something similar to

I don't think this is the place to do that because we don't want to just optimize LIKE '%%%%%%%%%%%%%zz', we also want to optimize REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz').

I mean:

  1. we transform LIKE expressions into REGEXP_LIKE
  2. we let users to write their own REGEXP_LIKE expressions
  3. we know some regex in REGEXP_LIKE are dangerous

We should not focus on making 1. safe, we should focus in making 3. safe. Otherwise an attacker may not be able to use LIKE to create an attack but they could use REGEXP_LIKE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"%%" becomes "", which should be fine?

% means matches any string with zero or more characters so LIKE('%%') will match with any text and therefore it should be equivalent to REGEXP_LIKE('')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the explanation. it just seems ok to recursively prune out leading/trailing % e.g. instead of
start = 1; we can do start equal to first non % char.

@gortiz gortiz force-pushed the remove-like-to-regex-pre-and-suffix branch from 683df63 to 4ed2bee Compare July 5, 2022 08:09
@gortiz gortiz requested review from Jackie-Jiang and walterddr July 14, 2022 11:02
@walterddr walterddr merged commit 99e7948 into apache:master Jul 14, 2022
yuanbenson pushed a commit to yuanbenson/pinot that referenced this pull request Jul 20, 2022
…and .*$ (apache#8893)

* Optimize like to regexp conversion to do not include useless ^.* and .*$

* Optimize likeToRegexpLike to reduce allocations

* LIKE '%' will be transformed into REGEXP_LIKE('')

* Add an optimization in case of leading/trailing consecutive wildcards in LIKE expressions

* Fix an error in some specific expressions detected by integration test. Add a unit test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants