Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ #8893

gortiz · 2022-06-15T08:45:20Z

RegexpPatternConverterUtils.likeToRegexpLike is used to translate LIKE expressions to regexp_match. The transformation is trivial, but there are some very easy optimizations that can be done.

For example, because regexp_match has Java's Matcher.find semantics, it does not try to match the expression against the whole string; it looks for a substring that matches the expression. Therefore, the prefix ^.* and the suffix .*$ are redundant. This PR removes them.
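A minimal sketch of the find semantics described above, using only `java.util.regex`. The class name `FindSemanticsDemo`, the helper `finds`, and the sample strings are illustrative assumptions, not code from this PR.

```java
import java.util.regex.Pattern;

class FindSemanticsDemo {
  // True if the regex matches some substring of the input (Matcher.find semantics).
  static boolean finds(String regex, String input) {
    return Pattern.compile(regex).matcher(input).find();
  }

  public static void main(String[] args) {
    String input = "hello world";
    // LIKE '%lo wo%' translated with the redundant prefix and suffix...
    System.out.println(finds("^.*lo wo.*$", input));
    // ...and without them: find() already searches for a matching substring.
    System.out.println(finds("lo wo", input));
    // Anchors still matter when the LIKE pattern has no leading wildcard,
    // e.g. LIKE 'world%' must not match "hello world".
    System.out.println(finds("^world", input));
  }
}
```

The first two calls should agree on single-line inputs like this one, which is why dropping the prefix and suffix does not change the result.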

The PR doesn't include benchmarks that prove the optimization is useful, but the improvement was discovered in #8818. The commit in this PR has been cherry-picked from there, and the benchmarks included in #8818 indicate a 2x performance increase. Actual numbers can be seen here: #8818 (comment)

codecov-commenter · 2022-06-15T09:40:27Z

Codecov Report

Merging #8893 (74fd8d3) into master (8e7ca65) will decrease coverage by 34.65%.
The diff coverage is 48.48%.

❗ Current head 74fd8d3 differs from pull request most recent head 88f2564. Consider uploading reports for the commit 88f2564 to get more accurate results

@@              Coverage Diff              @@
##             master    #8893       +/-   ##
=============================================
- Coverage     70.09%   35.44%   -34.66%     
+ Complexity     4965      184     -4781     
=============================================
  Files          1831     1831               
  Lines         96270    96335       +65     
  Branches      14390    14403       +13     
=============================================
- Hits          67483    34148    -33335     
- Misses        24135    59422    +35287     
+ Partials       4652     2765     -1887

| Flag | Coverage Δ | |
|---|---|---|
| integration1 | `26.55% <34.84%> (-0.01%)` | ⬇️ |
| integration2 | `?` | |
| unittests1 | `?` | |
| unittests2 | `15.37% <13.63%> (-0.01%)` | ⬇️ |


| Impacted Files | Coverage Δ | |
|---|---|---|
| ...rg/apache/pinot/common/lineage/SegmentLineage.java | `77.77% <0.00%> (-17.88%)` | ⬇️ |
| ...spi/utils/builder/ControllerRequestURLBuilder.java | `0.00% <0.00%> (ø)` | |
| ...inot/common/utils/RegexpPatternConverterUtils.java | `44.82% <53.48%> (-42.68%)` | ⬇️ |
| ...ler/api/resources/PinotSegmentRestletResource.java | `24.90% <61.53%> (-0.10%)` | ⬇️ |
| ...ntroller/helix/core/PinotHelixResourceManager.java | `65.90% <100.00%> (-1.22%)` | ⬇️ |
| .../java/org/apache/pinot/spi/utils/BooleanUtils.java | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| ...java/org/apache/pinot/spi/trace/BaseRecording.java | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| ...java/org/apache/pinot/spi/trace/NoOpRecording.java | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| ...ava/org/apache/pinot/spi/config/table/FSTType.java | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| ...ava/org/apache/pinot/spi/config/user/RoleType.java | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| ... and 1074 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e7ca65...88f2564. Read the comment docs.

richardstartin · 2022-06-15T09:57:58Z

pinot-common/src/main/java/org/apache/pinot/common/utils/RegexpPatternConverterUtils.java

   */
  public static String likeToRegexpLike(String likePattern) {
-    return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
+    String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";


I suggest using a StringBuilder for the intermediate operations and checks here and then turn it into a string at the end

+1. Ideally we should construct the string once to reduce garbage

I'm not sure about that. This code is not in the hot path. It is called just once, when the FilterContext is created, and that's all. Therefore the amount of memory allocated should be negligible.

Anyway, the change is easy to apply, so I'm going to do so.
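A rough sketch of what the StringBuilder suggestion could look like: the pattern is built in a single pass instead of through chained `replace` calls that each allocate an intermediate string. The class name `LikeToRegexSketch` and the `META` character list are assumptions made for illustration; the real `escapeMetaCharacters` helper may escape a different set, and this sketch does not handle escaped wildcards.

```java
class LikeToRegexSketch {
  // Regex metacharacters escaped by this sketch (assumed list, not taken from the PR).
  private static final String META = "\\^$.|?*+()[]{}";

  // Converts a LIKE pattern into an anchored regex using one StringBuilder.
  static String likeToRegexpLike(String likePattern) {
    StringBuilder sb = new StringBuilder(likePattern.length() + 2);
    sb.append('^');
    for (int i = 0; i < likePattern.length(); i++) {
      char c = likePattern.charAt(i);
      if (c == '%') {
        sb.append(".*");          // % matches zero or more characters
      } else if (c == '_') {
        sb.append('.');           // _ matches exactly one character
      } else {
        if (META.indexOf(c) >= 0) {
          sb.append('\\');        // escape regex metacharacters
        }
        sb.append(c);
      }
    }
    sb.append('$');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(likeToRegexpLike("a%b_c"));
    System.out.println(likeToRegexpLike("1.5%"));
  }
}
```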

Jackie-Jiang

LGTM. Nice simple optimization!

Jackie-Jiang · 2022-06-15T17:00:00Z

pinot-common/src/main/java/org/apache/pinot/common/utils/RegexpPatternConverterUtils.java

   */
  public static String likeToRegexpLike(String likePattern) {
-    return "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";
+    String converted = "^" + escapeMetaCharacters(likePattern).replace('_', '.').replace("%", ".*") + "$";


+1. Ideally we should construct the string once to reduce garbage


Jackie-Jiang · 2022-06-16T18:51:29Z

pinot-common/src/main/java/org/apache/pinot/common/utils/RegexpPatternConverterUtils.java

+        return "^$";
+      case 1:
+        if (likePattern.charAt(0) == '%') {
+          return "^.*$";


Should this be ".*" or just ""?

All three are correct. I guess the last one is the most efficient.
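To illustrate the point, here is a small check, assuming find semantics and single-line inputs, that the three candidate translations of LIKE '%' accept the same strings. The class name `MatchAllDemo` and the sample inputs are made up for this example.

```java
import java.util.regex.Pattern;

class MatchAllDemo {
  // True if the regex matches some substring of the input (Matcher.find semantics).
  static boolean finds(String regex, String input) {
    return Pattern.compile(regex).matcher(input).find();
  }

  public static void main(String[] args) {
    String[] regexes = {"^.*$", ".*", ""};
    String[] inputs = {"", "a", "hello world"};
    for (String regex : regexes) {
      for (String input : inputs) {
        // Every combination is expected to report a match.
        System.out.println("'" + regex + "' on '" + input + "': " + finds(regex, input));
      }
    }
  }
}
```

The inputs here contain no line terminators. With Java's default flags, `.` and `$` treat line terminators specially, so the empty pattern is also the form with the fewest edge cases to reason about.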

Jackie-Jiang · 2022-06-16T18:59:10Z

pinot-common/src/main/java/org/apache/pinot/common/utils/RegexpPatternConverterUtils.java

+        }
+        break;
+      default:
+        if (likePattern.charAt(0) == '%') {


"%%" becomes "", which should be fine?

do we plan to optimize something similar to

LIKE '%%%%%%%%%%%%%zz' REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz')

listed in this blog

> do we plan to optimize something similar to

I don't think this is the place to do that, because we don't want to optimize just LIKE '%%%%%%%%%%%%%zz'; we also want to optimize REGEXP_LIKE(col, '((((((.*)*)*)*)*)*)*zz').

I mean:

1. we transform LIKE expressions into REGEXP_LIKE
2. we let users write their own REGEXP_LIKE expressions
3. we know some regexes in REGEXP_LIKE are dangerous

We should not focus on making 1. safe; we should focus on making 3. safe. Otherwise an attacker may not be able to use LIKE to create an attack, but they could still use REGEXP_LIKE.

> "%%" becomes "", which should be fine?

% matches any string of zero or more characters, so LIKE('%%') matches any text and should therefore be equivalent to REGEXP_LIKE('').

Thanks for the explanation. It just seems OK to recursively prune leading/trailing % characters, e.g. instead of start = 1; we can set start to the index of the first non-% char.
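The pruning idea can be sketched roughly as follows. This is not the code merged in the PR: the class name `WildcardTrimSketch`, the method name `convert`, and the `META` list are assumptions, and the sketch deliberately ignores escaped wildcards such as `\%`.

```java
class WildcardTrimSketch {
  // Regex metacharacters escaped by this sketch (assumed list).
  private static final String META = "\\^$.|?*+()[]{}";

  // Converts a LIKE pattern to a regex for find() semantics, skipping runs of
  // leading/trailing '%' and anchoring only the sides that had no wildcard.
  static String convert(String likePattern) {
    int len = likePattern.length();
    if (len == 0) {
      return "^$";               // LIKE '' matches only the empty string
    }
    int start = 0;
    int end = len;
    while (start < end && likePattern.charAt(start) == '%') {
      start++;                   // skip every leading wildcard
    }
    while (end > start && likePattern.charAt(end - 1) == '%') {
      end--;                     // skip every trailing wildcard
    }
    if (start == end) {
      return "";                 // only wildcards: matches anything
    }
    StringBuilder sb = new StringBuilder(end - start + 2);
    if (start == 0) {
      sb.append('^');            // no leading wildcard: anchor at the start
    }
    for (int i = start; i < end; i++) {
      char c = likePattern.charAt(i);
      if (c == '%') {
        sb.append(".*");
      } else if (c == '_') {
        sb.append('.');
      } else {
        if (META.indexOf(c) >= 0) {
          sb.append('\\');
        }
        sb.append(c);
      }
    }
    if (end == len) {
      sb.append('$');            // no trailing wildcard: anchor at the end
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(convert("%%%%%%%%%%%%%zz"));
    System.out.println(convert("a%b"));
  }
}
```

With this approach a pattern such as '%%%%%%%%%%%%%zz' collapses to a single anchored suffix search, regardless of how many leading wildcards it has.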


…and .*$ (apache#8893) * Optimize like to regexp conversion to do not include useless ^.* and .*$ * Optimize likeToRegexpLike to reduce allocations * LIKE '%' will be transformed into REGEXP_LIKE('') * Add an optimization in case of leading/trailing consecutive wildcards in LIKE expressions * Fix an error in some specific expressions detected by integration test. Add a unit test

Optimize like to regexp conversion to do not include useless ^.* and .*$

74739dd

gortiz changed the title ~~Optimize like to regexp conversion to do not include useless ^.* and .*$~~ Optimize like to regexp conversion to do not include unnecessay ^.* and .*$ Jun 15, 2022

gortiz changed the title ~~Optimize like to regexp conversion to do not include unnecessay ^.* and .*$~~ Optimize like to regexp conversion to do not include unnecessary ^.* and .*$ Jun 15, 2022

richardstartin reviewed Jun 15, 2022


Jackie-Jiang approved these changes Jun 15, 2022


gortiz added 2 commits June 16, 2022 12:25

Optimize likeToRegexpLike to reduce allocations

cd7222b

Merge remote-tracking branch 'origin' into remove-like-to-regex-pre-and-suffix

f517411

Jackie-Jiang reviewed Jun 16, 2022


gortiz added 2 commits July 5, 2022 08:54

Merge branch 'master' into remove-like-to-regex-pre-and-suffix

784c526

LIKE '%' will be transformed into REGEXP_LIKE('')

4ed2bee

gortiz force-pushed the remove-like-to-regex-pre-and-suffix branch from 683df63 to 4ed2bee July 5, 2022 08:09

gortiz added 3 commits July 8, 2022 12:07

Add an optimization in case of leading/trailing consecutive wildcards in LIKE expressions

2ed99e8

Fix an error in some specific expressions detected by integration test. Add a unit test

bfe2315

Merge remote-tracking branch 'origin/master' into remove-like-to-regex-pre-and-suffix

88f2564

gortiz requested review from Jackie-Jiang and walterddr July 14, 2022 11:02

walterddr approved these changes Jul 14, 2022


walterddr merged commit 99e7948 into apache:master Jul 14, 2022
