Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns #14198

shauryachats · 2024-10-09T22:09:10Z

The current logical planner has a shortcoming for NOT IN subqueries where it projects unused columns before the exchange stage, resulting in increased data during shuffles and significantly increasing latency.

An example would be:

EXPLAIN PLAN FOR 
SELECT count(*) FROM a 
WHERE a.col1 = 'foo' and a.col2 = 'bar' AND a.col4 = 12 AND a.col5 = false 
AND a.col3 NOT IN (SELECT b.col3 FROM b where b.col3 = 'baz') 
AND a.col3 NOT IN (SELECT b.col3 FROM b where b.col3 = 'qux'))

The execution plan is:

Execution Plan
PinotLogicalAggregate(group=[{}], agg#0=[COUNT($0)])
  PinotLogicalExchange(distribution=[hash])
    PinotLogicalAggregate(group=[{}], agg#0=[COUNT()])
      LogicalFilter(condition=[IS NOT TRUE($8)])
        LogicalJoin(condition=[=($6, $7)], joinType=[left])
          PinotLogicalExchange(distribution=[hash[6]])
            LogicalProject(col1=[$0], col2=[$1], col4=[$3], col5=[$4], col30=[$5], $f1=[$7], col32=[$2])
              LogicalFilter(condition=[IS NOT TRUE($7)])
                LogicalJoin(condition=[=($5, $6)], joinType=[left])
                  PinotLogicalExchange(distribution=[hash[5]])
                    LogicalProject(col1=[$0], col2=[$1], col3=[$2], col4=[$3], col5=[$4], col30=[$2])
                      LogicalFilter(condition=[AND(=($0, _UTF-8'foo'), =($1, _UTF-8'bar'), =($3, 12), NOT($4))])
                        LogicalTableScan(table=[[default, a]])
                  PinotLogicalExchange(distribution=[hash[0]])
                    PinotLogicalAggregate(group=[{0}], agg#0=[MIN($1)])
                      PinotLogicalExchange(distribution=[hash[0]])
                        PinotLogicalAggregate(group=[{0}], agg#0=[MIN($1)])
                          LogicalProject(col3=[$2], $f1=[true])
                            LogicalFilter(condition=[=($2, CAST(_UTF-8'baz'):INTEGER NOT NULL)])
                              LogicalTableScan(table=[[default, b]])
          PinotLogicalExchange(distribution=[hash[0]])
            PinotLogicalAggregate(group=[{0}], agg#0=[MIN($1)])
              PinotLogicalExchange(distribution=[hash[0]])
                PinotLogicalAggregate(group=[{0}], agg#0=[MIN($1)])
                  LogicalProject(col3=[$2], $f1=[true])
                    LogicalFilter(condition=[=($2, CAST(_UTF-8'qux'):INTEGER NOT NULL)])
                      LogicalTableScan(table=[[default, b]])

It can be observed from the execution plan

                 PinotLogicalExchange(distribution=[hash[5]])
                    LogicalProject(col1=[$0], col2=[$1], col3=[$2], col4=[$3], col5=[$4], col30=[$2])

that while the join is only on a.col3 and the final result required is a count(*), other columns (col1, col2, col4, col5) are present in the LogicalProject step right before PinotLogicalExchange which should ideally have been pruned.

The cause of the issue is the LogicalFilter which does not allow projects to be transposed, since the current rules do not have PROJECT_FILTER_TRANSPOSE included. Even if PROJECT_FILTER_TRANSPOSE is included in BASIC_RULES, it would not work for multiple joins since BASIC_RULES optimizations are run only once.

This PR aims to fix this issue by running Project pushdown (transpose and merge operations) as a RuleCollection after the first iteration of Filter pushdown rules are run to ensure Project can be pushed down as much as possible.
After the Project pushdown is run, the Filter pushdown is run again to push down filters after the project pushdown for two reasons:

In a generic query, filter pushdowns are generally more selective than project pushdowns and should be given more priority.
Major changes to the current logical plans should be avoided, and not running a Filter pushdown after Project pushdown results in a Project pushdown after each logical plan node which increases verbosity.

…of unused columns

codecov-commenter · 2024-10-09T22:45:50Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 63.87%. Comparing base (59551e4) to head (c4b435c).
Report is 2306 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14198      +/-   ##
============================================
+ Coverage     61.75%   63.87%   +2.12%     
- Complexity      207     1535    +1328     
============================================
  Files          2436     2622     +186     
  Lines        133233   144399   +11166     
  Branches      20636    22103    +1467     
============================================
+ Hits          82274    92235    +9961     
- Misses        44911    45365     +454     
- Partials       6048     6799     +751

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.85% <100.00%> (+2.14%)`	⬆️
java-21	`63.70% <100.00%> (+2.07%)`	⬆️
skip-bytebuffers-false	`63.86% <100.00%> (+2.11%)`	⬆️
skip-bytebuffers-true	`63.67% <100.00%> (+35.94%)`	⬆️
temurin	`63.87% <100.00%> (+2.12%)`	⬆️
unittests	`63.87% <100.00%> (+2.12%)`	⬆️
unittests1	`55.50% <100.00%> (+8.61%)`	⬆️
unittests2	`34.40% <100.00%> (+6.67%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Jackie-Jiang · 2024-10-11T23:34:06Z

These rules are repeated in multiple places, and rule sets are executed multiple times, introducing extra overhead and currently very hard to manage. Do you see a better way to re-organize them so that they can cover more general scenarios?

cc @gortiz @yashmayya to also take a look

shauryachats · 2024-10-15T22:01:19Z

@Jackie-Jiang I tried various permutations of reorganizing these rules which could minimize the projection of unused columns and reduce the modification of the number of existing logical plans which this change should not impact.

One alternate way is to remove the last filter pushdown to reduce overhead but it causes many extraneous LogicalProject nodes between each node in the plan, which breaks a lot of existing logical plans (around 110 out of the 400 plans in the tests).

We have a separate workstream to improve query optimization and we can attempt to tackle it subsequently: #14243

cc: @ankitsultana

Jackie-Jiang · 2024-10-17T23:08:23Z

Basically filter pushdown rules can impact project pushdown rules and vise versa. Should we consider putting them together as one rule collection? Will it potentially cause infinite loop?

shauryachats · 2024-10-17T23:24:33Z

Yes, I tested it out and it does cause an infinite loop since PROJECT_FILTER_TRANSPOSE and FILTER_PROJECT_TRANSPOSE in the same HepInstruction results in the Project and Filter being transposed one after another infinitely.

ankitsultana · 2024-10-22T22:15:22Z

pinot-query-planner/src/test/resources/queries/JoinPlans.json

-          "\n                      LogicalProject(col3=[$2], $f1=[true])",
-          "\n                        LogicalFilter(condition=[=($0, _UTF-8'bar')])",
-          "\n                          LogicalTableScan(table=[[default, b]])",
+          "\n                LogicalProject(col3=[$0], $f1=[$1])",


Interesting, I think this project is redundant? Something to note regardless as we work on a proper long term solution. Same in the other parts of this plan.

Agreed, we can address this subsequently.

ankitsultana · 2024-10-22T22:17:36Z

pinot-query-planner/src/test/resources/queries/WindowFunctionPlans.json

        "output": [
          "Execution Plan",
-          "\nLogicalProject(col1=[$0], $1=[$3])",
+          "\nLogicalProject(col1=[$0], w0$o0=[$3])",


@shauryachats : can we also verify that this window query works after this change? (as in it runs successfully)

Yes, verified that window queries work in Quickstart.

Enabling LogicalProject pushdown optimizations to eliminate exchange …

c4b435c

…of unused columns

shauryachats changed the title ~~Enabling LogicalProject pushdown optimizations to eliminate exchange …~~ Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns Oct 9, 2024

ankitsultana requested a review from Jackie-Jiang October 10, 2024 19:10

Jackie-Jiang added enhancement multi-stage Related to the multi-stage query engine labels Oct 11, 2024

shauryachats mentioned this pull request Oct 15, 2024

Define Concrete Rule Order in Multistage Query Optimization #14243

Open

ankitsultana reviewed Oct 22, 2024

View reviewed changes

ankitsultana approved these changes Oct 22, 2024

View reviewed changes

Jackie-Jiang merged commit 281478e into apache:master Oct 22, 2024

shauryachats deleted the project_plan branch October 23, 2024 01:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns #14198

Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns #14198

Uh oh!

shauryachats commented Oct 9, 2024 •

edited

Loading

Uh oh!

codecov-commenter commented Oct 9, 2024 •

edited

Loading

Uh oh!

Jackie-Jiang commented Oct 11, 2024

Uh oh!

shauryachats commented Oct 15, 2024 •

edited

Loading

Uh oh!

Jackie-Jiang commented Oct 17, 2024

Uh oh!

shauryachats commented Oct 17, 2024

Uh oh!

ankitsultana Oct 22, 2024

Uh oh!

shauryachats Oct 22, 2024

Uh oh!

ankitsultana Oct 22, 2024

Uh oh!

shauryachats Oct 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns #14198

Enabling LogicalProject pushdown optimizations to eliminate exchange of unused columns #14198

Uh oh!

Conversation

shauryachats commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Oct 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Jackie-Jiang commented Oct 11, 2024

Uh oh!

shauryachats commented Oct 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jackie-Jiang commented Oct 17, 2024

Uh oh!

shauryachats commented Oct 17, 2024

Uh oh!

ankitsultana Oct 22, 2024

Choose a reason for hiding this comment

Uh oh!

shauryachats Oct 22, 2024

Choose a reason for hiding this comment

Uh oh!

ankitsultana Oct 22, 2024

Choose a reason for hiding this comment

Uh oh!

shauryachats Oct 22, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shauryachats commented Oct 9, 2024 •

edited

Loading

codecov-commenter commented Oct 9, 2024 •

edited

Loading

shauryachats commented Oct 15, 2024 •

edited

Loading