Support local replicated join and local exchange parallelism #14893

Jackie-Jiang · 2025-01-22T23:15:08Z

Related to #14518

Added a new table hint:

is_replicated (boolean)

Support local replicated join by configuring both sides with local distribution and hinting the right table as replicated:

SELECT /*+ joinOptions(left_distribution_type = 'local', right_distribution_type = 'local') */ a.col1, b.col2 FROM a JOIN b /*+ tableOptions(is_replicated='true') */ ON a.col1 = b.col1

Also support parallelism for local exchange, controlled by the table hint partition_parallelism, to increase the parallelism of the intermediate stage.
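For illustration, here is a minimal sketch of how a client might assemble a query that combines the hints described above. The class `LocalReplicatedJoinQuery` and its `build` method are hypothetical and not part of Pinot. Attaching `partition_parallelism` to the left table, and using it without any other partition hints, is an assumption; the description above does not say which table the hint belongs on.

```java
/** Hypothetical helper (not part of Pinot) that assembles a query using the hints from this PR. */
final class LocalReplicatedJoinQuery {
  private LocalReplicatedJoinQuery() {
  }

  /**
   * Builds a local replicated join of {@code left} and {@code right} on {@code joinColumn}.
   * Assumption: partition_parallelism is placed on the left table's tableOptions hint.
   */
  static String build(String left, String right, String joinColumn, int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be positive");
    }
    return "SELECT /*+ joinOptions(left_distribution_type = 'local', right_distribution_type = 'local') */ "
        + left + "." + joinColumn + ", " + right + "." + joinColumn
        + " FROM " + left + " /*+ tableOptions(partition_parallelism='" + parallelism + "') */"
        + " JOIN " + right + " /*+ tableOptions(is_replicated='true') */"
        + " ON " + left + "." + joinColumn + " = " + right + "." + joinColumn;
  }

  public static void main(String[] args) {
    // Prints the assembled query for tables a and b joined on col1.
    System.out.println(build("a", "b", "col1", 4));
  }
}
```

The join hint and the `is_replicated` table hint are copied from the example query above; only the placement of `partition_parallelism` is a guess.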

codecov-commenter · 2025-01-22T23:58:21Z

Codecov Report

Attention: Patch coverage is 79.91632% with 48 lines in your changes missing coverage. Please review.

Project coverage is 63.70%. Comparing base (59551e4) to head (57c1b80).
Report is 1640 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/apache/pinot/query/routing/WorkerManager.java | 75.52% | 23 Missing and 12 partials ⚠️ |
| ...che/pinot/broker/routing/BrokerRoutingManager.java | 0.00% | 9 Missing ⚠️ |
| ...ery/planner/physical/MailboxAssignmentVisitor.java | 93.22% | 0 Missing and 4 partials ⚠️ |

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14893      +/-   ##
============================================
+ Coverage     61.75%   63.70%   +1.95%     
- Complexity      207     1470    +1263     
============================================
  Files          2436     2713     +277     
  Lines        133233   151804   +18571     
  Branches      20636    23440    +2804     
============================================
+ Hits          82274    96712   +14438     
- Misses        44911    47826    +2915     
- Partials       6048     7266    +1218

| Flag | Coverage Δ | |
|---|---|---|
| custom-integration1 | `100.00% <ø> (+99.99%)` | ⬆️ |
| integration | `100.00% <ø> (+99.99%)` | ⬆️ |
| integration1 | `100.00% <ø> (+99.99%)` | ⬆️ |
| integration2 | `0.00% <ø> (ø)` | |
| java-11 | `63.69% <79.91%> (+1.98%)` | ⬆️ |
| java-21 | `63.60% <79.91%> (+1.97%)` | ⬆️ |
| skip-bytebuffers-false | `63.70% <79.91%> (+1.95%)` | ⬆️ |
| skip-bytebuffers-true | `63.58% <79.91%> (+35.85%)` | ⬆️ |
| temurin | `63.70% <79.91%> (+1.95%)` | ⬆️ |
| unittests | `63.70% <79.91%> (+1.95%)` | ⬆️ |
| unittests1 | `56.27% <83.04%> (+9.38%)` | ⬆️ |
| unittests2 | `34.01% <0.00%> (+6.28%)` | ⬆️ |


gortiz · 2025-01-23T09:33:23Z

pinot-broker/src/main/java/org/apache/pinot/broker/routing/BrokerRoutingManager.java

+
+    List<String> getSegments(BrokerRequest brokerRequest) {
+      Set<String> selectedSegments = _segmentSelector.select(brokerRequest);
+      if (!selectedSegments.isEmpty()) {


nit: isn't this if a bit redundant?

We want to short circuit it. This is the same as calculateRouting()
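To make the short-circuit point concrete, here is a minimal sketch of the pattern under discussion. It is not the actual `BrokerRoutingManager` code: `SegmentLookupSketch` is a hypothetical class, and the `pruner` parameter stands in for whatever work follows segment selection. That later work is an assumption, since the diff hunk above is truncated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch of short-circuiting on an empty selection; not Pinot's implementation. */
final class SegmentLookupSketch {
  private SegmentLookupSketch() {
  }

  /**
   * Returns the segments that survive pruning. When nothing was selected, the pruner is
   * never invoked, which is the short circuit the emptiness check buys.
   */
  static List<String> getSegments(Set<String> selectedSegments, UnaryOperator<Set<String>> pruner) {
    if (selectedSegments.isEmpty()) {
      // Short circuit: skip the (assumed) pruning work entirely.
      return Collections.emptyList();
    }
    return new ArrayList<>(pruner.apply(selectedSegments));
  }
}
```

The check does not change the result for an empty selection; it only avoids running the later steps on it.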

gortiz · 2025-01-23T09:41:10Z

pinot-core/src/main/java/org/apache/pinot/core/routing/RoutingManager.java

  RoutingTable getRoutingTable(BrokerRequest brokerRequest, long requestId);

+  /**
+   * Returns the segments that are relevant for the given broker request.


Let's specify here what null means.

gortiz

I would need more time to review the code and, ideally, some explanation of the decisions you made here. The changes look to me more complex than I would have expected. We are deviating from the standard Calcite semantics here (i.e., with singleton + parallelism), and I'm not sure why we need to do that.

What I would expect in this situation is that the join node uses the broadcast distribution for its right side (meaning that each incarnation of the join will see all the data). The main difference with the regular broadcast is that instead of picking one server per segment and broadcasting from them, we pick all servers that will execute the left side and read from them, sending the information to its own node.

gortiz · 2025-01-23T09:56:47Z

...t-query-planner/src/main/java/org/apache/pinot/calcite/rel/logical/PinotLogicalExchange.java


  private PinotLogicalExchange(RelOptCluster cluster, RelTraitSet traitSet, RelNode input, RelDistribution distribution,
-      PinotRelExchangeType exchangeType) {
+      PinotRelExchangeType exchangeType, List<Integer> keys) {


In which cases will the keys differ from distribution.getKeys? And what does it mean to have keys = {X, Y, Z} with a distribution like random, which doesn't support keys? Wouldn't it be better to change the distribution value depending on the keys? If we want to use a distribution + keys combination that Calcite doesn't permit, we can create our own implementation of RelDistribution.

Let me add more comments explaining this. We use SINGLETON to represent local exchange, but we also want to support parallelism for local exchange, where keys are needed. We can revisit this as we add more custom distribution types.

gortiz · 2025-01-23T10:10:38Z

...lanner/src/main/java/org/apache/pinot/calcite/rel/rules/PinotJoinExchangeNodeInsertRule.java

+        // NOTE: We use SINGLETON to represent local distribution. Add keys to the exchange because we might want to
+        //       switch it to HASH distribution to increase parallelism. See MailboxAssignmentVisitor for details.


Reading RelDistribution.Type, shouldn't this be broadcast? The definition of broadcast is:

There are multiple instances of the stream, and all records appear in each instance

While the definition of singleton is:

There is only one instance of the stream. It sees all records.

BTW, I don't get why we set DistributionType as SINGLETON in cases where we want to use HASH.

This is not broadcast because we don't want to send data to other servers. This is not strictly SINGLETON if we want to add parallelism to local exchange (split one block into multiple and spread them into multiple operators). If there is no extra parallelism (1-to-1 distribution), then it is SINGLETON.
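The "split one block into multiple and spread them into multiple operators" idea can be sketched as follows. This is an illustration only, not the code in `MailboxAssignmentVisitor` or Pinot's exchange operators: `LocalExchangeSketch`, the row representation as `Object[]`, and the hash function are all assumptions. It shows why keys are carried on the exchange: with parallelism 1 the exchange is a plain 1-to-1 pass-through, and with higher parallelism rows are routed to local operators by hashing the key columns, so rows with the same key stay together without leaving the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

/** Hypothetical sketch of a local exchange that fans one block out to several local operators. */
final class LocalExchangeSketch {
  private LocalExchangeSketch() {
  }

  /**
   * Splits the rows of one block into {@code parallelism} partitions by hashing the key columns.
   * Every partition stays on the same server; only the number of downstream operators changes.
   */
  static List<List<Object[]>> split(List<Object[]> block, int[] keyIndexes, int parallelism) {
    if (parallelism < 1) {
      throw new IllegalArgumentException("parallelism must be positive");
    }
    List<List<Object[]>> partitions = new ArrayList<>(parallelism);
    for (int i = 0; i < parallelism; i++) {
      partitions.add(new ArrayList<>());
    }
    for (Object[] row : block) {
      int hash = 0;
      for (int keyIndex : keyIndexes) {
        hash = 31 * hash + Objects.hashCode(row[keyIndex]);
      }
      // floorMod keeps the partition index non-negative for negative hashes.
      partitions.get(Math.floorMod(hash, parallelism)).add(row);
    }
    return partitions;
  }
}
```

With `parallelism = 1` every row lands in the single partition, which matches the SINGLETON (1-to-1) case described above.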

gortiz · 2025-01-23T10:12:55Z

...-planner/src/main/java/org/apache/pinot/query/planner/physical/MailboxAssignmentVisitor.java

    if (node instanceof MailboxSendNode) {
      MailboxSendNode sendNode = (MailboxSendNode) node;
-      int senderStageId = sendNode.getStageId();
+      Integer senderStageId = sendNode.getStageId();


Why Integer? BaseNode.getStageId() always returns an int, right?

Correct, but using Integer can avoid a lot of boxing. I changed this to Integer to align with receiverStageId

Jackie-Jiang · 2025-01-24T03:43:45Z

> I would need more time to review the code and, ideally, some explanation of the decisions you made here. The changes look to me more complex than I would have expected. We are deviating from the standard Calcite semantics here (i.e., with singleton + parallelism), and I'm not sure why we need to do that.
>
> What I would expect in this situation is that the join node uses the broadcast distribution for its right side (meaning that each incarnation of the join will see all the data). The main difference with the regular broadcast is that instead of picking one server per segment and broadcasting from them, we pick all servers that will execute the left side and read from them, sending the information to its own node.

Broadcast is supported in #14797, but it can still involve data shuffling.
With this PR, we can eliminate data shuffling completely, and the right table is always served from the same server.
Regarding singleton + parallelism: this is needed to increase parallelism for the intermediate stage. With singleton (1-to-1 exchange), there will be the same number of intermediate operators as leaf operators, which is not enough in many cases. We usually run only one leaf operator per server, but we want to run more intermediate operators to fully utilize the CPU.

Jackie-Jiang · 2025-01-29T01:49:33Z

I'll merge it for now since it contains some important performance optimizations. We can revisit it later since there is no backward compatibility issue.


Jackie-Jiang added the feature, documentation, release-notes, and multi-stage labels Jan 22, 2025

Jackie-Jiang force-pushed the local_replicated branch from 845dfe8 to 4a26c4b Compare January 22, 2025 23:19

Jackie-Jiang force-pushed the local_replicated branch from 4a26c4b to ac02e60 Compare January 23, 2025 00:21

Jackie-Jiang mentioned this pull request Jan 23, 2025

Add more JOIN strategies #14518

Open

gortiz reviewed Jan 23, 2025


Jackie-Jiang force-pushed the local_replicated branch from ac02e60 to 96ec407 Compare January 23, 2025 23:27

Jackie-Jiang mentioned this pull request Jan 24, 2025

Allow using hint to force enable/disable colocated join #14912

Merged

Jackie-Jiang force-pushed the local_replicated branch from 96ec407 to f5095e3 Compare January 24, 2025 03:37

Jackie-Jiang requested a review from xiangfu0 January 24, 2025 03:44

Support local replicated join and local exchange parallelism

57c1b80

Jackie-Jiang force-pushed the local_replicated branch from f5095e3 to 57c1b80 Compare January 28, 2025 21:27

xiangfu0 approved these changes Jan 28, 2025


Jackie-Jiang added the performance label Jan 29, 2025

Jackie-Jiang merged commit 8a703e6 into apache:master Jan 29, 2025
21 checks passed

Jackie-Jiang deleted the local_replicated branch January 29, 2025 01:49

zeronerdzerogeekzerocool pushed a commit to zeronerdzerogeekzerocool/pinot that referenced this pull request Feb 20, 2025

Support local replicated join and local exchange parallelism (apache#…

e85b921

…14893)

Jackie-Jiang mentioned this pull request Mar 7, 2025

Fix LOOKUP join #15223

Merged
