
Conversation

@ankitsultana (Contributor) commented May 13, 2025

Summary

The leaf-stage worker assignment rule was iterating over the PartitionInfo._segments list, but that list is not complete: SegmentPartitionMetadataManager was skipping newly added segments when building TablePartitionInfo.

This PR fixes that by:

  1. Renaming the existing TablePartitionInfo to TablePartitionReplicatedServersInfo (TPRSI), to make it clear that the TPRSI POJO also tracks segment replication (i.e. segment/instance assignment).
  2. Removing the usage of TablePartitionReplicatedServersInfo from LeafStageWorkerAssignmentRule, relying only on the new TablePartitionInfo (TPI).

Big Picture

The bigger picture here is that the v2 query optimizer relies on TablePartitionInfo solely for quickly accessing the segments belonging to a given partition. Prior to this PR, the TablePartitionInfo used by leaf-stage worker assignment was also silently skipping new segments from being tracked in the TPI.

This was bad design, because the v2 query optimizer delegates the responsibility of segment selection and assignment to the Routing Manager.
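To make that intended split concrete, here is a minimal sketch of the delegation. The interface names (`PartitionMetadataView`, `RoutingView`) and `selectSegments` are hypothetical illustrations, not Pinot APIs; only `getSegmentsInPartition` mirrors the TablePartitionInfo accessor used in this PR:

```java
import java.util.*;

public class SegmentSelectionSketch {
  // Complete per-partition segment listing (the TablePartitionInfo role).
  interface PartitionMetadataView {
    List<String> getSegmentsInPartition(int partitionNum);
  }

  // Decides which segments are actually routable right now (the Routing Manager role).
  interface RoutingView {
    boolean isRoutable(String segment);
  }

  // The optimizer combines the two: take the complete view, then let routing filter it.
  static List<String> selectSegments(PartitionMetadataView partitions, RoutingView routing,
      int partitionNum) {
    List<String> selected = new ArrayList<>();
    for (String segment : partitions.getSegmentsInPartition(partitionNum)) {
      if (routing.isRoutable(segment)) {
        selected.add(segment);
      }
    }
    return selected;
  }
}
```

The point of the sketch is that exclusion of (say) new segments lives entirely behind the routing side, while the partition-metadata side stays a complete view.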

Test Plan

Added unit tests. We also have existing unit tests and E2E plan tests that verify plan generation.

@ankitsultana added the `multi-stage` (Related to the multi-stage query engine) and `mse-physical-optimizer` labels on May 13, 2025
```java
List<String> selectedSegments = new ArrayList<>();
if (info != null) {
  List<String> segmentsForPartition = tablePartitionInfo.getSegmentsInPartition(partitionNum);
  if (!segmentsForPartition.isEmpty()) {
```
@ankitsultana (Contributor, Author) May 13, 2025

note: Though this check is not required, I think it makes the code slightly easier to read

```diff
 public TablePartitionInfo(String tableNameWithType, String partitionColumn, String partitionFunctionName,
-    int numPartitions, PartitionInfo[] partitionInfoMap, List<String> segmentsWithInvalidPartition) {
+    int numPartitions, PartitionInfo[] partitionInfoMap, List<String> segmentsWithInvalidPartition,
+    Map<Integer, List<String>> excludedNewSegments) {
```
@wirybeaver (Contributor) May 13, 2025

nit: mark the existing constructor with @VisibleForTesting and add a new constructor that accepts excludedNewSegments. That way the existing TablePartitionInfo unit tests don't need to be modified.

@ankitsultana (Contributor, Author)

imo having multiple constructors is more error-prone in the long run. It's easy to call a constructor with fewer arguments, and developers then don't think about everything that's required to create an accurate version of the object.

@wirybeaver (Contributor) commented May 13, 2025

LGTM. My hunch is that the broker routing manager could insert the partition number into each segment when the Helix thread invokes the update method. That would simplify the workerAssignment code a lot and reduce the overhead in the query path.

Specifically, the core part of workerAssignment is to ensure that a partition is not assigned to multiple servers. If we have a Map<ServerInstance, List<SegmentWrapper>> where the SegmentWrapper contains the partition number, we only need to walk through the routing table once to validate that there is no intersection of partitions between servers.
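A rough sketch of that single-pass check, under simplifying assumptions: servers are represented as plain strings and `SegmentWrapper` is a stand-in for whatever wrapper type would carry the partition number. This is not WorkerManager's actual code:

```java
import java.util.*;

public class PartitionOverlapCheck {
  // Stand-in for a segment annotated with its partition number.
  static final class SegmentWrapper {
    final String segmentName;
    final int partition;

    SegmentWrapper(String segmentName, int partition) {
      this.segmentName = segmentName;
      this.partition = partition;
    }
  }

  // Single pass over the routing table: record the first server seen for each
  // partition, and flag any partition that shows up under a second server.
  static boolean partitionsDisjoint(Map<String, List<SegmentWrapper>> routingTable) {
    Map<Integer, String> partitionOwner = new HashMap<>();
    for (Map.Entry<String, List<SegmentWrapper>> entry : routingTable.entrySet()) {
      for (SegmentWrapper sw : entry.getValue()) {
        String prev = partitionOwner.putIfAbsent(sw.partition, entry.getKey());
        if (prev != null && !prev.equals(entry.getKey())) {
          return false; // same partition served by two different servers
        }
      }
    }
    return true;
  }
}
```

The walk is O(total segments) with one hash map, which is the overhead reduction the comment is getting at.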

@wirybeaver (Contributor) left a comment
there's a global unit test failure on the pinot-core module

@shauryachats (Collaborator) left a comment
LGTM

@Jackie-Jiang (Contributor) left a comment

Seems like you need to compute the partition info without maintaining fully replicated servers? If so, I'd suggest adding a new type of TablePartitionInfo that directly calculates the info needed for the new physical planner, e.g. all segments within the partition, segments with invalid partition, etc.
This way, we don't need to pay the overhead of maintaining info that isn't needed by both approaches.

@ankitsultana (Contributor, Author)

> Seems like you need to compute the partition info without maintaining fully replicated servers? If so, I'd suggest adding a new type of TablePartitionInfo and directly calculate the info needed for the new physical planner, e.g. all segments within the partition, segments with invalid partition etc. This way, we don't need to pay overhead to maintain info not needed for both approaches.

I don't follow, which overhead is concerning here? Note that this change simply tracks the excludedNewSegments in the TPI. These were being created temporarily anyway; all I am doing is passing them to TablePartitionInfo, since otherwise SegmentPartitionMetadataManager was silently skipping some segments from TablePartitionInfo.

From a semantics point of view I think TablePartitionInfo should track all segments that were processed by the SegmentPartitionMetadataManager.

@Jackie-Jiang (Contributor)

What I meant is that the new LeafStageWorkerAssignmentRule and WorkerManager require different info from partitioning:

  • LeafStageWorkerAssignmentRule needs to find all segments for each partition, and track segments without any partition
  • WorkerManager needs to find segments for each partition with fully replicated servers

Because they need different info, I'd suggest adding a new class similar to TablePartitionInfo (you may also rename the existing one to something like TablePartitionInfoWithReplicatedServers), then adding a new method to retrieve that info. They serve different purposes, so it's better to separate them.

@ankitsultana (Contributor, Author)

Isn't that overkill?

Requirements of LeafStageWorkerAssignmentRule and WorkerManager are different, but from an abstraction point of view we can say that SegmentPartitionMetadataManager manages a TablePartitionInfo which:

  • Provides an easy-to-use API to work with segment partitions.
  • Is a complete view, in that all processed segments will be part of the TablePartitionInfo POJO.

Maybe I am missing something, is there a deeper reason to create another POJO here?

Maintaining two separate POJOs with mostly the same information will be an anti-pattern in my view.

@itschrispeck (Contributor) left a comment
Looks good from my side

@Jackie-Jiang (Contributor)

> Isn't that an overkill?
>
> Requirements of LeafStageWorkerAssignmentRule and WorkerManager are different, but from an abstraction point of view we can say that SegmentPartitionMetadataManager manages TablePartitionInfo which:
>
>   • Provides an easy to use API to work with segment partitions.
>   • Is a complete view, in that all processed segments will be part of the TablePartitionInfo POJO.
>
> Maybe I am missing something, is there a deeper reason to create another POJO here?
>
> Maintaining two separate POJOs with mostly the same information will be an anti-pattern in my view.

The main difference between them is as follows:

  • WorkerManager is trying to find segments of the same partition served by the same server.
  • LeafStageWorkerAssignmentRule only needs to find segments of the same partition.

The reason WorkerManager treats newly added segments differently is that they might not be available on the same server, and we don't want to fail the query, so it excludes them. What you are doing in this PR is adding the excluded ones back.

Because the info needed by each is actually different, even though they share common properties (e.g. which partition a segment belongs to), it will be easier to maintain if we keep the logic separate. With the current approach, there is overhead for both use cases:

  • When called from WorkerManager, storing the excluded segments is overhead.
  • When called from SegmentPartitionMetadataManager, maintaining the fully replicated servers is overhead.

@ankitsultana (Contributor, Author)

On second thought, the current changes in this PR are not semantically correct. Like Jackie mentioned, I am only interested in getting the segments corresponding to a given partition from the Segment Partition Metadata Manager. Segment selection, including the exclusion of new segments, is delegated to the Routing Manager, as it ideally should be.

Given that, adding excludedNewSegments to the existing POJO isn't ideal, since that would let the exclusion logic be owned by both SegmentPartitionMetadataManager and the Routing Manager.

I'll update this soon.



```java
/**
 * Tracks segments by partition for a table. Also tracks the invalid partition segments.
```
@ankitsultana (Contributor, Author)

note: moved the existing TablePartitionInfo to TablePartitionReplicatedServersInfo. This one only tracks segments by partition and the invalid-partition segments.
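For illustration, the slimmed-down POJO described in this note might look roughly like the following. The field names and constructor shape here are assumptions; only `getSegmentsInPartition` matches the accessor used by LeafStageWorkerAssignmentRule in this PR:

```java
import java.util.*;

public class TablePartitionInfoSketch {
  private final String _tableNameWithType;
  private final List<String>[] _segmentsByPartition;
  private final List<String> _segmentsWithInvalidPartition;

  @SuppressWarnings("unchecked")
  public TablePartitionInfoSketch(String tableNameWithType, int numPartitions,
      Map<Integer, List<String>> segmentsByPartition, List<String> segmentsWithInvalidPartition) {
    _tableNameWithType = tableNameWithType;
    _segmentsByPartition = new List[numPartitions];
    for (int i = 0; i < numPartitions; i++) {
      // Every partition gets a (possibly empty) list, so callers never see null.
      _segmentsByPartition[i] = segmentsByPartition.getOrDefault(i, List.of());
    }
    _segmentsWithInvalidPartition = segmentsWithInvalidPartition;
  }

  public String getTableNameWithType() {
    return _tableNameWithType;
  }

  public List<String> getSegmentsInPartition(int partitionNum) {
    return _segmentsByPartition[partitionNum];
  }

  public List<String> getSegmentsWithInvalidPartition() {
    return _segmentsWithInvalidPartition;
  }
}
```

Note there is no replicated-servers state here at all; that now lives only in TablePartitionReplicatedServersInfo.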


```java
private void computeTablePartitionInfo() {

@Override
public synchronized void onAssignmentChange(IdealState idealState, ExternalView externalView,
```
@ankitsultana (Contributor, Author)
note: GitHub is not great at showing this, but I have simply moved the compute-table-partition-info methods towards the end. However, reviewers can double-check this claim.

@codecov-commenter commented May 29, 2025

Codecov Report

Attention: Patch coverage is 72.11538% with 29 lines in your changes missing coverage. Please review.

Project coverage is 63.33%. Comparing base (1a476de) to head (baf8f62).
Report is 161 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...mentpartition/SegmentPartitionMetadataManager.java | 74.13% | 11 Missing and 4 partials ⚠️ |
| .../org/apache/pinot/query/routing/WorkerManager.java | 52.94% | 7 Missing and 1 partial ⚠️ |
| ...che/pinot/broker/routing/BrokerRoutingManager.java | 0.00% | 5 Missing ⚠️ |
| ...e/routing/TablePartitionReplicatedServersInfo.java | 94.44% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff              @@
##             master   #15780      +/-   ##
============================================
+ Coverage     62.90%   63.33%   +0.43%
+ Complexity     1386     1354      -32
============================================
  Files          2867     2898      +31
  Lines        163354   166410    +3056
  Branches      24952    25453     +501
============================================
+ Hits         102755   105398    +2643
- Misses        52847    53042     +195
- Partials       7752     7970     +218
```
| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.30% <72.11%> (+0.43%) ⬆️ |
| java-21 | 63.29% <72.11%> (+0.47%) ⬆️ |
| skip-bytebuffers-false | ? |
| skip-bytebuffers-true | ? |
| temurin | 63.33% <72.11%> (+0.43%) ⬆️ |
| unittests | 63.33% <72.11%> (+0.43%) ⬆️ |
| unittests1 | 56.45% <78.04%> (+0.63%) ⬆️ |
| unittests2 | 33.34% <56.73%> (-0.23%) ⬇️ |

Flags with carried forward coverage won't be shown.

```java
  _segmentInfoMap.put(segment, segmentInfo);
}
computeTablePartitionInfo();
computeAllTablePartitionInfo();
```
Contributor:
Consider adding config to turn each of them on/off to reduce overhead. Can be done as a follow up

@ankitsultana ankitsultana merged commit de04b5d into apache:master May 29, 2025
18 checks passed