Skip to content

Conversation

@kmozaid
Copy link
Contributor

@kmozaid kmozaid commented Feb 28, 2022

Description

This PR adds support for multi column partitioning for minion tasks and broker segment pruner.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • No

Does this PR fix a zero-downtime upgrade introduced earlier?

  • No

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • No (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Segment can be partitioned on multiple columns when they are moved/created by minion tasks and when data is ingested via batch ingestion. Broker can also prune segments on multiple columns partition metadata.

@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch 3 times, most recently from 6e34564 to 15e04b3 Compare February 28, 2022 07:23
@codecov-commenter
Copy link

codecov-commenter commented Feb 28, 2022

Codecov Report

Merging #8255 (2c4e35e) into master (d999aec) will increase coverage by 0.01%.
The diff coverage is 77.31%.

@@             Coverage Diff              @@
##             master    #8255      +/-   ##
============================================
+ Coverage     70.78%   70.79%   +0.01%     
- Complexity     4243     4317      +74     
============================================
  Files          1631     1690      +59     
  Lines         85281    88583    +3302     
  Branches      12845    13429     +584     
============================================
+ Hits          60364    62714    +2350     
- Misses        20746    21478     +732     
- Partials       4171     4391     +220     
Flag Coverage Δ
integration1 27.25% <14.94%> (-1.54%) ⬇️
integration2 25.72% <1.03%> (-1.75%) ⬇️
unittests1 67.10% <100.00%> (+0.11%) ⬆️
unittests2 14.21% <74.74%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...on/tasks/mergerollup/MergeRollupTaskGenerator.java 84.85% <64.70%> (-1.04%) ⬇️
...mentpruner/MultiPartitionColumnsSegmentPruner.java 74.14% <74.14%> (ø)
...er/routing/segmentpruner/SegmentPrunerFactory.java 92.75% <100.00%> (+0.21%) ⬆️
...mentpruner/SinglePartitionColumnSegmentPruner.java 68.21% <100.00%> (ø)
...ache/pinot/plugin/minion/tasks/MergeTaskUtils.java 98.18% <100.00%> (+2.10%) ⬆️
...ocal/indexsegment/mutable/IntermediateSegment.java 76.51% <100.00%> (+6.88%) ⬆️
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <0.00%> (-80.00%) ⬇️
...ctionaryBasedSingleColumnDistinctOnlyExecutor.java 0.00% <0.00%> (-80.00%) ⬇️
...src/main/java/org/apache/pinot/client/Request.java 0.00% <0.00%> (-60.00%) ⬇️
...ntroller/helix/core/minion/CronJobScheduleJob.java 0.00% <0.00%> (-59.10%) ⬇️
... and 429 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d999aec...2c4e35e. Read the comment docs.

@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch 2 times, most recently from 36ff76b to c1b5a0f Compare February 28, 2022 09:45
@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch from c1b5a0f to e6c6adc Compare February 28, 2022 10:57
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add a queries test (search for *QueriesTest for examples) or integration test to ensure it works end-to-end?

@kmozaid
Copy link
Contributor Author

kmozaid commented Mar 15, 2022

Shall we add a queries test (search for *QueriesTest for examples) or integration test to ensure it works end-to-end?

I have added a test which prune segments based on query -
https://github.com/apache/pinot/pull/8255/files#diff-9e8214b130b1a4a54157230e84674df2328018e7e7044a90a1f2a114ad56674bR353

and Sure let me see if I can add an integration test.

@kmozaid kmozaid requested a review from Jackie-Jiang March 31, 2022 16:41
@kmozaid
Copy link
Contributor Author

kmozaid commented Mar 31, 2022

Hi @Jackie-Jiang Could you please check this PR if it can be merge now?

@Jackie-Jiang
Copy link
Contributor

@kmozaid This is a great feature, and I want to ensure it can work end-to-end. With the SegmentPrunerTest, it ensures the pruner itself works, but multiple partitioning could fail in other places (I remember some classes explicitly check if there is only one partition column). I'd suggest adding an integration test to ensure it works end-to-end. You may refer to NullHandlingIntegrationTest on how to add an integration test with custom data.

@kmozaid
Copy link
Contributor Author

kmozaid commented Apr 1, 2022

Thanks @Jackie-Jiang , Sure, Let me add integration test.

@kmozaid
Copy link
Contributor Author

kmozaid commented Apr 3, 2022

@kmozaid This is a great feature, and I want to ensure it can work end-to-end. With the SegmentPrunerTest, it ensures the pruner itself works, but multiple partitioning could fail in other places (I remember some classes explicitly check if there is only one partition column). I'd suggest adding an integration test to ensure it works end-to-end. You may refer to NullHandlingIntegrationTest on how to add an integration test with custom data.

Hi @Jackie-Jiang I have updated existing merge rollup minion task and realtime-to-offline minion task integration test which involves partitioning.Now these tests covered multi column partitioning flow. Please check if it is still required to write a separate integration tests.

@kmozaid
Copy link
Contributor Author

kmozaid commented Apr 3, 2022

There are three other places where multicolumn partitioning config is not allowed.

  1. PinotLLCRealtimeSegmentManager.getPartitionMetadataFromTableConfig:710
  2. LLRealtimeSegmentDataManager.setPartitionParameters:1425
  3. PinotHadoopJobLauncher:HadoopSegmentPreprocessingJob.fetchPartitioningConfig:153

This feature is NOT support by REALTIME tables so first 2 checks are still valid. The last one is specific to Hadoop pre processing job, I think we can skip for this feature as we don't really support BoundedColumnValue partitioner for hadoop;s job (https://github.com/apache/pinot/blob/master/pinot-plugins/pinot-batch-ingestion/v0_deprecated/pinot-hadoop/src/main/java/org/apache/pinot/hadoop/job/partitioners/PartitionFunctionFactory.java#L78). It is deprecated too.

@kmozaid
Copy link
Contributor Author

kmozaid commented Apr 4, 2022

Test Cases Covered -

  1. Integration tests for Realtime to Offline Task with multi column partitioning
  2. Integration tests for Merge Rollup task with multi column partitioning.
  3. Unit tests for direct offline segment creation from file with multi column partitioning (merge rollup task integration tests setup does this).

@kmozaid
Copy link
Contributor Author

kmozaid commented Apr 8, 2022

Hi @Jackie-Jiang , Could you please provide review feedback? please also trigger the workflow again.

@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch from 29af1e3 to d10f61b Compare April 19, 2022 10:32
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with some minor comments. Thanks for adding the tests!

_partitions = partitions;
}
}
public interface PartitionSegmentPruner extends SegmentPruner {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this file since this extra interface does not provide extra value (it should show as a rename from PartitionSegmentPruner to SinglePartitionColumnSegmentPruner)

@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch 2 times, most recently from 713a12e to 5e2afae Compare April 21, 2022 04:46
@kmozaid kmozaid force-pushed the feature/multi-column-partitioning-offline-table branch from 5e2afae to 2c4e35e Compare April 21, 2022 04:58
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Apr 21, 2022
@Jackie-Jiang Jackie-Jiang merged commit f200ec5 into apache:master Apr 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants