Skip to content

Conversation

@kmozaid
Copy link
Contributor

@kmozaid kmozaid commented Feb 18, 2022

Description

This PR adds a new implementation of PartitionFunction which is used to partition segments. The new partition function named BoundedColumnValue can be used to partition segments on column values and still generating partitionId of integer type.

Example Usage -

  "segmentPartitionConfig": {
    "columnPartitionMap": {
      "subject": {
        "functionName": "BoundedColumnValue",
        "functionConfig": {
          "columnValues": "Maths|English|Chemistry",
          "columnValuesDelimiter": "|"
        },
        "numPartitions": 4
      }
    }

PartitionId is generated based on position in columnValues. PartitionId would 1 for Maths, 2 for English and so on.
PartitionId 0 is reserved for any other subject which are not present in given columnValues. The different column values can be specified using pipe (|) separation. This additional partition functionConfig is persisted in metadata.properties and segment metadata in zookeeper. Broker can also use this function to prune segments.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

  • No

Does this PR fix a zero-downtime upgrade introduced earlier?

  • No

Does this PR otherwise need attention when creating release notes? Things to consider:

  • New configuration options
  • Deprecation of configurations
  • Signature changes to public methods/interfaces
  • New plugins added or old plugins removed
  • Yes, Adds new configuration option and changes to public interfaces.

Release Notes

Added new partition function called BoundedColumnValue to be able to partition segments based on column value.

Documentation

Yes, will create another PR in pinot-docs repo.

@codecov-commenter
Copy link

codecov-commenter commented Feb 18, 2022

Codecov Report

Merging #8224 (f2f131f) into master (fa0db64) will increase coverage by 5.54%.
The diff coverage is 80.51%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #8224      +/-   ##
============================================
+ Coverage     63.98%   69.53%   +5.54%     
- Complexity     4239     4241       +2     
============================================
  Files          1584     1631      +47     
  Lines         83250    85276    +2026     
  Branches      12608    12844     +236     
============================================
+ Hits          53268    59293    +6025     
+ Misses        26144    21847    -4297     
- Partials       3838     4136     +298     
Flag Coverage Δ
integration2 27.57% <3.89%> (?)
unittests1 66.97% <79.16%> (+0.03%) ⬆️
unittests2 14.10% <3.89%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ocal/indexsegment/mutable/IntermediateSegment.java 69.62% <0.00%> (-0.45%) ⬇️
...ltime/converter/stats/MutableColumnStatistics.java 53.22% <0.00%> (-1.78%) ⬇️
...verter/stats/MutableNoDictionaryColStatistics.java 40.90% <0.00%> (-4.10%) ⬇️
.../loader/defaultcolumn/DefaultColumnStatistics.java 73.07% <0.00%> (-2.93%) ⬇️
...java/org/apache/pinot/segment/spi/V1Constants.java 14.28% <ø> (ø)
...he/pinot/segment/spi/creator/ColumnStatistics.java 100.00% <ø> (ø)
.../core/realtime/PinotLLCRealtimeSegmentManager.java 75.25% <50.00%> (+9.15%) ⬆️
...partition/BoundedColumnValuePartitionFunction.java 70.58% <70.58%> (ø)
...pi/partition/metadata/ColumnPartitionMetadata.java 88.00% <83.33%> (+1.95%) ⬆️
.../routing/segmentpruner/PartitionSegmentPruner.java 61.24% <100.00%> (+0.30%) ⬆️
... and 361 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fa0db64...f2f131f. Read the comment docs.

@kmozaid kmozaid force-pushed the feature/bounded-column-value-partition-function branch from bec3162 to ef0384e Compare February 18, 2022 09:08
@kmozaid kmozaid force-pushed the feature/bounded-column-value-partition-function branch from ef0384e to c840ebf Compare February 18, 2022 09:26
@richardstartin
Copy link
Member

Hi @kmozaid thanks for taking the time to make this contribution. Can you explain what problem this solves? Is it because you already have a partitioning and you want to maintain locality within partitions?

@kmozaid
Copy link
Contributor Author

kmozaid commented Feb 18, 2022

Hi @kmozaid thanks for taking the time to make this contribution. Can you explain what problem this solves? Is it because you already have a partitioning and you want to maintain locality within partitions?

Hi @richardstartin , We have a table where data is being ingested from multiple sources. (these multiple sources pushes data to same kafka topic). Data is kept for 5 days in realtime table and then moved offline table by minion task. We want to keep data from these sources in separate segments for offline table. There is a column which identifies the source. BoundedColumnValue partition function provides capability to keep data from different sources in respective partitioned segments. Later if we want to backfill the data of just one source, then we will be able to do so because we would know what are the segments for given source and replace them by backfill. The main use case is to be able to backfill data of particular source. This is also discussed in slack thread - https://apache-pinot.slack.com/archives/CDRCA57FC/p1643286670255700

image

Just a note regarding the image - The partitionId mentioned in image are source1, source2 and source 3 although they would be integer value based on the their position in configured columnValues attributes.

* }
* }
* }
* With this partition config on column "subject", partitionId would be 1 for Maths, 2 for English and so on.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider putting other values as the last partition in case all values are already configured, and no value goes into partition 0?

Copy link
Contributor Author

@kmozaid kmozaid Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep a fixed value of partitionId for unknown column values. When putting other/unknown values as last partition, the partition id would different every time partition config is changed to add/remove another column value.

When backfilling data for a particular source/tenant (in our example, a subject), we can easily ignore segments which has partitionId set to 0 in segment metadata because we know for sure that it could have data for many different subjects.

@kmozaid kmozaid requested a review from Jackie-Jiang February 22, 2022 11:43
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Please reformat the changes with latest Pinot Style

*/
public ColumnPartitionMetadata(String functionName, int numPartitions, Set<Integer> partitions) {
public ColumnPartitionMetadata(String functionName, int numPartitions, Set<Integer> partitions,
Map<String, String> functionConfig) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(code format)

Suggested change
Map<String, String> functionConfig) {
@Nullable Map<String, String> functionConfig) {

@kmozaid kmozaid force-pushed the feature/bounded-column-value-partition-function branch from 90b2317 to 5bc1ee8 Compare February 25, 2022 02:40
@kmozaid kmozaid force-pushed the feature/bounded-column-value-partition-function branch from 5bc1ee8 to f2f131f Compare February 25, 2022 02:44
@kmozaid
Copy link
Contributor Author

kmozaid commented Feb 25, 2022

Thanks @Jackie-Jiang and @richardstartin for your prompt reviews, appreciate it!

@Jackie-Jiang Jackie-Jiang added the release-notes Referenced by PRs that need attention when compiling the next release notes label Feb 25, 2022
@Jackie-Jiang Jackie-Jiang merged commit 3f98ce3 into apache:master Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants