Conversation

@KKcorps
Contributor

@KKcorps KKcorps commented Mar 24, 2022

For high-cardinality metric columns (especially doubles/floats), a dictionary can be an overhead: it costs extra storage space and adds one additional hop on reads.
During the segment generation phase, we can check whether a column would require more storage with a dictionary than with raw values, and choose not to create the dictionary in that case.

The feature is off by default and can be enabled by setting optimizeDictionaryEnabled to true.
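
For illustration, a minimal sketch of how this flag might appear in the tableIndexConfig section of a table config. The flag name is taken from this description; the exact field name and placement should be verified against the docs linked under Documentation below.

```json
{
  "tableIndexConfig": {
    "optimizeDictionaryEnabled": true
  }
}
```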

Release Notes

  • Add a new configuration to disable dictionaries for single-valued metric columns.

Documentation

https://docs.pinot.apache.org/configuration-reference/table#table-index-config

@richardstartin
Member

I'm not sure this is a good idea for INT or LONG columns, which might be grouped by. The way group by works right now relies on having a dictionary; if we disable the dictionary, one will just be built on the fly in one of the slower group-by implementations. Instead, automatic recommendation based on common queries will guide the user down the path of converting the column to a metric column if it's never used in a group-by query. I'm ambivalent about the change for FLOAT and DOUBLE: recommendation should make this a solved problem, but the change does make sense for these types.

@codecov-commenter

codecov-commenter commented Mar 24, 2022

Codecov Report

Merging #8398 (817a6f0) into master (b05a541) will increase coverage by 0.09%.
The diff coverage is 67.30%.

❗ Current head 817a6f0 differs from pull request most recent head 401488b. Consider uploading reports for the commit 401488b to get more accurate results

@@             Coverage Diff              @@
##             master    #8398      +/-   ##
============================================
+ Coverage     64.10%   64.20%   +0.09%     
- Complexity     4267     4284      +17     
============================================
  Files          1594     1619      +25     
  Lines         84040    85461    +1421     
  Branches      12719    13026     +307     
============================================
+ Hits          53870    54866     +996     
- Misses        26291    26608     +317     
- Partials       3879     3987     +108     
Flag Coverage Δ
unittests1 67.05% <68.43%> (+0.04%) ⬆️
unittests2 14.14% <7.70%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...e/pinot/broker/api/resources/PinotBrokerDebug.java 0.00% <ø> (ø)
...pinot/broker/api/resources/PinotBrokerRouting.java 0.00% <ø> (ø)
.../BrokerResourceOnlineOfflineStateModelFactory.java 41.86% <ø> (ø)
.../helix/BrokerUserDefinedMessageHandlerFactory.java 33.96% <ø> (ø)
...roker/requesthandler/GrpcBrokerRequestHandler.java 0.00% <ø> (ø)
...thandler/SingleConnectionBrokerRequestHandler.java 13.20% <0.00%> (+0.24%) ⬆️
...inot/client/JsonAsyncHttpPinotClientTransport.java 14.08% <0.00%> (+0.75%) ⬆️
...src/main/java/org/apache/pinot/client/Request.java 0.00% <ø> (-60.00%) ⬇️
.../pinot/common/function/DateTimePatternHandler.java 80.00% <0.00%> (-20.00%) ⬇️
...common/function/scalar/TrigonometricFunctions.java 0.00% <0.00%> (ø)
... and 258 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b05a541...401488b.

@KKcorps
Contributor Author

KKcorps commented Mar 24, 2022

@richardstartin what if we put in a check to do this only for metric columns and not dimension columns? People are not expected to group by on metrics.

@richardstartin
Member

@richardstartin what if we put in a check to do this only for metric columns and not dimension columns? People are not expected to group by on metrics.

That makes sense to me.

I think there's more to this, though. Total size isn't the only factor; the density of the column also matters for value scans. Say you have 10M unique doubles, which require ~76MB of storage without a dictionary. The dictionary would require at least 76MB to store the values, but the dictionary-encoded column would only need ceil(log_2(10M)) = 24 bits per value, or ~29MB for the entire column. If you are scanning for values in a range, you do two dictionary lookups to map the range bounds into the dictionary's domain, then scan 29MB of data instead of 76MB. Dictionary decoding is scalar today but can be explicitly vectorized. You may or may not need to retrieve the values, so you may even avoid decoding the values in the range; the cost of decoding the dictionary ids depends on how many values are retrieved, and may or may not exceed the cost of scanning more data. So this is a decision that requires more inputs than just the relative sizes.
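
For reference, a back-of-the-envelope version of the arithmetic above, as a standalone sketch (not Pinot code; the 10M-value figure is the example from this comment):

```java
public class DictionarySizingSketch {
  public static void main(String[] args) {
    long numValues = 10_000_000L;                         // 10M unique doubles (example above)
    long rawForwardIndexBytes = numValues * Double.BYTES; // 80,000,000 B ~ 76 MiB of raw values
    // Dictionary ids only need ceil(log2(10M)) = 24 bits each.
    int bitsPerDictId = 64 - Long.numberOfLeadingZeros(numValues - 1);
    long dictEncodedForwardIndexBytes = numValues * bitsPerDictId / 8; // 30,000,000 B ~ 29 MiB
    long dictionaryBytes = numValues * Double.BYTES;      // the dictionary itself still stores all 10M values
    System.out.printf("raw fwd index: %,d B, dict-encoded fwd index: %,d B, dictionary: %,d B%n",
        rawForwardIndexBytes, dictEncodedForwardIndexBytes, dictionaryBytes);
  }
}
```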

@KKcorps KKcorps marked this pull request as ready for review March 24, 2022 15:36
@KKcorps KKcorps changed the title from "Disable dict generation for High cardinality columns" to "Allow disabling dict generation for High cardinality columns" Mar 24, 2022
@richardstartin
Member

I thought about this some more. The case I mentioned above is likely rare. I think:

  1. Metric columns should not have a dictionary by default. This will generally make aggregations faster.
  2. If a user decides to add a dictionary for a metric column, they should get one no matter what, without needing to understand any other configurations.

@KKcorps
Contributor Author

KKcorps commented Mar 24, 2022

@richardstartin I agree with the second part. For the first part though, are you suggesting that, instead of all these checks, we should just return noDict for metric columns as default behavior?

@Jackie-Jiang
Contributor

I would suggest adding this logic into the table config auto-tuner, which can generate the optimized table config. The problem with adding this directly into the table config is that there would be no way for the user to force the encoding type for a metric to be dictionary.
Adding @walterddr to the discussion.

@siddharthteotia
Contributor

So we had implemented a config recommendation rule in the RecommendationEngine (used by LinkedIn)

https://github.com/apache/pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/recommender/rules/impl/NoDictionaryOnHeapDictionaryJointRule.java

It needs to be improved based on some of the things we have observed after using it for quite some time:

  • First, pure aggregation-only queries slow down significantly (3 ms vs. 300 ms) if the dictionary is not created on the column, because MIN/MAX aggregations can be answered from the dictionary instead of by scanning the table (see the example queries below).

  • Columns that appear in the SELECT list benefit from having no dictionary, because during projection noDictionary avoids the extra hop from the forward index to the dictionary. In some cases we saw a 20% performance improvement in such scenarios by not having a dictionary.

  • Lastly, as also mentioned in this PR, for low-cardinality columns the storage savings from a dictionary can be significant; and regardless of cardinality, especially for STRING columns, predicate evaluation is faster on dictionary codes (native arithmetic) than on varchar/string comparison.

We find ourselves recommending noDictionary too aggressively and need to balance the above requirements in our recommendation engine's rule.
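
To make the first two observations concrete, here are example query shapes; the table and column names (myTable, latencyMillis) are hypothetical, not from this PR.

```sql
-- Pure aggregation: MIN/MAX can be answered from the dictionary (or segment metadata)
-- without scanning the forward index, so dropping the dictionary can hurt this shape.
SELECT MIN(latencyMillis), MAX(latencyMillis) FROM myTable

-- Projection: the values must be materialized anyway, so a raw (noDictionary) column
-- avoids the forward-index-to-dictionary hop and can be noticeably faster.
SELECT latencyMillis FROM myTable LIMIT 1000
```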

@richardstartin
Member

  • First, pure aggregation-only queries slow down significantly (3 ms vs. 300 ms) if the dictionary is not created on the column, because MIN/MAX aggregations can be answered from the dictionary instead of by scanning the table.

min and max should just use metadata, like count does. It would be an easy change to make.

@richardstartin
Member

#8408 should make point 1 a non-issue

@mayankshriv
Contributor

I'd like to request reaching consensus here:

  • Do we agree that using raw values for high-cardinality metrics makes sense? If so, can we enable no-dict for those either by default or under a flag?
  • Do we agree that #8408 (Combine Metadata and Dictionary based plans into single operator) takes care of issue 1 in @siddharthteotia 's comment? And that issues 2 and 3 from the same comment don't apply to this PR?
  • We can independently make the recommendation engine / auto-tuner and this PR use the same heuristic, but that should be outside the scope of this PR.

If there are more open items, let's reach consensus on them and move forward with this PR.

cc: @siddharthteotia @richardstartin @Jackie-Jiang

@KKcorps
Contributor Author

KKcorps commented Mar 26, 2022

IMO, #8408 does solve point 1.
Points 2 and 3 can be addressed by applying this only to numeric metric columns.

One thing I can add in this PR from the recommender, though, is the thresholdMinPercentDictionaryStorageSave config.

@siddharthteotia
Contributor

siddharthteotia commented Mar 29, 2022

I'd like to request reaching consensus here:

  • Do we agree that using raw values for high-cardinality metrics makes sense? If so, can we enable no-dict for those either by default or under a flag?
  • Do we agree that #8408 (Combine Metadata and Dictionary based plans into single operator) takes care of issue 1 in @siddharthteotia 's comment? And that issues 2 and 3 from the same comment don't apply to this PR?
  • We can independently make the recommendation engine / auto-tuner and this PR use the same heuristic, but that should be outside the scope of this PR.

If there are more open items, let's reach consensus on them and move forward with this PR.

cc: @siddharthteotia @richardstartin @Jackie-Jiang

@mayankshriv my comment was more of an FYI to share our learnings from the recommender rule engine. I don't mean to increase the scope of this PR.

I agree that for high-cardinality metrics, no dictionary is a better option.

I think that if we want to outright create metric columns with no dictionary, regardless of cardinality, we should probably do that under a flag to keep the default behavior the same as today.

I will review @richardstartin's PR.

@mayankshriv
Contributor

What's the consensus here? Are there any open concerns, or can this PR be merged?

@KKcorps KKcorps added the Configuration and performance labels Apr 4, 2022
Contributor

@siddharthteotia siddharthteotia left a comment

LGTM.

@kishoreg
Member

kishoreg commented Apr 5, 2022

Sorry for the late comment. Can we model this as segment optimization strategies, and design for a list of strategies that can be added over time, where each strategy has its own set of properties?

@kishoreg
Member

kishoreg commented Apr 5, 2022

I am only referring to the design of the table config structure, not the code design.

@KKcorps
Contributor Author

KKcorps commented Apr 5, 2022

@kishoreg I can do that, but a better way would be to implement something like what we have for Tuners. I can take that up in the next PR.

@siddharthteotia
Contributor

Shall we merge this?

@Jackie-Jiang Jackie-Jiang added the release-notes label Apr 13, 2022
@Jackie-Jiang Jackie-Jiang merged commit 476679d into apache:master Apr 13, 2022
@Jackie-Jiang
Contributor

@KKcorps Merged the PR and labeled it as release-notes. Could you please help add a release notes section to the PR description for the newly added configs and update the Pinot documentation?
