Allow disabling dict generation for High cardinality columns #8398
Conversation
I'm not sure this is a good idea for …
Codecov Report

@@             Coverage Diff              @@
##             master    #8398      +/-  ##
============================================
+ Coverage     64.10%    64.20%    +0.09%
- Complexity     4267      4284       +17
============================================
  Files          1594      1619       +25
  Lines         84040     85461     +1421
  Branches      12719     13026      +307
============================================
+ Hits          53870     54866      +996
- Misses        26291     26608      +317
- Partials       3879      3987      +108
@richardstartin what if we apply this check only to metric columns and not dimension columns? People are not expected to group by metrics.
That makes sense to me, but I think there's more to this. Total size isn't the only factor; the density of the column also matters for value scans. Say you have 10M unique doubles, which require 76MB of storage without a dictionary. The dictionary would require at least 76MB to store the values, but the column values would require ceil(log_2(10M)) = 24 bits per value, or ~29MB for the entire column.

Now say you are scanning for values in a range: you do two dictionary lookups to map the range bounds into the dictionary's domain, then scan 29MB of data instead of 76MB. Dictionary decoding is scalar but can be explicitly vectorized. You may or may not need to retrieve the values, so you may even avoid decoding the values in the range; the cost of decoding the dictionary ids depends on how many values are retrieved, and may or may not exceed the cost of scanning more data. So this is a decision which requires more inputs than just the relative sizes.
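The arithmetic above can be reproduced with a short sketch. This is a hypothetical class, not Pinot code; it just restates the 10M-unique-doubles example:

```java
// Hypothetical sizing sketch: restates the 10M-unique-doubles arithmetic above.
public class DictSizing {

    // Bits needed per dictionary id: ceil(log2(cardinality)), at least 1.
    static int bitsPerDictId(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    // Raw storage: 8 bytes per double value.
    static long rawBytes(long numDocs) {
        return numDocs * Double.BYTES;
    }

    // The dictionary itself stores each distinct value once.
    static long dictionaryBytes(long cardinality) {
        return cardinality * Double.BYTES;
    }

    // Bit-packed dictionary ids for the whole column.
    static long encodedColumnBytes(long numDocs, long cardinality) {
        return numDocs * bitsPerDictId(cardinality) / 8;
    }

    public static void main(String[] args) {
        long docs = 10_000_000L;          // 10M values...
        long cardinality = 10_000_000L;   // ...all of them unique
        System.out.println("raw:        " + rawBytes(docs) + " bytes");                        // 80,000,000 (~76 MiB)
        System.out.println("dictionary: " + dictionaryBytes(cardinality) + " bytes");          // 80,000,000 (~76 MiB)
        System.out.println("id width:   " + bitsPerDictId(cardinality) + " bits");             // 24
        System.out.println("encoded:    " + encodedColumnBytes(docs, cardinality) + " bytes"); // 30,000,000 (~29 MB)
    }
}
```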
I thought about this some more. The case I mentioned above is likely rare. I think:
@richardstartin I agree with the second part. For the first part though, are you suggesting that, instead of all these checks, we should just return noDict for metric columns as the default behavior?
I would suggest adding this logic into the table config auto-tuner, which can generate the optimized table config. The problem with adding this directly into the table config is that there is no way for the user to force the encoding type for a metric to be dictionary.
So we had implemented a config recommendation rule in the RecommendationEngine (used by LinkedIn). It needs to be improved based on some of the things we have observed after using it for quite some time: we find ourselves recommending noDictionary too aggressively and need to balance the above requirements in that rule.
min and max should just use metadata, like count does. It would be an easy change to make. |
#8408 should make point 1 a non-issue |
I'd like to request reaching consensus here:
If there are more open items, let's reach consensus on them and move forward with this PR.
IMO, #8408 does solve point 1. One thing I can add in this PR from the recommender though is …
@mayankshriv my comment was more of an FYI to share our learnings from the recommender rule engine; I don't mean to increase the scope of this PR. I agree that for high cardinality metrics, no dictionary is a better option. But if we want to outright create metric columns with no dictionary, regardless of cardinality, then we should probably do that under a flag to keep the default behavior the same as today. I will review @richardstartin's PR.
What's the consensus here? Are there any open concerns, or can this PR be merged? |
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/IndexingConfig.java
siddharthteotia left a comment:
LGTM.
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Sorry for the late comment. Can we model this as segment optimization strategies, and design for a list of strategies that can be added over time, each with its own set of properties?
I am only referring to the design of the table config structure, not the code design.
@kishoreg I can do that, but a better way would be to implement something like what we have for Tuners. I can take that over in the next PR.
Shall we merge this?
@KKcorps Merged the PR and labelled the PR as …
For high cardinality metric columns (especially doubles/floats), a dictionary may be an overhead not only in storage space; it also adds one additional hop on read.
During the segment generation phase, we can check whether a column would require more storage with a dictionary than with raw values, and choose not to create a dictionary in that case.
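A minimal sketch of such a size check (hypothetical class and method names, not the actual Pinot segment-creation logic), assuming fixed-width values and bit-packed dictionary ids:

```java
// Hypothetical heuristic: skip the dictionary when raw storage would not be
// larger than dictionary + bit-packed id column. Illustrative only.
public class NoDictionaryHeuristic {

    // Bits per dictionary id: ceil(log2(cardinality)), at least 1.
    static int bitsPerDictId(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    static boolean shouldSkipDictionary(long numDocs, long cardinality, int valueWidthBytes) {
        long rawBytes = numDocs * valueWidthBytes;                // raw-encoded column
        long dictBytes = cardinality * valueWidthBytes;           // the dictionary itself
        long idBytes = numDocs * bitsPerDictId(cardinality) / 8;  // bit-packed id column
        return rawBytes <= dictBytes + idBytes;
    }

    public static void main(String[] args) {
        // 10M unique doubles: the dictionary saves nothing, so skip it.
        System.out.println(shouldSkipDictionary(10_000_000L, 10_000_000L, 8)); // true
        // 10M doubles with only 100 distinct values: the dictionary wins, keep it.
        System.out.println(shouldSkipDictionary(10_000_000L, 100L, 8));        // false
    }
}
```

Note this only captures the storage trade-off; as discussed above, scan density and decode cost are further inputs a real implementation might weigh.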
The feature is off by default and can be enabled by setting optimizeDictionaryEnabled to true.

Release Notes
Documentation
https://docs.pinot.apache.org/configuration-reference/table#table-index-config
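For reference, enabling the flag might look like the following table config fragment. This is an assumption for illustration: it places optimizeDictionaryEnabled under tableIndexConfig (the section backed by the IndexingConfig.java file touched in this PR), and the surrounding fields are placeholders.

```json
{
  "tableName": "myTable_OFFLINE",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "optimizeDictionaryEnabled": true
  }
}
```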