-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Do not create dictionary for high-cardinality columns #9527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9527 +/- ##
=============================================
- Coverage 70.00% 15.83% -54.17%
+ Complexity 4933 175 -4758
=============================================
Files 1946 1912 -34
Lines 104280 102715 -1565
Branches 15808 15624 -184
=============================================
- Hits 72998 16265 -56733
- Misses 26157 85253 +59096
+ Partials 5125 1197 -3928
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now that this config has been changed to work for both metrics and dimension fields, shouldn't isOptimizeDictionaryForMetrics() and the associated variable indicate that it works for both? maybe this should be a separate flag for dimension since it'll be hard to modify existing configs across all tables to change the name of this one.
In the IndexingConfig there is even a comment that'll need to be updated:
/**
* If `optimizeDictionaryForMetrics` enabled, dictionary is not created for the metric columns
* for which rawIndexSize / forwardIndexSize is less than the `noDictionarySizeRatioThreshold`.
*/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it makes sense. I am planning to introduce a new config altogether even for json/text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On thinking about it, It will get pretty confusing for users. I have decided to finally change the config to optimizeDictionary and keep the old config as well for backward compatibility. Have added comments for deprecation though.
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
...in/java/org/apache/pinot/segment/local/segment/creator/impl/SegmentColumnarIndexCreator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you've renamed the original config isOptimizeDictionaryForMetrics. Won't this cause backward compatibility issues with tables out there that already use isOptimizeDictionaryForMetrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to keep multiple flags as it creates a lot of confusion for new users. Also afaik, this config is pretty new (introduced in 0.10) and is not used by a lot of folks. I understand your concern though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KKcorps we did some validation on our end and no one is using this config yet, so nothing should break on our end with your change. I'm okay with changing the config name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked some users and seems like they are using the old config. For now, I have kept both but marked the older one as deprecated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long term wise, we want to have a flag to automatically choose whether to use fixed-length dictionary, var-length dictionary or raw index. Not sure if we want to name it optimizeDictionary, but I don't have a good name either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need config for that? Isn't it determined by data type of the column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want a config to give more control to the user so that they can explicitly choose the index to create if desired. In certain scenarios, raw index might be able to save space, but not good for query performance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is out of scope for this PR. That will anyways not be a boolean config so we can decide on that later. The config name for that would be more like dictionaryType
8d877d3 to
dbc5e48
Compare
dbc5e48 to
9702077
Compare
9702077 to
4c84cbb
Compare
After a second thought, need more discussion
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) Remove extra empty lines
Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users). users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them. |
|
In that case, we can keep both |
Makes sense. Incorporated this now. |
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please add a release note section and documentation for the new added config
optimizeDictionaryconfig for dimension columns with fixed width as well.Objective is to reduce the segment size and server space/memory usage.
Release Notes
New config added
optimizeDictionaryto reduce memory usage due to dictionariesDocumentation
To enable this feature, add the following to your table config