Skip to content

Conversation

@KKcorps
Copy link
Contributor

@KKcorps KKcorps commented Oct 4, 2022

  • Disable dicts for JSON and TEXT indexing columns
  • Extend optimizeDictionary config for dimension columns with fixed width as well.

Objective is to reduce the segment size and server space/memory usage.

Release Notes

New config added optimizeDictionary to reduce memory usage due to dictionaries

Documentation

To enable this feature, add the following to your table config

"tableIndexConfig" : {
 "optimizeDictionary" : true
}

@KKcorps KKcorps requested a review from Jackie-Jiang October 4, 2022 11:02
@codecov-commenter
Copy link

codecov-commenter commented Oct 4, 2022

Codecov Report

Merging #9527 (a5026af) into master (f7c5511) will decrease coverage by 54.16%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #9527       +/-   ##
=============================================
- Coverage     70.00%   15.83%   -54.17%     
+ Complexity     4933      175     -4758     
=============================================
  Files          1946     1912       -34     
  Lines        104280   102715     -1565     
  Branches      15808    15624      -184     
=============================================
- Hits          72998    16265    -56733     
- Misses        26157    85253    +59096     
+ Partials       5125     1197     -3928     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 ?
unittests2 15.83% <0.00%> (+0.14%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ment/creator/impl/SegmentColumnarIndexCreator.java 0.00% <0.00%> (-79.61%) ⬇️
...ot/segment/spi/creator/SegmentGeneratorConfig.java 0.00% <0.00%> (-82.00%) ⬇️
.../apache/pinot/spi/config/table/IndexingConfig.java 0.00% <0.00%> (-90.70%) ⬇️
...src/main/java/org/apache/pinot/sql/FilterKind.java 0.00% <0.00%> (-100.00%) ⬇️
...ain/java/org/apache/pinot/core/data/table/Key.java 0.00% <0.00%> (-100.00%) ⬇️
...in/java/org/apache/pinot/spi/utils/BytesUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...n/java/org/apache/pinot/core/data/table/Table.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/pinot/core/data/table/Record.java 0.00% <0.00%> (-100.00%) ⬇️
.../java/org/apache/pinot/core/util/GroupByUtils.java 0.00% <0.00%> (-100.00%) ⬇️
...java/org/apache/pinot/spi/trace/BaseRecording.java 0.00% <0.00%> (-100.00%) ⬇️
... and 1515 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now that this config has been changed to work for both metrics and dimension fields, shouldn't isOptimizeDictionaryForMetrics() and the associated variable indicate that it works for both? maybe this should be a separate flag for dimension since it'll be hard to modify existing configs across all tables to change the name of this one.

In the IndexingConfig there is even a comment that'll need to be updated:

  /**
   * If `optimizeDictionaryForMetrics` enabled, dictionary is not created for the metric columns
   * for which rawIndexSize / forwardIndexSize is less than the `noDictionarySizeRatioThreshold`.
   */

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it makes sense. I am planning to introduce a new config altogether even for json/text.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On thinking about it, It will get pretty confusing for users. I have decided to finally change the config to optimizeDictionary and keep the old config as well for backward compatibility. Have added comments for deprecation though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you've renamed the original config isOptimizeDictionaryForMetrics. Won't this cause backward compatibility issues with tables out there that already use isOptimizeDictionaryForMetrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to keep multiple flags as it creates a lot of confusion for new users. Also afaik, this config is pretty new (introduced in 0.10) and is not used by a lot of folks. I understand your concern though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KKcorps we did some validation on our end and no one is using this config yet, so nothing should break on our end with your change. I'm okay with changing the config name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked some users and seems like they are using the old config. For now, I have kept both but marked the older one as deprecated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term wise, we want to have a flag to automatically choose whether to use fixed-length dictionary, var-length dictionary or raw index. Not sure if we want to name it optimizeDictionary, but I don't have a good name either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need config for that? Isn't it determined by data type of the column?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want a config to give more control to the user so that they can explicitly choose the index to create if desired. In certain scenarios, raw index might be able to save space, but not good for query performance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is out of scope for this PR. That will anyways not be a boolean config so we can decide on that later. The config name for that would be more like dictionaryType

Jackie-Jiang
Jackie-Jiang previously approved these changes Oct 31, 2022
@Jackie-Jiang Jackie-Jiang dismissed their stale review October 31, 2022 18:22

After a second thought, need more discussion

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?

Comment on lines 376 to 377


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Remove extra empty lines

@KKcorps
Copy link
Contributor Author

KKcorps commented Nov 2, 2022

I can see that for json and text column, we might not want to create dictionary, but for other dimensions, in most cases we still want to create dictionaries, or a lot of indexes cannot be applied.
With the current change, for existing users who have optimize dictionary set for metrics, this will automatically apply that to dimensions, which can cause serious regression (inverted index cannot be added).
How about adding a config to only apply this to json/text column?

Actually the reason for this change was to introduce this config for dimension columns (after complaints about space amplification and memory usage from users).
Json and text index got introduced later in the scope.
IMO, what we can do though is then introduce a seperate metric optimizeDictionaryForDimensions but mention the risk with setting this config.

users do have cases where they keep String columns as dimensions but don't really do any filtering on top of them.

@Jackie-Jiang
Copy link
Contributor

In that case, we can keep both optimizeDictionary (apply to both dimensions and metrics) and optimizeDictionaryForMetrics (only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics

@KKcorps
Copy link
Contributor Author

KKcorps commented Nov 21, 2022

In that case, we can keep both optimizeDictionary (apply to both dimensions and metrics) and optimizeDictionaryForMetrics (only apply to metrics) to avoid backward incompatible. I don't see a case where user only want to optimize dimensions but not metrics

Makes sense. Incorporated this now.

@KKcorps KKcorps requested review from Jackie-Jiang and removed request for somandal November 22, 2022 04:46
@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes Configuration Config changes (addition/deletion/change in behavior) labels Nov 22, 2022
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please add a release note section and documentation for the new added config

@KKcorps KKcorps merged commit 2140954 into apache:master Nov 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Configuration Config changes (addition/deletion/change in behavior) feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants