Skip to content

Conversation

@noon-stripe
Copy link
Contributor

@noon-stripe noon-stripe commented Apr 28, 2022

Description

This PR adds Ingestion Aggregation. The design doc can be found #8360.

This feature aggregates values at ingestion time, which reduces the number of rows stores (thus speeding up queries), using the same strategy as the 'aggregateMetrics' feature. This expands the feature to include, COUNT, MIN, and MAX, with the ability to expand further.

Upgrade Notes

Does this PR prevent a zero down-time upgrade? (Assume upgrade order: Controller, Broker, Server, Minion)

Yes (Please label as backward-incompat, and complete the section below on Release Notes)
Does this PR fix a zero-downtime upgrade introduced earlier?

Yes (Please label this as backward-incompat, and complete the section below on Release Notes)
Does this PR otherwise need attention when creating release notes? Things to consider:

New configuration options

Deprecation of configurations
Signature changes to public methods/interfaces
New plugins added or old plugins removed
Yes (Please label this PR as release-notes and complete the section on Release Notes)

Release Notes

Ingestion Pre-Aggregation is now supported for MIN, MAX, and COUNT, in addition to SUM. To enable the feature, add an aggregationConfig to the ingestionConfigs of a realtime table config. The format of the config is

"aggregationConfigs": [
{
"columnName": "destColumn",
"aggregationFunction": "MIN(srcColumn)"
}
],
The destColumn must be in the schema and the srcColumn must not be in the schema. Additionally, all destColumns must be noDictionaryColumns.

Documentation

After this PR I will add the documentation.

@codecov-commenter
Copy link

codecov-commenter commented Apr 28, 2022

Codecov Report

Merging #8611 (0d7bfd1) into master (2c3813b) will decrease coverage by 55.33%.
The diff coverage is 0.00%.

@@              Coverage Diff              @@
##             master    #8611       +/-   ##
=============================================
- Coverage     69.57%   14.24%   -55.34%     
+ Complexity     4572      168     -4404     
=============================================
  Files          1729     1684       -45     
  Lines         90179    88351     -1828     
  Branches      13413    13223      -190     
=============================================
- Hits          62744    12585    -50159     
- Misses        23083    74829    +51746     
+ Partials       4352      937     -3415     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 ?
unittests2 14.24% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...he/pinot/common/utils/config/TableConfigUtils.java 0.00% <0.00%> (-83.83%) ⬇️
...che/pinot/controller/util/FileIngestionHelper.java 91.02% <ø> (ø)
...manager/realtime/LLRealtimeSegmentDataManager.java 0.00% <0.00%> (-70.91%) ⬇️
...local/indexsegment/mutable/MutableSegmentImpl.java 0.00% <0.00%> (-57.47%) ⬇️
...ent/local/realtime/impl/RealtimeSegmentConfig.java 0.00% <0.00%> (-92.13%) ⬇️
...ache/pinot/segment/local/utils/IngestionUtils.java 0.00% <0.00%> (-25.16%) ⬇️
...he/pinot/segment/local/utils/TableConfigUtils.java 0.00% <0.00%> (-65.14%) ⬇️
.../spi/config/table/ingestion/AggregationConfig.java 0.00% <0.00%> (ø)
...ot/spi/config/table/ingestion/IngestionConfig.java 0.00% <0.00%> (-50.00%) ⬇️
...g/apache/pinot/spi/utils/IngestionConfigUtils.java 0.00% <0.00%> (-76.06%) ⬇️
... and 1360 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2c3813b...0d7bfd1. Read the comment docs.

@noon-stripe noon-stripe changed the title Merge Ingestion Aggregation Feature Ingestion Aggregation Feature Apr 29, 2022
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is in general good. We can make it more robust

Copy link
Contributor Author

@noon-stripe noon-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comments

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we are just maintaining the map between column to its source column and the value aggregator in the IngestionAggregator. Suggest removing the IngestionAggregator and add the source column and value aggregator into the IndexContainer. This way we can avoid a lot of unnecessary map lookups

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly good. Well done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this check after the null value check, or it will throw NPE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which check? I believe these are in the correct order, right?

            FieldSpec fieldSpec = schema.getFieldSpecFor(columnName);
            Preconditions.checkState(fieldSpec != null, "The destination column '" + columnName
                + "' of the aggregation function must be present in the schema");
            Preconditions.checkState(fieldSpec.getFieldType() == FieldSpec.FieldType.METRIC,
                "The destination column '" + columnName + "' of the aggregation function must be a metric column");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this block before the field spec check:

          String aggregationFunction = aggregationConfig.getAggregationFunction();
          if (columnName == null || aggregationFunction == null) {
            throw new IllegalStateException(
                "columnName/aggregationFunction cannot be null in AggregationConfig " + aggregationConfig);
          }

Without checking the columnName is not null, schema.getFieldSpecFor(columnName) could throw NPE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider using SUM on metric itself as the default if not configured?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good suggestion, but I actually prefer to have the aggregation config be a requirement. Otherwise, you're altering the data without informing the user.

Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this block before the field spec check:

          String aggregationFunction = aggregationConfig.getAggregationFunction();
          if (columnName == null || aggregationFunction == null) {
            throw new IllegalStateException(
                "columnName/aggregationFunction cannot be null in AggregationConfig " + aggregationConfig);
          }

Without checking the columnName is not null, schema.getFieldSpecFor(columnName) could throw NPE

@Jackie-Jiang Jackie-Jiang added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels May 18, 2022
@Jackie-Jiang Jackie-Jiang merged commit ae02ece into apache:master May 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants