Skip to content

Conversation

@xiangfu0
Copy link
Contributor

@xiangfu0 xiangfu0 commented Mar 12, 2022

Description

Adding Timestamp index for Pinot data type TIMESTAMP.

Timestamp stores value as millisecond epoch long value. Typical workload for timestamp is filtering on a time range and group by with different time granularities(days/month/etc).

The current implementation requires the query executor to extract values, apply the function then do filter/groupBy, no leverage on dictionary or index.

This PR introduces the Timestamp Index. Users can configure the most useful granularities for a Timestamp data type column.

  1. Pinot will pre-generate one column per time granularity with forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, e.g. Timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.
  2. Query overwrite for predicate and selection/groupby:
    2.1 GROUPBY: functions like dateTrunc('DAY', ts) will be translated to use the underly column $ts$DAY to fetch data.
    2.2 PREDICATE: range index is auto-built for all granularity columns.

Example query usage:

select count(*), datetrunc('WEEK', ts) as tsWeek from airlineStats WHERE tsWeek > fromDateTime('2014-01-16', 'yyyy-MM-dd') group by tsWeek  limit 10

Some preliminary benchmark shows the query perf over 2.7 billion records improved from 45 secs to 4.2 secs

select dateTrunc('YEAR', event_time) as y, dateTrunc('MONTH', event_time) as m,  sum(pull_request_commits) from githubEvents group by y, m limit 1000  Option(timeoutMs=3000000)

image
vs.
image

Release Notes

Introduces the Timestamp Index. Users can configure the most useful granularities for a Timestamp data type column.

  1. Pinot will pre-generate one column per time granularity with forward index and range index. The naming convention is $${ts_column_name}$${ts_granularity}, e.g. Timestamp column ts with granularities DAY, MONTH will have two extra columns generated: $ts$DAY and $ts$MONTH.
  2. Query overwrite for predicate and selection/groupby:
    2.1 SELECTION/GROUPBY: functions like dateTrunc('DAY', ts) will be translated to use the underly column $ts$DAY to fetch data.
    2.2 PREDICATE: range index is auto-built for all granularity columns.

Documentation

@xiangfu0 xiangfu0 marked this pull request as draft March 12, 2022 00:38
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 6 times, most recently from cb8f7d1 to b197bfe Compare March 12, 2022 07:44
@codecov-commenter
Copy link

codecov-commenter commented Mar 12, 2022

Codecov Report

Merging #8343 (a613e80) into master (b2e2a7e) will increase coverage by 0.08%.
The diff coverage is 53.08%.

@@             Coverage Diff              @@
##             master    #8343      +/-   ##
============================================
+ Coverage     70.57%   70.66%   +0.08%     
- Complexity     4283     4285       +2     
============================================
  Files          1672     1674       +2     
  Lines         87483    87640     +157     
  Branches      13241    13273      +32     
============================================
+ Hits          61745    61931     +186     
+ Misses        21454    21398      -56     
- Partials       4284     4311      +27     
Flag Coverage Δ
integration1 27.12% <5.55%> (+0.10%) ⬆️
integration2 26.11% <3.70%> (+0.03%) ⬆️
unittests1 66.99% <66.37%> (-0.01%) ⬇️
unittests2 14.08% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...inot/common/function/scalar/DateTimeFunctions.java 95.23% <ø> (ø)
...he/pinot/segment/local/utils/TableConfigUtils.java 65.13% <0.00%> (-0.29%) ⬇️
...local/segment/index/loader/IndexLoadingConfig.java 66.00% <12.50%> (-4.58%) ⬇️
...roker/requesthandler/BaseBrokerRequestHandler.java 50.63% <15.21%> (-1.43%) ⬇️
...egment/spi/index/metadata/SegmentMetadataImpl.java 73.59% <25.00%> (-0.70%) ⬇️
...local/recordtransformer/ExpressionTransformer.java 87.71% <30.00%> (-12.29%) ⬇️
...pache/pinot/common/config/provider/TableCache.java 76.07% <66.66%> (-0.12%) ⬇️
...ot/spi/config/table/TimestampIndexGranularity.java 81.25% <81.25%> (ø)
...ot/segment/spi/creator/SegmentGeneratorConfig.java 82.49% <87.09%> (+0.40%) ⬆️
...indexsegment/immutable/ImmutableSegmentLoader.java 85.18% <100.00%> (+0.42%) ⬆️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2e2a7e...a613e80. Read the comment docs.

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch from b197bfe to b96fdbd Compare March 12, 2022 09:55
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 2 times, most recently from b85926d to 80ecfe8 Compare March 12, 2022 11:44
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch from 80ecfe8 to 6b76294 Compare March 12, 2022 20:29
@kishoreg
Copy link
Member

Please add description

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch from 0caf992 to 760e539 Compare March 12, 2022 23:58
@xiangfu0
Copy link
Contributor Author

Please add description

Will do. This is still WIP, also need to add unit & integration tests.

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 11 times, most recently from 963231b to de009b2 Compare March 16, 2022 22:59
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 2 times, most recently from fc61d36 to 8e9f7ba Compare April 6, 2022 23:47
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang April 6, 2022 23:49
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 4 times, most recently from f4032e9 to 7b8f7e0 Compare April 7, 2022 20:38
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add some validation to the TableConfigUtils.validate()? The timestamp index can only be applied to columns with millis granularity (TIMESTAMP data type, or DATE_TIME field with EPOCH:MILLISECONDS, or it will give wrong result.

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 2 times, most recently from 86b29f1 to f58f1c6 Compare April 8, 2022 01:19
@xiangfu0
Copy link
Contributor Author

xiangfu0 commented Apr 8, 2022

Can we add some validation to the TableConfigUtils.validate()? The timestamp index can only be applied to columns with millis granularity (TIMESTAMP data type, or DATE_TIME field with EPOCH:MILLISECONDS, or it will give wrong result.

make sense.
Adding the validate logic that this index only supports for TIMESTAMP type for now.

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 4 times, most recently from 427a1d9 to 5885c17 Compare April 8, 2022 06:50
@xiangfu0 xiangfu0 requested a review from Jackie-Jiang April 9, 2022 06:29
@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch 2 times, most recently from adee833 to e3731b3 Compare April 9, 2022 08:01
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiangfu0 xiangfu0 force-pushed the timestamp-type-index branch from e3731b3 to a613e80 Compare April 11, 2022 18:45
@xiangfu0 xiangfu0 added feature release-notes Referenced by PRs that need attention when compiling the next release notes labels Apr 11, 2022
@xiangfu0 xiangfu0 merged commit be163f6 into apache:master Apr 11, 2022
@xiangfu0 xiangfu0 deleted the timestamp-type-index branch April 11, 2022 20:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature release-notes Referenced by PRs that need attention when compiling the next release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants