
Conversation

@richardstartin (Member) commented Feb 5, 2022

Aggregation functions aggregate as double to avoid numeric overflow, but converting blocks to double[] too soon has two drawbacks:

  • double[] is twice the size of float[] and int[], so the heap footprint may double when these arrays are retained by the DataBlockCache
  • If the values are read directly from a ForwardIndexReader, reading int/long/float values as double prevents autovectorization, resulting in a performance penalty.
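For a sense of the footprint point, here is a back-of-the-envelope sketch. The 16-byte array header is a typical 64-bit JVM figure and the helper name is illustrative, not anything from this PR:

```java
public class FootprintSketch {
  // Approximate heap size of a primitive array on a typical 64-bit JVM:
  // ~16-byte object header plus the packed elements (padding ignored).
  static long arrayBytes(int length, int bytesPerElement) {
    return 16L + (long) length * bytesPerElement;
  }

  public static void main(String[] args) {
    int n = 1_000_000;
    // int[]/float[] use 4 bytes per element, double[] uses 8, so an eagerly
    // converted block retained by a cache costs roughly twice the heap.
    System.out.println("int[]    ~" + arrayBytes(n, 4) + " bytes");
    System.out.println("double[] ~" + arrayBytes(n, 8) + " bytes");
  }
}
```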

Avoiding Premature Type Conversion

Type conversion slows down bulk reads in ForwardIndexReader. For example, compare reading contiguous long values as long versus as double (this benchmark is in the project and can be run by anyone):

Benchmark                                                     (_blockSize)  (_numBlocks)  Mode  Cnt      Score      Error  Units
BenchmarkFixedByteSVForwardIndexReader.readDoublesBatch              10000          1000  avgt    5  33931.164 ± 4052.520  us/op
BenchmarkFixedByteSVForwardIndexReader.readLongsBatch                10000          1000  avgt    5  14003.773 ± 1625.357  us/op

The root cause is that when values are read into an array from disk, the endianness needs to be swapped. HotSpot has an efficient implementation of this operation, Copy::conjoint_swap, which can't be used when type conversion also needs to be performed, so reading long values as long is more efficient than reading them as double, despite the same amount of data being copied. This can be seen by profiling the benchmark:

....[Hottest Regions]...............................................................................
 62.36%           libjvm.so  Copy::conjoint_swap (21 bytes) 
 30.25%         c2, level 4  org.apache.pinot.perf.BenchmarkFixedByteSVForwardIndexReader::readLongsBatch, version 1447 (101 bytes) 
 
....[Hottest Regions]...............................................................................
 81.73%         c2, level 4  org.apache.pinot.perf.BenchmarkFixedByteSVForwardIndexReader::readDoublesBatch, version 1428 (78 bytes) 
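The difference can be reproduced in miniature with NIO buffers. This is a hedged sketch, not the Pinot reader: the bulk LongBuffer.get is the kind of copy-with-byte-swap the JVM can service with an optimized routine like Copy::conjoint_swap, while per-element widening to double rules that out:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.LongBuffer;

public class SwapSketch {
  // Bulk read of big-endian longs: a single copy that only swaps bytes,
  // eligible for the JVM's optimized swap routine.
  static void readLongs(ByteBuffer raw, long[] out) {
    raw.duplicate().order(ByteOrder.BIG_ENDIAN).asLongBuffer().get(out);
  }

  // Byte swap plus long->double widening per element: the extra conversion
  // means the bulk-swap fast path no longer applies.
  static void readAsDoubles(ByteBuffer raw, double[] out) {
    LongBuffer longs = raw.duplicate().order(ByteOrder.BIG_ENDIAN).asLongBuffer();
    for (int i = 0; i < out.length; i++) {
      out[i] = longs.get(i);
    }
  }
}
```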

This means that a simple sum over a raw column is more efficient when the type conversion is delayed. This can be seen (in a benchmark to be contributed in another PR) when computing a sum over a raw INT column:

master

Benchmark               (_intBaseValue)  (_numRows)                              (_query)  Mode  Cnt      Score     Error  Units
BenchmarkQueries.query                0     1500000  SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5  17719.453 ± 171.739  us/op

branch

Benchmark               (_intBaseValue)  (_numRows)                              (_query)  Mode  Cnt      Score     Error  Units
BenchmarkQueries.query                0     1500000  SELECT SUM(RAW_INT_COL) FROM MyTable  avgt    5  15475.992 ± 537.057  us/op
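The shape of the delayed conversion is roughly the following. This is an illustrative sketch, not the actual SumAggregationFunction code: the stored type is switched on once per block, the loop reads the native array type, and widening to double happens only in the accumulator rather than by materializing a double[]:

```java
public class DelayedConversionSum {
  // One switch per block of values; each case runs a tight loop over the
  // block's native array, widening to double only in the accumulator.
  static double sumBlock(Object values, int length, String storedType) {
    double sum = 0;
    switch (storedType) {
      case "INT": {
        int[] ints = (int[]) values;  // no double[] copy created
        for (int i = 0; i < length; i++) {
          sum += ints[i];
        }
        break;
      }
      case "LONG": {
        long[] longs = (long[]) values;
        for (int i = 0; i < length; i++) {
          sum += longs[i];
        }
        break;
      }
      case "DOUBLE": {
        double[] doubles = (double[]) values;
        for (int i = 0; i < length; i++) {
          sum += doubles[i];
        }
        break;
      }
      default:
        throw new IllegalStateException("Cannot compute sum for non-numeric type: " + storedType);
    }
    return sum;
  }
}
```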

@codecov-commenter commented Feb 5, 2022

Codecov Report

Merging #8139 (9d67cc1) into master (8bbf93a) will increase coverage by 0.06%.
The diff coverage is 91.78%.


@@             Coverage Diff              @@
##             master    #8139      +/-   ##
============================================
+ Coverage     71.39%   71.46%   +0.06%     
+ Complexity     4303     4302       -1     
============================================
  Files          1624     1624              
  Lines         84198    84254      +56     
  Branches      12602    12612      +10     
============================================
+ Hits          60116    60208      +92     
+ Misses        19970    19935      -35     
+ Partials       4112     4111       -1     
Flag Coverage Δ
integration1 28.93% <91.78%> (+0.06%) ⬆️
integration2 27.67% <91.78%> (+<0.01%) ⬆️
unittests1 67.92% <43.83%> (-0.05%) ⬇️
unittests2 14.19% <0.00%> (+<0.01%) ⬆️


Impacted Files Coverage Δ
...y/aggregation/function/SumAggregationFunction.java 95.45% <89.47%> (-4.55%) ⬇️
...y/aggregation/function/MaxAggregationFunction.java 96.36% <92.59%> (-3.64%) ⬇️
...y/aggregation/function/MinAggregationFunction.java 96.36% <92.59%> (-3.64%) ⬇️
...ache/pinot/core/operator/docidsets/OrDocIdSet.java 86.36% <0.00%> (-11.37%) ⬇️
...inot/core/util/SegmentCompletionProtocolUtils.java 57.69% <0.00%> (-7.70%) ⬇️
...a/org/apache/pinot/common/utils/ServiceStatus.java 60.00% <0.00%> (-7.15%) ⬇️
.../pinot/core/query/scheduler/PriorityScheduler.java 80.82% <0.00%> (-2.74%) ⬇️
...not/broker/broker/helix/ClusterChangeMediator.java 77.65% <0.00%> (-2.13%) ⬇️
.../pinot/server/starter/helix/BaseServerStarter.java 57.98% <0.00%> (-1.97%) ⬇️
...controller/helix/core/minion/PinotTaskManager.java 67.51% <0.00%> (-1.83%) ⬇️
... and 26 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

double value = valueArray[i];
if (value > max) {
  max = value;
}

BlockValSet blockValSet = blockValSetMap.get(_expression);
A Contributor commented:

I understand how delaying conversion to double will help with heap usage. A couple of questions on performance:

  • Now every aggregate() call has to execute the switch once. Will there be any perf penalty for this?
  • Why does conversion to double[] prevent auto-vectorization?

Member Author replied:

You can see some numbers on #7920 (read longs as longs vs convert to double) but I’ll pull out some disassembly to demonstrate, as well as attach some numbers here.

Member Author replied:

Regarding the switch statement, there is one per block, so its cost is amortised (just like all the virtual calls we do).

Member Author replied:

@siddharthteotia I've added some rationale and numbers to the PR description, please take a look.

Contributor replied:

Thanks for sharing perf numbers, @richardstartin. I was also curious to see the disassembled code, since you said auto-vectorization is more likely when the values are kept as long. Please share if you have it.

I am good with the PR. Please also check in the benchmark.

Member Author replied:

The vectorized method is Copy::conjoint_swap, as mentioned above, which hsdis does not disassemble.

@richardstartin richardstartin force-pushed the aggregation-delay-conversion-to-double branch from 9ea5254 to 9d67cc1 Compare February 6, 2022 22:07
@siddharthteotia siddharthteotia merged commit 1684aee into apache:master Feb 7, 2022
@Jackie-Jiang (Contributor) left a comment:

This is a great optimization. We should also add it to the group-by and MV aggregation functions.

        break;
      }
      default:
        throw new IllegalStateException("Cannot compute min for non-numeric type: " + blockValSet.getValueType());
Contributor commented:

Fix the exception message to reflect the correct aggregation

Member Author replied:

Will follow up
