Power of 2 fixed size chunks #7934

richardstartin · 2021-12-20T09:58:33Z

This introduces a simple evolution of the raw index format for fixed-width data types: the only change is that chunk sizes must be a power of 2. For example, if a chunk size of 1000 is chosen, the writer will round up to 1024. This allows the reader to assume that the chunk size is a power of 2 and to replace integer remainder calculations with masks and integer divisions with shifts. The format is otherwise identical.
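To illustrate the arithmetic described above, here is a minimal sketch, not the code in this PR. The class and method names (`PowerOfTwoChunks`, `roundUpToPowerOfTwo`, `shiftFor`, `chunkId`, `offsetInChunk`) are hypothetical, and it assumes non-negative doc IDs and chunk sizes of at most 2^30.

```java
// Hypothetical helper, for illustration only; not part of the Pinot code base.
final class PowerOfTwoChunks {
  private PowerOfTwoChunks() {
  }

  /** Rounds a positive chunk size up to the next power of 2, e.g. 1000 becomes 1024. */
  static int roundUpToPowerOfTwo(int requested) {
    if (requested <= 0 || requested > (1 << 30)) {
      throw new IllegalArgumentException("chunk size out of range: " + requested);
    }
    // highestOneBit(0) is 0, so a requested size of 1 is handled separately.
    return requested == 1 ? 1 : Integer.highestOneBit(requested - 1) << 1;
  }

  /** Number of bits to shift by; equal to log2 of a power-of-2 chunk size. */
  static int shiftFor(int powerOfTwoChunkSize) {
    return Integer.numberOfTrailingZeros(powerOfTwoChunkSize);
  }

  /** Same result as docId / chunkSize when docId >= 0 and chunkSize == 1 << shift. */
  static int chunkId(int docId, int shift) {
    return docId >>> shift;
  }

  /** Same result as docId % chunkSize when docId >= 0 and chunkSize is a power of 2. */
  static int offsetInChunk(int docId, int powerOfTwoChunkSize) {
    return docId & (powerOfTwoChunkSize - 1);
  }
}
```

With a requested chunk size of 1000, the rounded size is 1024 and the shift is 10, so doc ID 2500 maps to chunk 2 at offset 452, the same result as `2500 / 1024` and `2500 % 1024`.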

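This reduces read latency when the index is compressed and the accesses are non-contiguous but there are many accesses per chunk. In the benchmark below, V4 is roughly 17% faster than V3 for doubles and roughly 31% faster for longs: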

Benchmark                                                                    (_blockSize)  (_numBlocks)  Mode  Cnt   Score   Error  Units
BenchmarkFixedByteSVForwardIndexReader.readCompressedDoublesNonContiguousV3         10000          1000  avgt    5  39.976 ± 0.439  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedDoublesNonContiguousV4         10000          1000  avgt    5  33.110 ± 0.588  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedLongsNonContiguousV3           10000          1000  avgt    5  46.568 ± 0.440  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedLongsNonContiguousV4           10000          1000  avgt    5  31.989 ± 0.419  ms/op

codecov-commenter · 2021-12-20T10:32:59Z

Codecov Report

Merging #7934 (b8aa111) into master (71fefe2) will decrease coverage by 0.01%.
The diff coverage is 90.38%.

@@             Coverage Diff              @@
##             master    #7934      +/-   ##
============================================
- Coverage     71.07%   71.05%   -0.02%     
- Complexity     4112     4132      +20     
============================================
  Files          1593     1595       +2     
  Lines         82372    82410      +38     
  Branches      12270    12270              
============================================
+ Hits          58545    58560      +15     
- Misses        19872    19896      +24     
+ Partials       3955     3954       -1

Flag	Coverage Δ
integration1	`28.96% <0.00%> (-0.08%)`	⬇️
integration2	`27.63% <0.00%> (+0.04%)`	⬆️
unittests1	`68.08% <90.38%> (+0.01%)`	⬆️
unittests2	`14.33% <0.00%> (-0.03%)`	⬇️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
.../io/writer/impl/BaseChunkSVForwardIndexWriter.java	`85.96% <0.00%> (ø)`
.../writer/impl/VarByteChunkSVForwardIndexWriter.java	`100.00% <ø> (ø)`
...ment/index/readers/DefaultIndexReaderProvider.java	`70.00% <50.00%> (-3.69%)`	⬇️
...riter/impl/FixedByteChunkSVForwardIndexWriter.java	`96.55% <80.00%> (-3.45%)`	⬇️
...ment/index/readers/forward/ChunkReaderContext.java	`90.90% <90.90%> (ø)`
...readers/forward/BaseChunkSVForwardIndexReader.java	`93.10% <100.00%> (+0.45%)`	⬆️
...ward/FixedBytePower2ChunkSVForwardIndexReader.java	`100.00% <100.00%> (ø)`
...ntroller/helix/core/minion/CronJobScheduleJob.java	`0.00% <0.00%> (-59.10%)`	⬇️
...he/pinot/segment/local/segment/store/IndexKey.java	`75.00% <0.00%> (-5.00%)`	⬇️
.../startree/v2/builder/OffHeapSingleTreeBuilder.java	`87.42% <0.00%> (-4.20%)`	⬇️
... and 18 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Jackie-Jiang

LGTM otherwise

...n/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkSVForwardIndexWriter.java

Jackie-Jiang · 2021-12-21T00:30:29Z

...c/main/java/org/apache/pinot/segment/local/io/writer/impl/BaseChunkSVForwardIndexWriter.java

   */
  protected BaseChunkSVForwardIndexWriter(File file, ChunkCompressionType compressionType, int totalDocs,
-      int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version)
+      int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version, boolean fixed)


I don't think we need this extra fixed here for the version validation. Var-length V4 won't use this writer, but I don't think we should validate that here.

richardstartin

I think it would be calamitous if this were used for variable length data by mistake, given that V4 for variable length data has a different layout. The only resolution would be to delete the data. So I felt it was important to validate.
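For readers following this thread, the kind of guard being discussed could look roughly like the sketch below. It is an assumption-laden illustration, not the validation in this PR: the class name `RawIndexVersionCheck`, the constant, and the exception type are hypothetical, and only the `version` and `fixed` parameters come from the signature in the diff above.

```java
// Hypothetical sketch of a writer-side version check; not the actual Pinot validation.
final class RawIndexVersionCheck {
  // Assumption: version 4 is the power-of-2 layout for fixed-width data, per this discussion.
  static final int POWER_OF_TWO_VERSION = 4;

  private RawIndexVersionCheck() {
  }

  /** Rejects version 4 unless the caller declares that the data is fixed width. */
  static void validate(int version, boolean fixed) {
    if (version == POWER_OF_TWO_VERSION && !fixed) {
      throw new IllegalArgumentException(
          "Version " + version + " of this writer only supports fixed-width data");
    }
  }
}
```

Failing at construction time means a misconfigured writer never produces a file, which is the outcome the reply above argues for.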

richardstartin · 2021-12-21T06:39:44Z

Notes for merging: this will conflict with #7920 and has been kept separate because I tend to create a PR per change, but both need the benchmark. Once one of the two is merged I will need to rebase the benchmark in the other.

richardstartin · 2021-12-22T19:06:45Z

What are the blockers here?

richardstartin force-pushed the power-of-2-fixed-size-chunks branch 3 times, most recently from 2c1e8cf to 418e77c Compare December 20, 2021 11:40

power of 2 fixed-byte chunk reader

01ddc8b

richardstartin force-pushed the power-of-2-fixed-size-chunks branch from 418e77c to 01ddc8b Compare December 20, 2021 19:06

siddharthteotia mentioned this pull request Dec 20, 2021

don't use mmap for compression except for huge chunks in V4 raw index #7931

Merged

Jackie-Jiang approved these changes Dec 21, 2021

change method name

b8aa111

siddharthteotia approved these changes Dec 23, 2021

siddharthteotia merged commit bed2e30 into apache:master Dec 23, 2021
