@richardstartin
Member

This introduces a simple evolution of the raw index format for fixed width data types which merely enforces that chunk sizes are a power of 2. For example, if a chunk size of 1000 is chosen, the writer will round up to 1024. This allows the reader to assume that the chunk size is a power of 2 and replace integer remainder calculations and divisions with masks and shifts respectively. The format is otherwise identical.
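To illustrate the arithmetic described above, here is a minimal sketch, not the code in this PR. The class and method names are hypothetical. It rounds a requested chunk size up to a power of 2, then locates a document with a shift and a mask instead of a division and a remainder.

```java
final class PowerOfTwoChunks {
  private PowerOfTwoChunks() {
  }

  /** Rounds a positive chunk size up to the next power of 2, e.g. 1000 becomes 1024. */
  static int roundUpToPowerOfTwo(int n) {
    if (n <= 0 || n > (1 << 30)) {
      throw new IllegalArgumentException("chunk size out of range: " + n);
    }
    // highestOneBit(n - 1) is the largest power of 2 strictly below n (for n > 1)
    return n == 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
  }

  /** Chunk index for a docId; equivalent to docId / chunkSize when chunkSize == 1 << shift. */
  static int chunkId(int docId, int shift) {
    return docId >>> shift;
  }

  /** Position within the chunk; equivalent to docId % chunkSize when mask == chunkSize - 1. */
  static int offsetInChunk(int docId, int mask) {
    return docId & mask;
  }

  public static void main(String[] args) {
    int chunkSize = roundUpToPowerOfTwo(1000);
    int shift = Integer.numberOfTrailingZeros(chunkSize);
    int mask = chunkSize - 1;
    int docId = 2500;
    System.out.println("chunkSize=" + chunkSize
        + " chunkId=" + chunkId(docId, shift)
        + " offset=" + offsetInChunk(docId, mask));
  }
}
```

The shift and mask can be computed once when the reader is constructed, so each lookup avoids an integer division, which matters when there are many accesses per chunk.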

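This reduces read time when the index is compressed and the accesses are non-contiguous but there are many accesses per chunk. In the benchmark below, average time per operation drops by about 17% for doubles and about 31% for longs (V3 vs. V4):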

Benchmark                                                                    (_blockSize)  (_numBlocks)  Mode  Cnt   Score   Error  Units
BenchmarkFixedByteSVForwardIndexReader.readCompressedDoublesNonContiguousV3         10000          1000  avgt    5  39.976 ± 0.439  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedDoublesNonContiguousV4         10000          1000  avgt    5  33.110 ± 0.588  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedLongsNonContiguousV3           10000          1000  avgt    5  46.568 ± 0.440  ms/op
BenchmarkFixedByteSVForwardIndexReader.readCompressedLongsNonContiguousV4           10000          1000  avgt    5  31.989 ± 0.419  ms/op

@codecov-commenter

codecov-commenter commented Dec 20, 2021

Codecov Report

Merging #7934 (b8aa111) into master (71fefe2) will decrease coverage by 0.01%.
The diff coverage is 90.38%.

@@             Coverage Diff              @@
##             master    #7934      +/-   ##
============================================
- Coverage     71.07%   71.05%   -0.02%     
- Complexity     4112     4132      +20     
============================================
  Files          1593     1595       +2     
  Lines         82372    82410      +38     
  Branches      12270    12270              
============================================
+ Hits          58545    58560      +15     
- Misses        19872    19896      +24     
+ Partials       3955     3954       -1     
Flag          Coverage Δ
integration1  28.96% <0.00%> (-0.08%) ⬇️
integration2  27.63% <0.00%> (+0.04%) ⬆️
unittests1    68.08% <90.38%> (+0.01%) ⬆️
unittests2    14.33% <0.00%> (-0.03%) ⬇️

Impacted Files                                         Coverage Δ
.../io/writer/impl/BaseChunkSVForwardIndexWriter.java  85.96% <0.00%> (ø)
.../writer/impl/VarByteChunkSVForwardIndexWriter.java  100.00% <ø> (ø)
...ment/index/readers/DefaultIndexReaderProvider.java  70.00% <50.00%> (-3.69%) ⬇️
...riter/impl/FixedByteChunkSVForwardIndexWriter.java  96.55% <80.00%> (-3.45%) ⬇️
...ment/index/readers/forward/ChunkReaderContext.java  90.90% <90.90%> (ø)
...readers/forward/BaseChunkSVForwardIndexReader.java  93.10% <100.00%> (+0.45%) ⬆️
...ward/FixedBytePower2ChunkSVForwardIndexReader.java  100.00% <100.00%> (ø)
...ntroller/helix/core/minion/CronJobScheduleJob.java  0.00% <0.00%> (-59.10%) ⬇️
...he/pinot/segment/local/segment/store/IndexKey.java  75.00% <0.00%> (-5.00%) ⬇️
.../startree/v2/builder/OffHeapSingleTreeBuilder.java  87.42% <0.00%> (-4.20%) ⬇️
... and 18 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@richardstartin force-pushed the power-of-2-fixed-size-chunks branch 3 times, most recently from 2c1e8cf to 418e77c (December 20, 2021 11:40)
Contributor

@Jackie-Jiang left a comment

LGTM otherwise

   */
  protected BaseChunkSVForwardIndexWriter(File file, ChunkCompressionType compressionType, int totalDocs,
-     int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version)
+     int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version, boolean fixed)
Contributor

I don't think we need the extra fixed parameter here for the version validation. Var-length V4 won't use this writer, but I don't think we should validate that here.

Member Author

I think it would be calamitous if this were used for variable length data by mistake, given that V4 for variable length data has a different layout. The only resolution would be to delete the data. So I felt it was important to validate.
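For readers following the thread, the kind of guard under discussion could look roughly like the sketch below. This is an assumption-laden illustration, not the validation in this PR: the class name, the constant, and the exact rule (reject version 4 and above unless the data is fixed width) are hypothetical.

```java
final class WriterVersionGuard {
  // Assumption for illustration: in this writer, version 4 denotes the
  // power-of-2 fixed-width layout and must not be used for variable-length data.
  static final int FIXED_WIDTH_ONLY_VERSION = 4;

  private WriterVersionGuard() {
  }

  /** Fails fast if a fixed-width-only version is requested for variable-length data. */
  static void validate(int version, boolean fixed) {
    if (version >= FIXED_WIDTH_ONLY_VERSION && !fixed) {
      throw new IllegalArgumentException(
          "version " + version + " is only supported for fixed-width data in this writer");
    }
  }

  public static void main(String[] args) {
    validate(4, true);   // fixed width, version 4: accepted
    validate(3, false);  // variable length, older version: accepted
    try {
      validate(4, false);  // variable length, version 4: rejected
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing at construction time costs one comparison per writer, whereas a mislabelled segment would only surface later as unreadable data.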

@richardstartin
Member Author

Notes for merging: this will conflict with #7920 and has been kept separate because I tend to create a PR per change, but both need the benchmark. Once one of the two is merged I will need to rebase the benchmark in the other.

@richardstartin
Member Author

What are the blockers here?

@siddharthteotia merged commit bed2e30 into apache:master Dec 23, 2021