-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Power of 2 fixed size chunks #7934
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Power of 2 fixed size chunks #7934
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7934 +/- ##
============================================
- Coverage 71.07% 71.05% -0.02%
- Complexity 4112 4132 +20
============================================
Files 1593 1595 +2
Lines 82372 82410 +38
Branches 12270 12270
============================================
+ Hits 58545 58560 +15
- Misses 19872 19896 +24
+ Partials 3955 3954 -1
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
2c1e8cf to
418e77c
Compare
418e77c to
01ddc8b
Compare
Jackie-Jiang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM otherwise
...n/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkSVForwardIndexWriter.java
Outdated
Show resolved
Hide resolved
| */ | ||
| protected BaseChunkSVForwardIndexWriter(File file, ChunkCompressionType compressionType, int totalDocs, | ||
| int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version) | ||
| int numDocsPerChunk, int chunkSize, int sizeOfEntry, int version, boolean fixed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this extra fixed here for the version validation. Var-length V4 won't use this writer, but I don't think we should validate that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be calamitous if this were used for variable length data by mistake, given that V4 for variable length data has a different layout. The only resolution would be to delete the data. So I felt it was important to validate.
|
Notes for merging: this will conflict with #7920 and has been kept separate because I tend to create a PR per change, but both need the benchmark. Once one of the two is merged I will need to rebase the benchmark in the other. |
|
What are the blockers here? |
This introduces a simple evolution of the raw index format for fixed width data types which merely enforces that chunk sizes are a power of 2. For example, if a chunk size of 1000 is chosen, the writer will round up to 1024. This allows the reader to assume that the chunk size is a power of 2 and replace integer remainder calculations and divisions with masks and shifts respectively. The format is otherwise identical.
This has a good impact when the index is compressed and the accesses are non-contiguous but there are many accesses per chunk: