a two-level succinct data structure that provides constant-time rank operations for dense RoaringBitmaps#813
Conversation
…operations for dense RoaringBitmaps
There was a problem hiding this comment.
Pull request overview
This PR introduces a new succinct rank data structure for RoaringBitmaps that provides constant-time rank operations while minimizing memory overhead. The implementation uses a two-level indexing scheme with lookup tables for efficient queries on dense bitmaps.
Key Changes
- Added
SuccinctRankclass that builds an immutable rank structure from a RoaringBitmap, optimized for memory-constrained environments with read-heavy workloads - Implements two query strategies: linear scan for small bitmaps (≤16 containers) and a succinct two-level index for larger bitmaps
- Includes comprehensive test coverage with 894 lines of tests covering edge cases, container types, and validation against RoaringBitmap's native rank implementation
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| roaringbitmap/src/main/java/org/roaringbitmap/SuccinctRank.java | Core implementation of the succinct rank data structure with build() method, rank() queries, and optimized bitmap container handling |
| roaringbitmap/src/test/java/org/roaringbitmap/TestSuccinctRank.java | Comprehensive test suite covering empty bitmaps, sparse/dense patterns, multiple container types, edge cases, and correctness validation |
| jmh/src/jmh/java/org/roaringbitmap/RankBenchmark.java | JMH benchmark comparing SuccinctRank against FastRankRoaringBitmap and native RoaringBitmap rank operations across different data distributions |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** | ||
| * Succinct rank structure for RoaringBitmap. | ||
| * | ||
| * @author gerald.green | ||
| * @since Dec-2025 | ||
| */ |
There was a problem hiding this comment.
The class-level documentation is very minimal. For a new data structure class, consider adding more detailed documentation explaining:
- What a succinct rank structure is
- What problem it solves compared to existing solutions (e.g.,
FastRankRoaringBitmap) - When to use it (the PR description mentions "memory-constrained environments" and "read-heavy workloads")
- Trade-offs (e.g., immutability requirement, build cost vs. query performance)
- Basic usage example
This would help users understand when and how to use this class effectively.
| } | ||
|
|
||
| public RoaringBitmap snapshot() { | ||
| return this.bitmap; |
There was a problem hiding this comment.
The snapshot() method returns the internal bitmap reference directly without cloning it. This is misleading because:
- The method name "snapshot" typically implies a copy or immutable view
- Users receiving this reference can modify the bitmap, which would break the rank structure's invariants
- The build() method's warning about not modifying the bitmap is not enforced
Consider either:
- Returning
bitmap.clone()to provide a true snapshot - Renaming the method to something like
getUnderlyingBitmap()with clear documentation that modification is unsafe - Making the method package-private if it's only for testing
| return this.bitmap; | |
| return this.bitmap.clone(); |
There was a problem hiding this comment.
@geraldpgreen Can you review this criticism, it seems valid.
There was a problem hiding this comment.
part of the assumption in this implementation is that the underlying bitmap is immutable. and for our use case, we always create new ones so it allows us to build it faster. let me adjust the name of the method to make it more clear
|
|
||
| return rank; | ||
| } | ||
|
|
There was a problem hiding this comment.
Missing javadoc for public method cardinality(). Consider adding documentation explaining that this returns the total number of set bits in the bitmap.
| /** | |
| * Returns the total number of set bits in the bitmap. | |
| * | |
| * @return the cardinality (number of set bits) of the bitmap | |
| */ |
There was a problem hiding this comment.
@geraldpgreen All public methods need to be documented.
| public boolean usesLinearScan() { | ||
| return this.highBits == null; | ||
| } | ||
|
|
There was a problem hiding this comment.
Missing javadoc for public method containerCount(). Consider adding documentation explaining that this returns the number of containers in the underlying bitmap.
| /** | |
| * Returns the number of containers in the underlying bitmap. | |
| * | |
| * @return the number of containers in the underlying bitmap | |
| */ |
| public long cardinality() { | ||
| return this.cumulativePerContainer[containerCount()]; | ||
| } | ||
|
|
There was a problem hiding this comment.
Missing javadoc for public method snapshot(). Consider adding documentation explaining what this method returns and its intended usage. Given the method name, clarify whether this is a copy or a reference to the underlying bitmap.
| /** | |
| * Returns a reference to the underlying {@link RoaringBitmap} used by this {@code SuccinctRank}. | |
| * <p> | |
| * Note: This does <b>not</b> return a copy. Modifications to the returned bitmap may affect | |
| * the state of this {@code SuccinctRank} instance. | |
| * | |
| * @return the underlying {@link RoaringBitmap} instance | |
| */ |
| public RoaringBitmap snapshot() { | ||
| return this.bitmap; | ||
| } | ||
|
|
There was a problem hiding this comment.
Missing javadoc for public method usesLinearScan(). Consider adding documentation explaining when linear scan is used (bitmaps with <= 16 containers) vs. the succinct structure, and what performance implications this has.
| /** | |
| * Returns {@code true} if this rank structure uses a linear scan for rank queries, | |
| * which occurs when the bitmap contains {@code <= 16} containers. In this case, | |
| * rank queries are performed by scanning each container sequentially, which is | |
| * efficient for small bitmaps but less performant for larger ones. | |
| * <p> | |
| * For bitmaps with more than 16 containers, a succinct rank structure is used, | |
| * enabling faster rank queries via precomputed data structures at the cost of | |
| * additional memory usage. | |
| * <p> | |
| * The choice between linear scan and succinct structure impacts both query | |
| * performance and memory footprint. | |
| * | |
| * @return {@code true} if linear scan is used; {@code false} if succinct structure is used | |
| */ |
| * @since Dec-2025 | ||
| */ | ||
| public class SuccinctRank { | ||
|
|
There was a problem hiding this comment.
Consider adding a comment explaining why 16 is the threshold value for choosing between linear scan and succinct structure. This would help future maintainers understand the rationale behind this magic number.
| /** | |
| * Threshold for choosing between linear scan and succinct structure. | |
| * For bitmaps with <= 16 containers, linear scan is faster and uses less memory. | |
| * This value was chosen empirically based on performance benchmarks. | |
| */ |
| source, highBits, highRankCount, cumulativePerContainer, containerCumulativeRanks); | ||
| } | ||
|
|
||
| // Two-level rank index: superblock counts + packed block counts |
There was a problem hiding this comment.
Consider expanding this comment to explain what the two-level rank index is used for and how it enables O(1) rank queries. This would help maintainers understand the data structure better. For example: "Two-level rank index for high-key lookups: superblocks store absolute counts, packed blocks store relative counts within each superblock, enabling O(1) rank queries."
| // Two-level rank index: superblock counts + packed block counts | |
| /** | |
| * Two-level rank index for high-key lookups: | |
| * - Superblocks store absolute counts of set bits up to the start of each superblock. | |
| * - Packed blocks store relative counts of set bits within each superblock. | |
| * This structure enables O(1) rank queries by allowing fast computation of the number | |
| * of set bits up to any given key: first by retrieving the superblock's absolute count, | |
| * then adding the relative count from the packed block, and finally counting bits within | |
| * the target word. This is crucial for efficient high-key rank queries in RoaringBitmap. | |
| */ |
|
|
||
| for (int j = 1; j < superblockLimit; j++) { | ||
| packed |= (blockCumulative & BLOCK_MASK) << (BITS_PER_PACKED_BLOCK * (j - 1)); | ||
| blockCumulative += Long.bitCount(highBits[wordIdx + j]); |
There was a problem hiding this comment.
This array access might be out of bounds, as the index might be equal to the array length + 6.
|
@geraldpgreen Thanks. We are likely to merge this. I think it is a good PR. AI found minor issues which I think you can quickly address. |
|
@lemire I think the feedback has been addressed in the latest iteration and let me know if it looks good for merging. |
|
Running tests. |
…
SUMMARY
When to use SuccinctRank:
Automated Checks
./gradlew testand made sure that my PR does not break any unit test.