-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Move group-varint encoding/decoding logic to DataOutput/DataInput #12841
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't feel great about the API taking longs, but otherwise I like this change. @uschindler might have opinions on the MemorySegmentIndexInput implementation.
Should we optimize ByteBuffersDataInput similarly (and then delegate in ByteBuffersIndexInput) to yield speedups with older Java releases as well?
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java
Show resolved
Hide resolved
| final int chunkSizePower; | ||
| final Arena arena; | ||
| final MemorySegment[] segments; | ||
| private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe rename to GROUP_VINT_MASKS or something along these lines now that this logic moved to a class which is not only about group vint?
Also in general I prefer having constants before instance members in the class definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, great suggestion!
| * internal state like file position). | ||
| */ | ||
| public abstract class DataOutput { | ||
| BytesRef groupVIntBytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BytesRefBuilder feels like a better fit for how you're using it (using length rather than offset to track the number of written bytes). Also let's make it private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, Thanks for the suggestion!
|
|
||
| final GroupVIntWriter w = new GroupVIntWriter(); | ||
| byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)]; | ||
| Directory dir = FSDirectory.open(createTempDir()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use newFSDirectory to add coverage for all Directory implementations?
| Directory dir = FSDirectory.open(createTempDir()); | |
| Directory dir = newFSDirectory(createTempDir()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| if (curSegment.byteSize() - curPosition < 17) { | ||
| super.readGroupVInt(docs, pos); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have a test that covers this case well at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In TestGroupVInt#testEncodeDecode we use a range of [1-31] bpv and a ragne of [1-128] numValues, For instance if the bpv==2 and numValues==4 it will cover this case?
| * @param docs the array to read ints into. | ||
| * @param offset the offset in the array to start storing ints. | ||
| */ | ||
| public void readGroupVInt(long[] docs, int offset) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this method private? This would force MemorySegmentIndexInput to copy the logic of readGroupVInts but this would also be better encapsulated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, This also reduces virtual function calls.
|
|
||
| // tail vints | ||
| for (; off < limit; off++) { | ||
| writeVInt((int) values[off]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we're moving this to DataOutput, we probably need to check these casts, e.g. with Math.toIntExact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, i like that!
|
And maybe |
easyice
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fast review! I will try to do the similarly optimize use BitUtil.VH_LE_INT#get() in ByteBuffersDataInput and BufferedIndexInput
| * @param docs the array to read ints into. | ||
| * @param offset the offset in the array to start storing ints. | ||
| */ | ||
| public void readGroupVInt(long[] docs, int offset) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, This also reduces virtual function calls.
| * internal state like file position). | ||
| */ | ||
| public abstract class DataOutput { | ||
| BytesRef groupVIntBytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, Thanks for the suggestion!
|
|
||
| // tail vints | ||
| for (; off < limit; off++) { | ||
| writeVInt((int) values[off]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, i like that!
| final int chunkSizePower; | ||
| final Arena arena; | ||
| final MemorySegment[] segments; | ||
| private static final int[] MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, great suggestion!
| if (curSegment.byteSize() - curPosition < 17) { | ||
| super.readGroupVInt(docs, pos); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In TestGroupVInt#testEncodeDecode we use a range of [1-31] bpv and a ragne of [1-128] numValues, For instance if the bpv==2 and numValues==4 it will cover this case?
lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java
Show resolved
Hide resolved
|
|
||
| final GroupVIntWriter w = new GroupVIntWriter(); | ||
| byte[] encoded = new byte[(int) (Integer.BYTES * ForUtil.BLOCK_SIZE * 1.25)]; | ||
| Directory dir = FSDirectory.open(createTempDir()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
I did the similarly optimize in Here is the JMH output for newly optimized, |
| int curPosition = blockOffset; | ||
|
|
||
| byte[] bytes = blocks[blockIndex(pos)].array(); | ||
| docs[offset] = (int) BitUtil.VH_LE_INT.get(bytes, curPosition) & GROUP_VINT_MASKS[n1Minus1]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather work on the ByteBuffer (ie. do ByteBuffer#getInt rather than using var handles on the underlying byte[]), even if performance is a bit better by working directly on the byte[].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpountz Thanks for this suggestion, the performance is similar
| public abstract class DataInput implements Cloneable { | ||
| // the maximum length of a single group-varint is 4 integers + 1 byte flag. | ||
| static final int MAX_LENGTH_PER_GROUP = 17; | ||
| static final int[] GROUP_VINT_MASKS = new int[] {0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would like it better if these constants and fallbackReadGroupVInt were moved to a helper class, e.g. org.apache.lucene.util.GroupVIntUtil.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| } | ||
|
|
||
| /** | ||
| * Encode integers using group-varint. It uses VInt to encode tail values that are not enough for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a javadoc link to #writeVInt? Also mention that all longs are actually required to be integers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, Thanks, i like that
| import org.apache.lucene.util.ArrayUtil; | ||
| import org.apache.lucene.util.packed.PackedInts; | ||
|
|
||
| public class TestGroupVInt extends LuceneTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move these tests to BaseDirectoryTestCase now, to automatically get coverage across all directories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, cool!
|
Hi, Some thoughts:
For both cases above make sure to exactly follow all try-catch blocks and calling to super as done in other bulk read methods. It may also be needed to add a separate test to TestMultiMMap. In general all tests should be in BaseDirecoryTestCase, except specific ones to check reads across buffer/segment borders Also make sure to run tests with Java 17 and Java 19, 20, 21. |
|
In general I am not sure if we really need specialization in Mmapdir. Without a benchmark showing improvement, I'd cancel this. Counting number of virtual calls is bullshit. This is all optimized away by Hotspot's inlining, so don't repeat code just because of a few method calls. CC @rmuir |
|
Hi, Please share assembly output of optimized code, otherwise -1 to apply any changes to MemorySegmentIndexInput and (Mapped)ByteBufferIndexInput. This normally inlines perfectly. |
|
Thank you very much for your suggestions, i had fixed the comments from @jpountz, but the related to Mmapdir will be later(such as java19, java20 support), because we need to confirm whether to change it. i agree that we should be cautious here. in my understanding the optimized code is faster than the main branch probably because the optimized code doesn't need a The JMH output for currently, it's similar to before we replaced java21 @uschindler Thank you for take a look at this, here is the assembly output for assembly code |
|
I ran the benchmark using For task
Here is the benchmark output with java 21, baseline(main), candidate(PR): round 1round 2round 2If there are other approach of benchmark suggestions, i can also do it :) |
jpountz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @easyice, I left some minor comments, it looks close!
| final int n3Minus1 = (flag >> 2) & 0x03; | ||
| final int n4Minus1 = flag & 0x03; | ||
|
|
||
| blockOffset = blockOffset(pos); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already computed blockOffset() above, can we avoid computing it twice, e.g. by replacing the readByte() call with block.get(blockOffset++)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, i did the similar change in BufferedIndexInput#readGroupVInt
| * internal state like file position). | ||
| */ | ||
| public abstract class DataOutput { | ||
| private BytesRefBuilder groupVIntBytes = new BytesRefBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be made final?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I will pay more attention to this issue.
| * @param dst the array to read ints into. | ||
| * @param offset the offset in the array to start storing ints. | ||
| */ | ||
| public static void fallbackReadGroupVInt(DataInput in, long[] dst, int offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need fallback as a prefix anymore now that it's in its own class. Let's document that this is a default implementation and that it is recommended to use DataInput#readGroupVInts directly rather than this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| GroupVIntUtil.fallbackReadGroupVInt(this, dst, offset); | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe copy curSegment to a local variable here, e.g. MemorySegment curSegment = this.curSegment so that the JVM doesn't have to worry about curSegment getting updated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, i wonder what the impact on JVM if the variable maybe updated? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the bigger problem here is that it is not in the try-catch block. when the indexinput is closed this fails with NPE or IllegalStateException.
Everything in this method must be in the try-catch. Look at othe rmethods, the NPE/ISE catch block always covers everything.
| } | ||
|
|
||
| try { | ||
| final int flag = readByte() & 0xFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already did the job of checking that we have enough bytes left, we don't need readByte() to do it for us?
| final int flag = readByte() & 0xFF; | |
| final int flag = curSegment.get(LAYOUT_BYTE, curPosition++) & 0xFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
| // test fallbackReadGroupVInt | ||
| doTestGroupVInt(dir, 5, 1, 6, 8); | ||
|
|
||
| // test large data to covers multiple blocks in ByteBuffersDataInput |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adding it explicitly since this is probably the most important case to cover
| // test large data to covers multiple blocks in ByteBuffersDataInput | |
| // test large data to covers multiple blocks in ByteBuffersDataInput and MMapDirectory via TestMultiMMap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jpountz , i added a separate test to BaseChunkedDirectoryTestCase to make sure use multi blocks, this will both cover MMapDirectory and ByteBuffersDataInput
| * @param values the values to write | ||
| * @param limit the number of values to write. | ||
| */ | ||
| public void writeGroupVInts(long[] values, int limit) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted to mark it @lucene.experimental to allow us to change the signature to use ints in the future. And likewise for its counterpart on DataInput?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay!
you need to install hsdis to be able to see instructions and not just binary data from printassembly. see https://github.com/apache/lucene/blob/main/help/jmh.txt |
|
Thank you @rmuir, the new assembly output is here:) CC @uschindler |
|
Hi, It would be good to have 2 asm dumps: one with your patch and one where the benchmark just calls the inherited default impl on Mmapdir. |
|
@uschindler FYI this is what I'm getting: https://gist.github.com/jpountz/be81b1eb93c6118aac65c3679911f1d8. There are two files: baseline.txt for the default impl, and contender.txt for the optimized impl.
To be clear, this specialization is not about doing the same kind of optimization that the JVM would do by inlining, it's about taking advantage of the fact that many of our Directory implementations can do unaligned random reads under the hood to decode group-varints more efficiently, fewer conditionals specifically. |
|
Hi Adrien, Thanks for the more insight!
This looks much better than the files posted before. This shows that both (the baseline and contender) methods are fully inlined (almost...). So the difference is indeed the internal implementation.
This information wasn't available to me. With unaligned random reads you mean that you read with positional reads from the area where the buffer is saved? With default DataInput you can only go forward, so you need to read everything and also in order. This explains, but I am really not happy to put this very special code into theeI/O layer of Lucene. This makes it unmaintainable, sorry. I have not much time at the moment, but could we for now leave out MMapDirectory (all variants) form the optimization and start small? I have some ideas to rewrite that code to not duplicate the same code with small changes over and over. What you need is some way to read the the values unaligned from an indexed memory area (int for arrays and bytebuffers; long for memorysegments)? This is exactly what VarHandles do! We could write a generic implementation that uses VarHandles to access the underlying bytes. You can get VarHandles from byte arrays (see BitUtil class - you already use that) or from ByteBuffers (works the same way to get a VarHandle to access internals of ByteBuffers direct or on-heap! - P.S.: We could use that to get rid of the stupid Long/FloatBuffers in the legacy MMap-ByteBufferIndexInput - I played around with it already). And finally the accessors of MemorySegment are Varhandles behind the scenes, too (just with long coordinate instead of int). So my idea would be an implementation that uses VarHandles and this one can be used for byte[], ByteBuffer[] and MemorySegment. The impl is basically the same (takes an offset for start of block and the receiver and then decodes everything after that offset). All implementations would create an instance of that GroupVarInt decoder that knows the varhandles and the receiver object). I am a bit busy the next days, but I can maybe look into that at the weekend. Maybe we get a good API without copypasting this horrible complex code that I don't understand ten times into our repository. This is my main issue with this. I do much for performance, but I won't copypaste code everywhere just to get a few percents. Sorry. |
|
The best idea that I have instead of VarHandles: Create an implementation for ByteBuffer (using the methods available there). This implementation would work with:
I would give that a try: It is easy and reuseable as we just use a ByteBuffer as a view on top of the small area. P.S.: That escape analysis works you can see in our Panama Vector Util. Every vector multiplication creates around 20 objects.... |
|
@uschindler Thanks for review!
The optimized decode for an integer value using
do you mean create an implementation to get the current block as |
My idea was to create a utility class with a static method that does the decoding on top of a ByteBuffer. This method takes a ByteBuffer and the same remaining parameters. To allow reuse of bytebuffers also add a startoffset where encoding starts. But it could rely on position(). In the several indexinputs just wrap a slice and call the static method. In ByteBufferIndexInput directly call the method. The only downside of this is that the caller code must know beforehand how large the slice must be. |
|
I appied this for Lucene 10 only (main branch). This also fits changes.txt. Let's figure out if backporting breaks anything. If this is fine in Lucene 9.x, we can move the changes entry. But for now I'd like to more thoroughly test this and see if Mike's production benchmark shows any changes. |
|
I also merged the main barnch into the Java 22+ MMapDirectory implementation to make sure the MMAP tester finds any issues during its runs: https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Linux/ and https://jenkins.thetaphi.de/job/Lucene-MMAPv2-Windows/ |
|
There seems to be a speedup on prefix queries in nightly benchmarks. For reference, here is the benchmark in branch_9x with this PR, the performance of java11: I ran with |
|
Hi, Thanks for the measurements, @easyice. So basically, we can backport the commit without any modifications. I can do this and move change entries afterwards. No PR is needed for that. It's easy. Uwe |
|
Hi @easyice, is my understanding correct? We can backport this PR as is? |
|
Thank you Uwe, in my understanding we shouldn't backport this PR, because it has performance regression on java11 for |
No it hasn't. We excluded MMapDirectory based on ByteBufferIndexInput in this PR. There is no change in ByteBufferIndexInput. |
|
This PR only changes the Java 19-21 MMap's MemorySegmentIndexInput, but not ByteBufferIndexInput as we had a regression in ByteBufferIndexInput in Java 17, too. So it was reverted. Just look at the code! With Java 17 (and therefore Java 11) there is no change in code. |
|
If you have backported this PR and see a difference, it is a relic. Just raise number of repetitions. |
|
As the naming of classes is causing issues all the time, I will open a PR and rename the badly named |
|
Yeah, i backport this PR to branch_9x on my local machine, and run the benchmark with java11, i got performance regression for java11 |
That's cool! |
Ohhh... let me try again.. |
|
You see a relic. There is no slowdown. The baseline code is exactly the same like the one on DataInput base class. The difference may come from different decisions by hotspot regarding inlining possibly because the benchmark runtime was too short. You won't the a difference in production. You are hunting ghosts. I see no reason to not backport this code. In general, I don't think the whole code here will make a big difference in production, as seen by Mike's benchmark. This is the reason why I am not happy in moving the code complexity to the indexinputs at all. So either backport it or revert it in total. These are only microbenchmarks that don't have a large impact at all. If it is twice as fast you will see a difference, but not for the code here. In production the inline decisions by hotspot will be totally different. The benchmark code only runs in isolation. |
|
Thanks for explaining! i agree that the micro benchmarks results are for reference only, so let we backport it :) |
|
Hi @easyice, |
|
Thank you for the backport and all great suggestions! |
|
FYI I pushed an annotation to nightly benchmarks, it should show up tomorrow. |
From issue: #12826
The JMH benchmark with this PR on my Mac (Intel chip) :
java 17
java 21