Concurrent HNSW Merge #12660
Conversation
This looks like a great start. I worked up a very similar PR but I think I have some concurrency bugs around the update/add of new entry points in the upper graph levels. I might post mine anyway because it has some syntax sugar I think would be nice to bring over here. One idea I had was to create a NeighborArrayIterator that would handle read-locking in the OnHeapGraph case with another implementation that just encapsulates the iteration with no locking for use by other (off-heap) HnswGraph. With that and a little refactoring we can use the same HnswGraphSearcher for both cases?
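For illustration, a minimal sketch of the iterator idea described above, with hypothetical names and APIs (this is not code from the PR or from Lucene): one implementation iterates a read-only (off-heap) neighbor list with no locking, while the other holds a read lock on the on-heap list and releases it via close().

    import java.io.Closeable;
    import java.util.concurrent.locks.ReadWriteLock;

    /** Sketch only: names and APIs are illustrative, not the PR's code. */
    interface NeighborIterator extends Closeable {
      int size();

      int neighbor(int i);

      @Override
      default void close() {} // no-op for the lock-free case
    }

    /** For read-only (off-heap) graphs: plain iteration, no locking. */
    record PlainNeighbors(int[] nodes, int count) implements NeighborIterator {
      public int size() {
        return count;
      }

      public int neighbor(int i) {
        return nodes[i];
      }
    }

    /** For the on-heap graph being built: hold the read lock until close(). */
    final class LockedNeighbors implements NeighborIterator {
      private final int[] nodes;
      private final int count;
      private final ReadWriteLock lock;

      LockedNeighbors(int[] nodes, int count, ReadWriteLock lock) {
        this.nodes = nodes;
        this.count = count;
        this.lock = lock;
        lock.readLock().lock(); // released in close()
      }

      public int size() {
        return count;
      }

      public int neighbor(int i) {
        return nodes[i];
      }

      @Override
      public void close() {
        lock.readLock().unlock();
      }
    }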
That sounds great, thanks Mike! Please post the PR and I will try to see how to put it together. Meanwhile, let's try to get the prerequisite PR reviewed and pushed first, so that this one can look cleaner.
Good Job
OK, I posted #12683. I'm curious about the performance measurements. We want to know how much lock contention there is, so ideally we'd compare total time to merge using a single thread vs using this. Did your measurement cover the entire indexing process, including indexing + flushes (not merges)? Oh wait, I see you distinguished force merge vs the whole thing. It's great that we can go faster, but I feel we ought to strive to do better than 1/2 per thread (4x speedup for 8 threads). Maybe once we are happy with the code we can test 2 threads, since there ought to be less contention? Although frankly 4x speedup is great! But I'd also wonder how much CPU we are using then: is it 8x CPU usage too, or more likely do some of those threads spend time waiting? Also, how large was your index? I think we'd expect greater speedups for larger indexes.
Thanks @msokolov, I'll take a look soon.
The fairest comparison I posted here might be the force merge time for 100k docs, which is 20012ms vs 97579ms. Since there should be no merging happening during normal indexing due to index size (around 300mb and a per-segment 50mb buffer), all the conditions are the same, but we're still counting the flush time and preparation time in the comparison. Because the locks are only used in a few places, I'm wondering whether we could simply accumulate the lock waiting time.
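For illustration, one way that lock waiting time could be accumulated, building on the System.nanoTime() call that already appears before the lock acquisition in the diff quoted later in the review; the helper class and LongAdder counter below are my assumptions, not the PR's code.

    import java.util.concurrent.atomic.LongAdder;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative helper (not part of the PR): accumulate the time spent waiting
    // for a NeighborArray write lock across all merge worker threads.
    final class LockWaitStats {
      static final LongAdder WRITE_LOCK_WAIT_NANOS = new LongAdder();

      static void lockTimed(ReentrantReadWriteLock rwlock) {
        long start = System.nanoTime();
        rwlock.writeLock().lock(); // blocks while another worker holds the lock
        WRITE_LOCK_WAIT_NANOS.add(System.nanoTime() - start);
      }
    }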
@msokolov I incorporated your changes about passing in the executor and adding vectors from a range. I also wired up passing the parameters in from the VectorFormat all the way down. cc @benwtrent: thanks for creating the merger abstraction, I used it right away, so if you have some spare time please review this as well! I added some more details to the description that you may want to check before reading the code.
This is awesome. I am so happy it's a clean change without tons of complexity and we still get a 4x speedup with additional threads. I will give it a review this weekend or early next week.
msokolov left a comment
mostly minor comments - looks exciting! I would be curious to see the contention times and also understand how this changes CPU usage vs. single-threaded.
(Inline review comments on Lucene95HnswVectorsWriter.java, HnswConcurrentMergeBuilder.java, HnswGraphMerger.java, IncrementalHnswGraphMerger.java, OnHeapHnswGraph.java, and HnswGraphTestCase.java, all resolved.)
@msokolov as for CPU usage, I just tested with 1M docs, and on my laptop I can see that when it's doing forceMerge the CPU constantly ranges from 790~810%, while if a merge triggers while adding documents, the CPU can go up to 900%. I would like to see how it scales to, say, 20 or even 40 cores, but unfortunately this laptop with 10 cores (including 2 efficiency cores) is the only one I have for now... BTW the forceMerge of the 1M index took
benwtrent left a comment
Comments on API design. I am not sure if this is considered a "final draft for review" or if what's there is still just a stop-gap for testing.
(Inline review comment on Lucene95HnswVectorsFormat.java, resolved.)
    long start = System.nanoTime();
    nbrsOfNbr.rwlock.writeLock().lock();
My small concern here is that now we are locking even when no executor is being used and we know it's a single thread. How badly are we hitting users that aren't using multi-threaded merge?
So when I'm testing, it's like 20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors. So I think it's not bad, given that the HNSW build itself is much slower?
20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors.
Does contention grow linearly? Or is it logarithmic?
I had previously posted a NeighborIterator abstraction that was designed to hide the locking (somewhat) from the HnswSearcher/Builder. That way we don't do any locking in the serial implementation. But @zhaih pointed out it was too aggressively holding the lock around a large critical section. I think we could enhance the concurrent implementation of that abstraction by having it copy the neighbors as this PR does, and then release the lock immediately after copying rather than holding the lock for the entire iteration.
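A minimal sketch of that copy-then-release variant, with illustrative names (not the PR's code): the read lock covers only the copy, so the caller iterates over a private snapshot after the lock has already been released.

    import java.util.Arrays;
    import java.util.concurrent.locks.ReadWriteLock;

    // Sketch of the "copy under the lock, then release" idea: the critical
    // section covers only the copy, never the caller's iteration.
    final class NeighborSnapshots {
      static int[] snapshot(int[] nodes, int count, ReadWriteLock lock) {
        lock.readLock().lock();
        try {
          return Arrays.copyOf(nodes, count); // cheap copy while locked
        } finally {
          lock.readLock().unlock(); // released before anyone iterates the snapshot
        }
      }
    }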
Does contention grow linearly? Or is it logarithmic?
@benwtrent If you mean the 20ms, it's growing linearly with the number of nodes we insert, as it's basically the overhead we pay for acquiring the lock for each neighbor. It's not contention though, because it's single-threaded execution.
@msokolov Yeah, that solves the searcher's part; however, I think the more complex one is on the writer side, where, as I mentioned below, we need to group multiple methods together in a "transaction".
I think in your PR you used a returned Closeable to kind of "hide" the lock, but I think that's essentially the same as exposing the lock?
    float[] score;
    int[] node;
    private int sortedNodeSize;
    public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);
Why don't we have a ConcurrentNeighborArray? Always requiring locking by the external caller is not good. NeighborArray needs to change its API so that external callers don't have to worry about the lock.
I think we can hide the lock inside the NeighborArray by adding some APIs, but it's hard to make it a totally thread-safe class given how we use it right now: for example, when we add a neighbor and need to purge the non-diverse ones, we're adding -> sorting -> calculating non-diverse -> removing. And it's not enough to just ensure each operation is atomic, because someone can always add a node after we sort and make things more complex. So even if I create this ConcurrentNeighborArray and every operation is nicely protected, the caller still needs to be extra careful...
But with the caller controlling the lock, it's much easier to control this "transaction" behavior and only take the lock when needed.
I'm not saying this is good, but given that this is already a very hacky class (even before the lock) and internal to HNSW only, maybe we can leave it as a follow-up?
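To make the "transaction" point concrete, here is a hypothetical, self-contained illustration (not the real NeighborArray API): only a lock held across the whole add, sort, and prune sequence keeps another thread from slipping in between the steps.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical illustration: add -> sort -> drop worst must appear atomic,
    // so locking each individual method would not be enough; the write lock has
    // to span the whole sequence.
    final class LockedNeighborList {
      final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
      private final List<Integer> nodes = new ArrayList<>();

      void addAndPrune(int newNode, int maxSize) {
        rwlock.writeLock().lock();
        try {
          nodes.add(newNode);
          nodes.sort(Integer::compare); // stand-in for score-based sorting
          if (nodes.size() > maxSize) {
            nodes.remove(nodes.size() - 1); // stand-in for removing a non-diverse neighbor
          }
        } finally {
          rwlock.writeLock().unlock();
        }
      }
    }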
We can make it a refactor later, but I am convinced we will want a separate *NeighborArray class for concurrent vs. serial graph building. There is no technical reason to slow down serial graph building, and I think we can make the concurrent one even nicer by freeing it from having to even think about serial graph building.
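A sketch of the split suggested above, using illustrative classes rather than Lucene's actual NeighborArray: the serial variant pays no locking cost at all, while the concurrent subclass owns its lock. Note this only protects individual operations; the multi-step "transaction" discussed earlier would still need a coarser-grained method or lock.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    /** Serial variant: no locking at all. Illustrative only, not Lucene's API. */
    class SimpleNeighborArray {
      protected int[] nodes;
      protected int size;

      SimpleNeighborArray(int maxSize) {
        nodes = new int[maxSize];
      }

      void add(int node) {
        nodes[size++] = node;
      }

      int get(int i) {
        return nodes[i];
      }
    }

    /** Concurrent variant that owns its lock, so callers never see it. */
    final class ConcurrentSimpleNeighborArray extends SimpleNeighborArray {
      private final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);

      ConcurrentSimpleNeighborArray(int maxSize) {
        super(maxSize);
      }

      @Override
      void add(int node) {
        rwlock.writeLock().lock();
        try {
          super.add(node);
        } finally {
          rwlock.writeLock().unlock();
        }
      }

      @Override
      int get(int i) {
        rwlock.readLock().lock();
        try {
          return super.get(i);
        } finally {
          rwlock.readLock().unlock();
        }
      }
    }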
Yeah I don't have a very good idea for that yet, I'll for sure create an issue after this is merged.
It seems like we need a ConcurrentHNSWGraphBuilder. It can use much of the same logic, but it instead uses concurrent neighbors. But that can be a later refactor.
@msokolov @benwtrent I removed almost all. @benwtrent please check whether the rebase and modification, especially the scalar-quantized part, are done correctly. I did it according to my understanding, but that is not necessarily correct :) I'll do another round of benchmark runs tomorrow to make sure things are still intact.
    // System.out.println("addVectors [" + minOrd + " " + maxOrd + ") initialized.size=" +
    //     initializedNodes.size());
all these comments should probably be removed.
Good catch, removed!
(Inline review comments on Lucene99HnswVectorsFormat.java and Lucene99HnswVectorsWriter.java, resolved.)
I reran the benchmark with the current code and it still looks good; both the recall and latency remain the same. The only unresolved issue right now is the naming of the interface, I'm aware that
benwtrent left a comment
@zhaih could you create an issue to track optimizing the single-threaded case to not use locks?
As for the renaming, maybe we can look to do it as a follow-up PR? IHnswGraphSearcher -> HnswGraphSearcher? Or maybe we can just use HnswSearcher right now? Re: controlling concurrency, maybe eventually we figure out how to do this as part of ConcurrentMergeScheduler? Since it is all about merging, that seems like a logical place. It currently allows setting
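For reference, a rough sketch of how a merge executor and worker count might be wired into the vectors format once this is released; the exact Lucene99HnswVectorsFormat constructor arguments below are my assumption, so check the released javadocs before relying on this.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.KnnVectorsFormat;
    import org.apache.lucene.codecs.lucene99.Lucene99Codec;
    import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
    import org.apache.lucene.index.IndexWriterConfig;

    // Assumed constructor: (maxConn, beamWidth, numMergeWorkers, mergeExec).
    public class ConcurrentHnswMergeConfig {
      public static IndexWriterConfig newConfig() {
        ExecutorService mergeExec = Executors.newFixedThreadPool(8);
        Codec codec =
            new Lucene99Codec() {
              @Override
              public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                // maxConn=16, beamWidth=100, 8 merge workers sharing mergeExec
                return new Lucene99HnswVectorsFormat(16, 100, 8, mergeExec);
              }
            };
        return new IndexWriterConfig().setCodec(codec);
      }
    }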
msokolov left a comment
Looking good, I think it's ready to merge.
Thanks Mike and Ben for reviewing! I'll try to merge and backport it tomorrow, and create an issue for future refactoring.
FYI I pushed a commit 4576ae0 to fix the behavior where the vector writer closed the executor passed in.
For my understanding, how does this play out with merge throttling? Could we end up sleeping within tasks that get passed to the configured executor service?
I think we rate-limit the IndexOutput only in CMS? If my understanding is correct, then I think this is not affected by throttling, as it's all in-memory operations and we write to disk after all the tasks are completed.
…graph build (#12799) Make the TaskExecutor public which is currently pkg-private. At indexing time we concurrently create the hnsw graph (Concurrent HNSW Merge #12660). We could use the TaskExecutor implementation to do this for us. Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
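A minimal sketch of the TaskExecutor#invokeAll pattern referenced above, assuming the now-public TaskExecutor(Executor) constructor; the worker body is a placeholder, not HnswConcurrentMergeBuilder's actual logic.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.search.TaskExecutor;

    public class InvokeAllSketch {
      public static void runWorkers(int numWorkers) throws IOException {
        ExecutorService exec = Executors.newFixedThreadPool(numWorkers);
        try {
          TaskExecutor taskExecutor = new TaskExecutor(exec);
          List<Callable<Void>> tasks = new ArrayList<>();
          for (int i = 0; i < numWorkers; i++) {
            int worker = i;
            tasks.add(
                () -> {
                  // each worker would add its slice of vectors to the shared graph here
                  System.out.println("worker " + worker + " finished");
                  return null;
                });
          }
          taskExecutor.invokeAll(tasks); // blocks until every worker completes
        } finally {
          exec.shutdown();
        }
      }
    }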
Lucene PR apache/lucene#12660 introduced concurrent HNSW merge and PR apache/lucene#13124 made ConcurrentMergeScheduler provide an intra-merge executor. But to enable concurrent intra-merge for HNSW graphs we still need to provide numMergeWorkers to the codec definition. This PR does the following:
- provides numMergeWorkers to the HNSW vectors codec definitions, where numMergeWorkers is set from the node's index.merge.scheduler.max_thread_count property
- enables concurrent merge only for force merge operations.
Description
This PR still contains all kinds of magic numbers and names I need to improve. But I think we can start reviewing it, as it is not that hacky and I think it is on the more acceptable side. For sure I will add quite a few more javadocs to make the code readable, especially the add-graph-node part.
The main idea is that we merely need to care about the thread safety of OnHeapHnswGraph after #12651 (still, the entry node election is a bit complex), so I just create one RW lock for each NeighborArray and lock them when necessary. The main change happens in HnswGraphBuilder#addGraphNode, where I changed the sequence of updating the graph such that inserting a new node won't impact searching the previous graph. Then I just hardcoded a thread pool in the addVectors method to test the change, and it shows a good 4x speedup on forceMerge when I'm using 8 threads, and 3x overall indexing speed with 1M docs.
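For illustration, a simplified sketch (not the actual PR code) of those two ideas: each per-node neighbor list carries its own read-write lock, and a new node is published with its outgoing edges already complete before any existing node links back to it, so a concurrent search never reaches a half-built node.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    final class GraphSketch {
      static final class Neighbors {
        final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
        final List<Integer> nodes = new ArrayList<>();
      }

      private final ConcurrentHashMap<Integer, Neighbors> graph = new ConcurrentHashMap<>();

      void addGraphNode(int node, List<Integer> candidates) {
        // 1) publish the new node with its outgoing edges already complete
        Neighbors mine = new Neighbors();
        mine.nodes.addAll(candidates);
        graph.put(node, mine);

        // 2) then link existing nodes back to it, taking one write lock at a time
        for (int cand : candidates) {
          Neighbors theirs = graph.get(cand); // assumed to exist already in this sketch
          theirs.rwlock.writeLock().lock();
          try {
            theirs.nodes.add(node);
          } finally {
            theirs.rwlock.writeLock().unlock();
          }
        }
      }
    }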
More details
A new ConcurrentHnswMerger creates the HnswConcurrentMergeBuilder, which contains several workers (ConcurrentMergeWorker); essentially each one is a graph builder, and we spawn one thread for each of them. The main logic is still just as stated in the description: I make the OnHeapHnswGraph mostly thread-safe, and modified the way we add the graph node, from level 0 going up, such that when a search is initiated concurrently, this newly added node will not lead to a dead end (because if the new node is discovered, it is guaranteed to already have all its outgoing edges and is guaranteed to be able to go down to the lower level).
Promoting the entry node is a bit tricky because essentially we need to change both the node and the entry level at the same time. So I made a new class and use an AtomicReference to switch it. It seems to be working OK.
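A minimal sketch of that promotion scheme, with illustrative names (not the PR's class): the node and its level live in one immutable object, so both switch atomically with compareAndSet.

    import java.util.concurrent.atomic.AtomicReference;

    final class EntryPoint {
      record EntryNode(int node, int level) {}

      private final AtomicReference<EntryNode> entry =
          new AtomicReference<>(new EntryNode(-1, -1));

      /** Promote only if the new node sits on a strictly higher level than the current entry. */
      boolean maybePromote(int node, int level) {
        EntryNode cur;
        do {
          cur = entry.get();
          if (level <= cur.level()) {
            return false; // another thread already installed an equal or higher entry point
          }
        } while (!entry.compareAndSet(cur, new EntryNode(node, level)));
        return true;
      }
    }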
One (I think the worst) side effect of this change is that we have to manually set the entry node: because we add the node to the graph from the topmost level, then add neighbors to the node starting from level 0, we only want to set the new node as the entry node after we have done all this, so it cannot be set automatically when the node is first added at the highest level.
Benchmark Test
Setup
50MB writer buffer, force-merge to 1 segment at the end. 384-dimensional vectors.
Candidate
Update: 100k docs with contention time information (note the first few times were for single-thread overhead of locks)
Baseline