Concurrent HNSW Merge #12660
Conversation
This looks like a great start. I worked up a very similar PR but I think I have some concurrency bugs around the update/add of new entry points in the upper graph levels. I might post mine anyway because it has some syntax sugar I think would be nice to bring over here. One idea I had was to create a NeighborArrayIterator that would handle read-locking in the OnHeapGraph case with another implementation that just encapsulates the iteration with no locking for use by other (off-heap) HnswGraph. With that and a little refactoring we can use the same HnswGraphSearcher for both cases?
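For illustration, a minimal sketch of the iterator idea described above, with hypothetical names and APIs (this is not code from the PR or from Lucene): one implementation iterates a read-only (off-heap) neighbor list with no locking, while the other holds a read lock on the on-heap list and releases it via close().

    import java.io.Closeable;
    import java.util.concurrent.locks.ReadWriteLock;

    /** Sketch only: names and APIs are illustrative, not the PR's code. */
    interface NeighborIterator extends Closeable {
      int size();

      int neighbor(int i);

      @Override
      default void close() {} // no-op for the lock-free case
    }

    /** For read-only (off-heap) graphs: plain iteration, no locking. */
    record PlainNeighbors(int[] nodes, int count) implements NeighborIterator {
      public int size() {
        return count;
      }

      public int neighbor(int i) {
        return nodes[i];
      }
    }

    /** For the on-heap graph being built: hold the read lock until close(). */
    final class LockedNeighbors implements NeighborIterator {
      private final int[] nodes;
      private final int count;
      private final ReadWriteLock lock;

      LockedNeighbors(int[] nodes, int count, ReadWriteLock lock) {
        this.nodes = nodes;
        this.count = count;
        this.lock = lock;
        lock.readLock().lock(); // released in close()
      }

      public int size() {
        return count;
      }

      public int neighbor(int i) {
        return nodes[i];
      }

      @Override
      public void close() {
        lock.readLock().unlock();
      }
    }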
That sounds great, thanks Mike! Please post the PR and I will try to see how to put it together. Meanwhile, let's try to get the prerequisite PR reviewed and pushed first, so that this one can look cleaner.
Good Job
OK, I posted #12683. I'm curious about the performance measurements. We want to know how much lock contention there is, so ideally we'd compare total time to merge using a single thread vs using this. Did your measurement cover the entire indexing process, including indexing + flushes (not merges)? Oh wait, I see you distinguished force merge vs the whole thing. It's great that we can go faster, but I feel we ought to strive to do better than 1/2 per thread (4x speedup for 8 threads). Maybe once we are happy with the code we can test 2 threads, since there ought to be less contention? Although frankly 4x speedup is great! But I'd also wonder how much CPU we are using then: is it 8x CPU usage too, or more likely do some of those threads spend time waiting? Also, how large was your index? I think we'd expect greater speedups for larger indexes.
Thanks @msokolov, I'll take a look soon.
The fairest comparison I posted here might be the force merge time for 100k docs, which is 20012ms vs 97579ms. Since there should be no merging happening during normal indexing due to index size (around 300mb and a per-segment 50mb buffer), all the conditions are the same, but we're still counting the flush time and preparation time in the comparison. Because the locks are only used in a few places, I'm wondering whether we could simply accumulate the lock waiting time.
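For illustration, one way that lock waiting time could be accumulated, building on the System.nanoTime() call that already appears before the lock acquisition in the diff quoted later in the review; the helper class and LongAdder counter below are my assumptions, not the PR's code.

    import java.util.concurrent.atomic.LongAdder;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative helper (not part of the PR): accumulate the time spent waiting
    // for a NeighborArray write lock across all merge worker threads.
    final class LockWaitStats {
      static final LongAdder WRITE_LOCK_WAIT_NANOS = new LongAdder();

      static void lockTimed(ReentrantReadWriteLock rwlock) {
        long start = System.nanoTime();
        rwlock.writeLock().lock(); // blocks while another worker holds the lock
        WRITE_LOCK_WAIT_NANOS.add(System.nanoTime() - start);
      }
    }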
@msokolov I incorporated your changes about passing in the executor and adding vectors from a range. I also wired up passing the parameters in from the VectorFormat all the way down. cc @benwtrent: thanks for creating the merger abstraction, I used it right away, so if you have some spare time please review this as well! I added some more details to the description that you may want to check before reading the code.
This is awesome. I am so happy it's a clean change without tons of complexity and we still get a 4x speedup with additional threads. I will give it a review this weekend or early next week.
msokolov left a comment
mostly minor comments - looks exciting! I would be curious to see the contention times and also understand how this changes CPU usage vs. single-threaded.
(Inline review comments on Lucene95HnswVectorsWriter.java, HnswConcurrentMergeBuilder.java, HnswGraphMerger.java, IncrementalHnswGraphMerger.java, OnHeapHnswGraph.java, and HnswGraphTestCase.java, all resolved.)
@msokolov as for CPU usage, I just tested with 1M docs, and on my laptop I can see that when it's doing forceMerge the CPU constantly ranges from 790~810%, while if a merge triggers while adding documents, the CPU can go up to 900%. I would like to see how it scales to, say, 20 or even 40 cores, but unfortunately this laptop with 10 cores (including 2 efficiency cores) is the only one I have for now... BTW the forceMerge of the 1M index took
benwtrent left a comment
Comments on API design. I am not sure if this is considered a "final draft for review" or if what's there is still just a stop-gap for testing.
(Inline review comment on Lucene95HnswVectorsFormat.java, resolved.)
    long start = System.nanoTime();
    nbrsOfNbr.rwlock.writeLock().lock();
My small concern here is that now we are locking even when no executor is being used and we know it's a single thread. How badly are we hitting users that aren't using multi-threaded merge?
So when I'm testing, it's like 20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors. So I think it's not bad, given that the HNSW build itself is much slower?
20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors.
Does contention grow linearly? Or is it logarithmic?
I had previously posted a NeighborIterator abstraction that was designed to hide the locking (somewhat) from the HnswSearcher/Builder. That way we don't do any locking in the serial implementation. But @zhaih pointed out it was too aggressively holding the lock around a large critical section. I think we could enhance the concurrent implementation of that abstraction by having it copy the neighbors as this PR does, and then release the lock immediately after copying rather than holding the lock for the entire iteration.
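A minimal sketch of that copy-then-release variant, with illustrative names (not the PR's code): the read lock covers only the copy, so the caller iterates over a private snapshot after the lock has already been released.

    import java.util.Arrays;
    import java.util.concurrent.locks.ReadWriteLock;

    // Sketch of the "copy under the lock, then release" idea: the critical
    // section covers only the copy, never the caller's iteration.
    final class NeighborSnapshots {
      static int[] snapshot(int[] nodes, int count, ReadWriteLock lock) {
        lock.readLock().lock();
        try {
          return Arrays.copyOf(nodes, count); // cheap copy while locked
        } finally {
          lock.readLock().unlock(); // released before anyone iterates the snapshot
        }
      }
    }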
Does contention grow linearly? Or is it logarithmic?
@benwtrent If you mean the 20ms, it's growing linearly with the number of nodes we insert, as it's basically the overhead we pay for acquiring the lock for each neighbor. It's not contention though, because it's single-threaded execution.
@msokolov Yeah, that solves the searcher's part; however, I think the more complex one is on the writer side, where, as I mentioned below, we need to group multiple methods together in a "transaction".
I think in your PR you used a returned Closeable to kind of "hide" the lock, but I think that's essentially the same as exposing the lock?
    float[] score;
    int[] node;
    private int sortedNodeSize;
    public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);
Why don't we have a ConcurrentNeighborArray? Always requiring locking by the external caller is not good. NeighborArray needs to change its API so that external callers don't have to worry about the lock.
I think we can hide the lock inside the NeighborArray by adding some APIs, but it's hard to make it a totally thread-safe class given how we use it right now: for example, when we add a neighbor and need to purge the non-diverse ones, we're adding -> sorting -> calculating non-diverse -> removing. And it's not enough to just ensure each operation is atomic, because someone can always add a node after we sort and make things more complex. So even if I create this ConcurrentNeighborArray and every operation is nicely protected, the caller still needs to be extra careful...
But with the caller controlling the lock, it's much easier to control this "transaction" behavior and only take the lock when needed.
I'm not saying this is good, but given that this is already a very hacky class (even before the lock) and internal to HNSW only, maybe we can leave it as a follow-up?
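To make the "transaction" point concrete, here is a hypothetical, self-contained illustration (not the real NeighborArray API): only a lock held across the whole add, sort, and prune sequence keeps another thread from slipping in between the steps.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical illustration: add -> sort -> drop worst must appear atomic,
    // so locking each individual method would not be enough; the write lock has
    // to span the whole sequence.
    final class LockedNeighborList {
      final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
      private final List<Integer> nodes = new ArrayList<>();

      void addAndPrune(int newNode, int maxSize) {
        rwlock.writeLock().lock();
        try {
          nodes.add(newNode);
          nodes.sort(Integer::compare); // stand-in for score-based sorting
          if (nodes.size() > maxSize) {
            nodes.remove(nodes.size() - 1); // stand-in for removing a non-diverse neighbor
          }
        } finally {
          rwlock.writeLock().unlock();
        }
      }
    }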
We can make it a refactor later, but I am convinced we will want a separate *NeighborArray class for concurrent vs. serial graph building. There is no technical reason to slow down serial graph building, and I think we can make the concurrent one even nicer by freeing it from having to even think about serial graph building.
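A sketch of the split suggested above, using illustrative classes rather than Lucene's actual NeighborArray: the serial variant pays no locking cost at all, while the concurrent subclass owns its lock. Note this only protects individual operations; the multi-step "transaction" discussed earlier would still need a coarser-grained method or lock.

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    /** Serial variant: no locking at all. Illustrative only, not Lucene's API. */
    class SimpleNeighborArray {
      protected int[] nodes;
      protected int size;

      SimpleNeighborArray(int maxSize) {
        nodes = new int[maxSize];
      }

      void add(int node) {
        nodes[size++] = node;
      }

      int get(int i) {
        return nodes[i];
      }
    }

    /** Concurrent variant that owns its lock, so callers never see it. */
    final class ConcurrentSimpleNeighborArray extends SimpleNeighborArray {
      private final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);

      ConcurrentSimpleNeighborArray(int maxSize) {
        super(maxSize);
      }

      @Override
      void add(int node) {
        rwlock.writeLock().lock();
        try {
          super.add(node);
        } finally {
          rwlock.writeLock().unlock();
        }
      }

      @Override
      int get(int i) {
        rwlock.readLock().lock();
        try {
          return super.get(i);
        } finally {
          rwlock.readLock().unlock();
        }
      }
    }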
Yeah I don't have a very good idea for that yet, I'll for sure create an issue after this is merged.
It seems like we need a ConcurrentHNSWGraphBuilder. It can use much of the same logic, but it instead uses concurrent neighbors. But that can be a later refactor.
@msokolov @benwtrent I removed almost all. @benwtrent please check whether the rebase and modification, especially the scalar-quantized part, are done correctly. I did it according to my understanding, but that is not necessarily correct :) I'll do another round of benchmark runs tomorrow to make sure things are still intact.
    // System.out.println("addVectors [" + minOrd + " " + maxOrd + ") initialized.size=" +
    //     initializedNodes.size());
all these comments should probably be removed.
Good catch, removed!
(Inline review comments on Lucene99HnswVectorsFormat.java and Lucene99HnswVectorsWriter.java, resolved.)
I reran the benchmark with the current code and it still looks good; both the recall and latency remain the same. The only unresolved issue right now is the naming of the interface, I'm aware that
benwtrent left a comment
@zhaih could you create an issue to track optimizing the single-threaded case to not use locks?
As for the renaming, maybe we can look to do it as a follow-up PR? IHnswGraphSearcher -> HnswGraphSearcher? Or maybe we can just use HnswSearcher right now? Re: controlling concurrency, maybe eventually we figure out how to do this as part of ConcurrentMergeScheduler? Since it is all about merging, that seems like a logical place. It currently allows setting
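For reference, a rough sketch of how a merge executor and worker count might be wired into the vectors format once this is released; the exact Lucene99HnswVectorsFormat constructor arguments below are my assumption, so check the released javadocs before relying on this.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.KnnVectorsFormat;
    import org.apache.lucene.codecs.lucene99.Lucene99Codec;
    import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;
    import org.apache.lucene.index.IndexWriterConfig;

    // Assumed constructor: (maxConn, beamWidth, numMergeWorkers, mergeExec).
    public class ConcurrentHnswMergeConfig {
      public static IndexWriterConfig newConfig() {
        ExecutorService mergeExec = Executors.newFixedThreadPool(8);
        Codec codec =
            new Lucene99Codec() {
              @Override
              public KnnVectorsFormat getKnnVectorsFormatForField(String field) {
                // maxConn=16, beamWidth=100, 8 merge workers sharing mergeExec
                return new Lucene99HnswVectorsFormat(16, 100, 8, mergeExec);
              }
            };
        return new IndexWriterConfig().setCodec(codec);
      }
    }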
msokolov left a comment
Looking good, I think it's ready to merge.
Thanks Mike and Ben for reviewing! I'll try to merge and backport it tomorrow, and create an issue for future refactoring.
FYI I pushed a commit 4576ae0 to fix the behavior where the vector writer closed the executor passed in.
For my understanding, how does this play out with merge throttling? Could we end up sleeping within tasks that get passed to the configured executor service?
I think we rate-limit the IndexOutput only in CMS? If my understanding is correct, then I think this is not affected by throttling, as it's all in-memory operations and we write to disk after all the tasks are completed.
…graph build (#12799) Make the TaskExecutor public which is currently pkg-private. At indexing time we concurrently create the hnsw graph (Concurrent HNSW Merge #12660). We could use the TaskExecutor implementation to do this for us. Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
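A minimal sketch of the TaskExecutor#invokeAll pattern referenced above, assuming the now-public TaskExecutor(Executor) constructor; the worker body is a placeholder, not HnswConcurrentMergeBuilder's actual logic.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.search.TaskExecutor;

    public class InvokeAllSketch {
      public static void runWorkers(int numWorkers) throws IOException {
        ExecutorService exec = Executors.newFixedThreadPool(numWorkers);
        try {
          TaskExecutor taskExecutor = new TaskExecutor(exec);
          List<Callable<Void>> tasks = new ArrayList<>();
          for (int i = 0; i < numWorkers; i++) {
            int worker = i;
            tasks.add(
                () -> {
                  // each worker would add its slice of vectors to the shared graph here
                  System.out.println("worker " + worker + " finished");
                  return null;
                });
          }
          taskExecutor.invokeAll(tasks); // blocks until every worker completes
        } finally {
          exec.shutdown();
        }
      }
    }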
Lucene PR apache/lucene#12660 introduced concurrent HNSW merge and PR apache/lucene#13124 made ConcurrentMergeScheduler provide an intra-merge executor. But to enable concurrent intra-merge for HNSW graphs we still need to provide numMergeWorkers to the codec definition. This PR does the following:
- provides numMergeWorkers to the HNSW vectors codec definitions, where numMergeWorkers is set from the node's index.merge.scheduler.max_thread_count property
- enables concurrent merge only for force merge operations.
Description
This PR still contains all kinds of magic numbers and names I need to improve. But I think we can start reviewing it, as it is not that hacky and I think it is on the more acceptable side. For sure I will add quite a few more javadocs to make the code readable, especially the add-graph-node part.
The main idea is that we merely need to care about the thread safety of OnHeapHnswGraph after #12651 (still, the entry node election is a bit complex), so I just create one RW lock for each NeighborArray and lock them when necessary. The main change happens in HnswGraphBuilder#addGraphNode, where I changed the sequence of updating the graph such that inserting a new node won't impact searching the previous graph. Then I just hardcoded a thread pool in the addVectors method to test the change, and it shows a good 4x speedup on forceMerge when I'm using 8 threads, and 3x overall indexing speed with 1M docs.
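For illustration, a simplified sketch (not the actual PR code) of those two ideas: each per-node neighbor list carries its own read-write lock, and a new node is published with its outgoing edges already complete before any existing node links back to it, so a concurrent search never reaches a half-built node.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    final class GraphSketch {
      static final class Neighbors {
        final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
        final List<Integer> nodes = new ArrayList<>();
      }

      private final ConcurrentHashMap<Integer, Neighbors> graph = new ConcurrentHashMap<>();

      void addGraphNode(int node, List<Integer> candidates) {
        // 1) publish the new node with its outgoing edges already complete
        Neighbors mine = new Neighbors();
        mine.nodes.addAll(candidates);
        graph.put(node, mine);

        // 2) then link existing nodes back to it, taking one write lock at a time
        for (int cand : candidates) {
          Neighbors theirs = graph.get(cand); // assumed to exist already in this sketch
          theirs.rwlock.writeLock().lock();
          try {
            theirs.nodes.add(node);
          } finally {
            theirs.rwlock.writeLock().unlock();
          }
        }
      }
    }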
More details
A new ConcurrentHnswMerger creates the HnswConcurrentMergeBuilder, which contains several workers (ConcurrentMergeWorker); essentially each one is a graph builder, and we spawn one thread for each of them. The main logic is still just as stated in the description: I make the OnHeapHnswGraph mostly thread-safe, and modified the way we add the graph node, from level 0 going up, such that when a search is initiated concurrently, this newly added node will not lead to a dead end (because if the new node is discovered, it is guaranteed to already have all its outgoing edges and is guaranteed to be able to go down to the lower level).
Promoting the entry node is a bit tricky because essentially we need to change both the node and the entry level at the same time. So I made a new class and use an AtomicReference to switch it. It seems to be working OK.
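A minimal sketch of that promotion scheme, with illustrative names (not the PR's class): the node and its level live in one immutable object, so both switch atomically with compareAndSet.

    import java.util.concurrent.atomic.AtomicReference;

    final class EntryPoint {
      record EntryNode(int node, int level) {}

      private final AtomicReference<EntryNode> entry =
          new AtomicReference<>(new EntryNode(-1, -1));

      /** Promote only if the new node sits on a strictly higher level than the current entry. */
      boolean maybePromote(int node, int level) {
        EntryNode cur;
        do {
          cur = entry.get();
          if (level <= cur.level()) {
            return false; // another thread already installed an equal or higher entry point
          }
        } while (!entry.compareAndSet(cur, new EntryNode(node, level)));
        return true;
      }
    }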
One (I think the worst) side effect of this change is that we have to manually set the entry node: because we add the node to the graph from the topmost level, then add neighbors to the node starting from level 0, we only want to set the new node as the entry node after we have done all this, so it cannot be set automatically when the node is first added at the highest level.
Benchmark Test
Setup
50MB writer buffer, force-merge to 1 segment at the end. 384-dimensional vectors.
Candidate
Update: 100k docs with contention time information (note the first few times were for single-thread overhead of locks)
Baseline