Conversation

@zhaih
Contributor

@zhaih zhaih commented Oct 12, 2023

Description

This PR still contains all kinds of magic numbers and names I need to improve, but I think we can start reviewing it, as it is not that hacky and I think it is on the acceptable side. For sure I will add quite a few more javadocs to make the code readable, especially the add-graph-node part.

The main idea is that after #12651 we mostly only need to care about the thread safety of OnHeapHnswGraph (still, the entry node election is a bit complex), so I just create one RW lock for each NeighborArray and lock them when necessary.

The main change happens in HnswGraphBuilder#addGraphNode, where I changed the sequence of updating the graph such that inserting a new node won't impact searches over the previous graph; then I just hardcoded a thread pool in the addVectors method to test the change. It shows a good 4x speedup on forceMerge when I'm using 8 threads, and 3x overall indexing speed with 1M docs.
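A minimal sketch of the per-NeighborArray locking idea described above (names and structure are illustrative, not the actual PR code): readers copy the neighbor list under the read lock and release it immediately, while writers mutate under the write lock.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for a NeighborArray guarded by its own
// read-write lock, as the PR does per graph node.
class LockedNeighbors {
  private int[] nodes = new int[0];
  final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

  // Readers (searchers) copy the neighbor list under the read lock and
  // release it immediately, so iteration happens without holding the lock.
  int[] copyNeighbors() {
    rwlock.readLock().lock();
    try {
      return nodes.clone();
    } finally {
      rwlock.readLock().unlock();
    }
  }

  // Writers (builders) mutate under the write lock.
  void addNeighbor(int node) {
    rwlock.writeLock().lock();
    try {
      int[] grown = new int[nodes.length + 1];
      System.arraycopy(nodes, 0, grown, 0, nodes.length);
      grown[nodes.length] = node;
      nodes = grown;
    } finally {
      rwlock.writeLock().unlock();
    }
  }
}
```

Copying under the read lock keeps the critical section tiny, which is why the reported single-thread overhead stays small.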

More details

A new ConcurrentHnswMerger creates the HnswConcurrentMergeBuilder, which contains several workers (ConcurrentMergeWorker), each of which is essentially a graph builder, and we spawn one thread for each of them.

The main logic is still as stated in the description: I make the OnHeapHnswGraph mostly thread-safe, and modified the way we add a graph node so that its neighbors are linked starting from level 0 and going up. That way, when a search runs concurrently, the newly added node will not lead to a dead end (if the new node is discovered, it is guaranteed to already have all of its outgoing edges and to be able to descend to the lower levels).

Promoting the entry node is a bit tricky because essentially we need to change both the node and the entry level at the same time, so I made a new class and use an AtomicReference to swap it. It seems to be working OK.

One side effect of this change (I think the worst one) is that we have to set the entry node manually: because we add the node to the graph at the topmost level first, but then add neighbors to it starting from level 0, we only want to make the new node the entry node after all of this is done, so it cannot be set automatically when the node is first added at the highest level.
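The entry-node promotion described above can be sketched like this (a hypothetical simplification, not the PR's actual class): both the node and its level live in one immutable holder that is swapped atomically, so a reader never observes a mismatched pair.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of atomically publishing (entry node, top level)
// together via an AtomicReference to a small holder, as the PR describes.
class EntryPoint {
  // Immutable pair so both fields are always observed consistently.
  record NodeAtLevel(int level, int node) {}

  private final AtomicReference<NodeAtLevel> entry =
      new AtomicReference<>(new NodeAtLevel(-1, -1));

  NodeAtLevel get() {
    return entry.get();
  }

  // Promote only if the candidate's level is strictly higher than the
  // current entry level; retry the CAS on concurrent updates.
  boolean tryPromote(int level, int node) {
    while (true) {
      NodeAtLevel cur = entry.get();
      if (level <= cur.level()) return false;
      if (entry.compareAndSet(cur, new NodeAtLevel(level, node))) return true;
    }
  }
}
```

The promotion is deliberately called only after all of the node's edges are in place, matching the manual entry-node setting described above.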

Benchmark Test

Set up

50MB writer buffer, force-merge to 1 segment at the end, 384-dimension vectors.

Candidate

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms	selectivity	filterType
Force merge done, time: 2 ms
0.838	 0.28	10000	0	64	250	100	3649	1.00	post-filter
Force merge done, time: 20012 ms
0.744	 1.15	100000	0	64	250	100	66680	1.00	post-filter
Force merge done, time: 470161 ms
0.686	12.94	1000000	0	64	250	100	1019300	1.00	post-filter

Update: 100k docs with contention time information (note the first few times were the single-thread overhead of the locks)

Total contention time: 20 ms
Total contention time: 44 ms
Total contention time: 65 ms
Total contention time: 87 ms
Total contention time: 109 ms
Total contention time: 109 ms
Total contention time: 5765 ms
Force merge done, time: 17830 ms
0.741	 1.20	100000	0	64	250	100	65650	1.00	post-filter

Baseline

recall	latency	nDoc	fanout	maxConn	beamWidth	visited	index ms	selectivity	filterType
Force merge done, time: 3 ms
0.838	 0.28	10000	0	64	250	100	3749	1.00	post-filter
Force merge done, time: 97579 ms
0.745	 1.19	100000	0	64	250	100	145355	1.00	post-filter
Force merge done, time: 2059821 ms
0.683	12.81	1000000	0	64	250	100	2902514	1.00	post-filter

@msokolov
Contributor

This looks like a great start. I worked up a very similar PR but I think I have some concurrency bugs around the update/add of new entry points in the upper graph levels. I might post mine anyway because it has some syntax sugar I think would be nice to bring over here. One idea I had was to create a NeighborArrayIterator that would handle read-locking in the OnHeapGraph case with another implementation that just encapsulates the iteration with no locking for use by other (off-heap) HnswGraph. With that and a little refactoring we can use the same HnswGraphSearcher for both cases?
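One way the NeighborArrayIterator idea floated here might look (purely an illustrative sketch; the PR ultimately kept its copy-based approach): a shared interface with a lock-free implementation for off-heap graphs, so a locking variant for OnHeapHnswGraph could be swapped in without changing the searcher.

```java
// Hypothetical sketch of the proposed NeighborArrayIterator abstraction.
// The interface hides whether locking happens underneath.
interface NeighborIterator {
  boolean hasNext();
  int next();
}

// Lock-free variant for off-heap graphs, where no concurrent writer exists;
// a locking variant for OnHeapHnswGraph would implement the same interface.
class PlainNeighborIterator implements NeighborIterator {
  private final int[] nodes;
  private int i = 0;

  PlainNeighborIterator(int[] nodes) {
    this.nodes = nodes;
  }

  public boolean hasNext() {
    return i < nodes.length;
  }

  public int next() {
    return nodes[i++];
  }
}
```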

@zhaih
Contributor Author

zhaih commented Oct 12, 2023 via email

@xjtushilei

Good Job

msokolov pushed a commit to msokolov/lucene that referenced this pull request Oct 15, 2023
@msokolov
Contributor

msokolov commented Oct 15, 2023

OK, I posted #12683. I'm curious about the performance measurements. We want to know how much lock contention there is, so ideally we'd compare the total time to merge using a single thread vs. using this. Did your measurement cover the entire indexing process, including indexing + flushes (not merges)? Oh wait, I see you distinguished force merge vs. the whole thing. It's great that we can go faster, but I feel we ought to strive to do better than 1/2 per thread (4x speedup for 8 threads). Maybe once we are happy with the code we can test 2 threads, since there ought to be less contention? Although frankly a 4x speedup is great! But I'd also wonder how much CPU we are using then: is it 8x CPU usage too, or more likely do some of those threads spend time waiting? Also, how large was your index? I think we'd expect greater speedups for larger indexes.

@zhaih
Contributor Author

zhaih commented Oct 16, 2023

Thanks @msokolov, I'll take a look soon.

so ideally we'd compare total time to merge using single-threaded vs using this.

The fairest comparison I posted here might be the force merge time for 100k docs: 20012ms vs 97579ms. Since there should be no merging happening during normal indexing due to the index size (around 300MB, with a 50MB per-segment buffer), all the conditions are the same, but we're still counting the flush time and preparation time in the comparison.

Because the locks are only used in a few places, I'm wondering whether we could simply accumulate the lock waiting time using System.nanoTime() to get a feel for how much contention there is. But in general a nearly 5x speedup over 8 threads is not bad, I feel.
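The accumulation idea suggested here could look like this (a hypothetical sketch, not PR code): wrap the lock acquisition and add the wait time to a shared counter, which is roughly where the "Total contention time" numbers above come from.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: accumulate time spent waiting for the write lock
// using System.nanoTime(), to estimate contention cheaply.
class TimedLock {
  final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
  final AtomicLong contentionNanos = new AtomicLong();

  void lockWrite() {
    long start = System.nanoTime();
    rwlock.writeLock().lock();
    // Time between requesting and acquiring the lock approximates the wait.
    contentionNanos.addAndGet(System.nanoTime() - start);
  }

  void unlockWrite() {
    rwlock.writeLock().unlock();
  }

  long contentionMillis() {
    return contentionNanos.get() / 1_000_000;
  }
}
```

Note this counts acquisition overhead even with a single thread, which matches the remark that the first few reported times are single-thread lock overhead rather than real contention.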

@zhaih zhaih force-pushed the concurrentKnn branch 3 times, most recently from 12c40f0 to 120c322 Compare October 20, 2023 06:36
@zhaih zhaih marked this pull request as ready for review October 20, 2023 06:36
@zhaih
Contributor Author

zhaih commented Oct 20, 2023

@msokolov I incorporated your change about passing in the executor and adding vectors from a range. I also added the wiring to pass parameters in from the VectorFormat all the way down.
For the NeighborIterator I still need to think a bit; the part I'm a little unsure about is that if we keep the read lock held and just pass an iterator out (without copying the array), it may create more contention. So for that part I'm still keeping my original implementation.
Anyway, I think all the prerequisite PRs are now checked in, so it's in a reviewable state. Although I still need to add javadocs and tests, I'd like some early feedback on the class structure, interface, and logic improvements. So please review it!

cc @benwtrent — thanks for creating the merger abstraction, I used it right away, so if you have some spare time please review this as well!

I added some more details to the description you may want to check before reading the code.

@benwtrent
Member

This is awesome. I am so happy it's a clean change without tons of complexity, and we still get a 4x speedup with additional threads.

I will give it a review this weekend or early next week.

@zhaih zhaih changed the title [DRAFT] Concurrent HNSW Merge Concurrent HNSW Merge Oct 21, 2023
Contributor

@msokolov msokolov left a comment


mostly minor comments - looks exciting! I would be curious to see the contention times and also understand how this changes CPU usage vs. single-threaded.

@zhaih
Contributor Author

zhaih commented Oct 23, 2023

I would be curious to see the contention times and also understand how this changes CPU usage vs. single-threaded.

@msokolov as for CPU usage, I just tested with 1M docs, and on my laptop I can see that when it's doing forceMerge the CPU constantly ranges from 790~810%, while if a merge triggers while adding documents, the CPU can go to 900%.

I would like to see how it scales to, say, 20 or even 40 cores, but unfortunately this laptop with 10 cores (including 2 efficiency cores) is the only one I have for now...

BTW the forceMerge of the 1M index took 362125 ms, while the contention time for that merge was 81847 ms in CPU time, so on average 10230 ms of wall time and thus ~2.8% of the total time. The array copy time was 12151 ms in CPU time, on average 1519 ms of wall time and thus ~0.4% of the total time. So you're right: compared with the 100k-document result, the contention overhead goes down as the graph gets bigger.

Member

@benwtrent benwtrent left a comment


Comments on API design. I am not sure if this is considered a "final draft for review" or what's just there as a stop-gap for testing still.

Comment on lines 321 to 315
long start = System.nanoTime();
nbrsOfNbr.rwlock.writeLock().lock();
Member

My small concern here is that now we are locking even when no executor is being used and we know it's a single thread. How badly are we hitting users that aren't using multi-threaded merge?

Contributor Author

So when I was testing, it's like 20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors. So I think it's not bad, given that the HNSW build itself is much slower?

Member

20ms of overhead on grabbing this lock for building roughly 20k docs with 384-dim vectors.

Does contention grow linearly? Or is it logarithmic?

Contributor

I had previously posted a NeighborIterator abstraction that was designed to hide the locking (somewhat) from the HnswSearcher/Builder. That way we don't do any locking in the serial implementation. But @zhaih pointed out it was too aggressively holding the lock around a large critical section. I think we could enhance the concurrent implementation of that abstraction by having it copy the neighbors as this PR does, and then release the lock immediately after copying rather than holding the lock for the entire iteration.

Contributor Author

Does contention grow linearly? Or is it logarithmic?

@benwtrent If you mean the 20ms, it grows linearly with the number of nodes we insert, as it's basically the overhead we pay for acquiring the lock for each neighbor. It's not contention though, because it's single-thread execution.

Contributor Author

@msokolov Yeah, that solves the searcher's part; however, I think the more complex one is on the writer side, since as I mentioned below we need to group multiple methods together in a "transaction".
I think in your PR you used a returned Closeable to kind of "hide" the lock, but I think that's essentially the same as exposing the lock?

float[] score;
int[] node;
private int sortedNodeSize;
public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);
Member

Why don't we have a ConcurrentNeighborArray? Always requiring locking by the external caller is not good. NeighborArray needs to change its API so that external callers don't have to worry about the lock.

Contributor Author
The reason will be displayed to describe this comment to others. Learn more.

I think we can hide the lock inside NeighborArray by adding some APIs, but it's hard to make it a totally thread-safe class given how we use it right now: for example, when we add a neighbor and need to purge the non-diverse ones, we're adding -> sorting -> calculating non-diverse -> removing. And it's not enough to just ensure each operation is atomic, because someone can always add a node after we sort and make things more complex. So even if I create this ConcurrentNeighborArray and every operation is nicely protected, the caller still needs to be extra careful...

But with the caller controlling the lock, it's much easier to enforce this "transaction" behavior and to only lock when needed.

I'm not saying this is good, but given that this is already a very hacky class (even before the lock) and internal to HNSW only, maybe we can leave it as a follow-up?
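A sketch of the caller-controlled "transaction" described above (illustrative names; the real diversity check compares neighbor scores, here replaced by a simple size cap): the write lock is held across the whole add -> sort -> prune sequence, so no other thread can interleave between the steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: the caller holds one write lock across the whole
// add -> sort -> prune sequence, treating it as a single transaction.
class DiverseNeighbors {
  final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock(true);
  private final List<Integer> nodes = new ArrayList<>();
  private final int maxConn;

  DiverseNeighbors(int maxConn) {
    this.maxConn = maxConn;
  }

  void addAndPrune(int node) {
    rwlock.writeLock().lock();
    try {
      nodes.add(node);       // add
      nodes.sort(null);      // sort
      // Stand-in for the diversity check: cap the neighbor count.
      while (nodes.size() > maxConn) {
        nodes.remove(nodes.size() - 1);
      }
    } finally {
      rwlock.writeLock().unlock();
    }
  }

  int size() {
    rwlock.readLock().lock();
    try {
      return nodes.size();
    } finally {
      rwlock.readLock().unlock();
    }
  }
}
```

If each step were only individually atomic, another thread could add a node between the sort and the prune, which is exactly the hazard described above.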

Member

We can make it a refactor later, but I am convinced we will want separate *NeighborArray classes for concurrent vs. serial graph building. There is no technical reason to slow down serial graph building, and I think we can make the concurrent one even nicer by freeing it from having to even think about serial graph building.

Contributor Author

Yeah, I don't have a very good idea for that yet; I'll definitely create an issue after this is merged.

Member

It seems like we need a ConcurrentHNSWGraphBuilder. It can use much of the same logic, but it instead uses concurrent neighbors. But that can be a later refactor.

@zhaih
Copy link
Contributor Author

zhaih commented Oct 25, 2023

@msokolov @benwtrent I removed almost all nocommits (except the renaming one) and rebased to main; please take a look if you have time.

@benwtrent Please check whether the rebase and modifications, especially the scalar quantization part, were done correctly. I did it according to my understanding, but that is not necessarily correct :)

I'll do another round of benchmark run tomorrow to make sure things are still intact.

Comment on lines 177 to 178
// System.out.println("addVectors [" + minOrd + " " + maxOrd + ") initialized.size=" +
// initializedNodes.size());
Member

all these comments should probably be removed.

Contributor Author

Good catch, removed!

@zhaih
Contributor Author

zhaih commented Oct 26, 2023

I reran the benchmark with the current code and it still looks good; both the recall and the latency remain the same.

The only unresolved issue right now is the naming of the interface. I'm aware that IHnswGraphBuilder is not the convention in Lucene; however, when I tried renaming HnswGraphBuilder to some other name and the interface to HnswGraphBuilder, git was not smart enough to detect the renaming and migrate the history. So any ideas on what would be a good name for the interface?

Member

@benwtrent benwtrent left a comment


@zhaih could you create an issue to track optimizing the single-threaded case to not use locks?

@msokolov
Contributor

As for the renaming, maybe we can do it as a follow-up PR? IHnswGraphSearcher -> HnswGraphSearcher? Or maybe we can just use HnswSearcher right now?

Re: controlling concurrency. Maybe eventually we figure out how to do this as part of ConcurrentMergeScheduler? Since this is all about merging, that seems like a logical place. It currently allows setting maxMergeCount and maxThreadCount independently, but enforces maxThreadCount <= maxMergeCount. We could potentially fiddle with this; maybe it would be as simple as relaxing that constraint and using "spare threads" to do concurrent graph merging.
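For reference, the existing ConcurrentMergeScheduler knobs mentioned above are set like this (the relaxation of the maxThreadCount <= maxMergeCount constraint is speculative; this fragment only shows where the current settings live):

```java
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

// Configure the merge scheduler's existing thread/merge limits.
// maxThreadCount must currently be <= maxMergeCount.
ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
cms.setMaxMergesAndThreads(/* maxMergeCount */ 6, /* maxThreadCount */ 3);

IndexWriterConfig iwc = new IndexWriterConfig();
iwc.setMergeScheduler(cms);
```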

Contributor

@msokolov msokolov left a comment


Looking good, I think it's ready to merge.

@zhaih
Contributor Author

zhaih commented Oct 28, 2023

Thanks Mike and Ben for reviewing! I'll try to merge and backport it tomorrow, and create an issue for the future refactoring.

@zhaih zhaih merged commit a8c52e2 into apache:main Oct 28, 2023
zhaih added a commit that referenced this pull request Oct 28, 2023
@zhaih zhaih added this to the 9.9.0 milestone Oct 28, 2023
@zhaih zhaih deleted the concurrentKnn branch October 29, 2023 04:33
@zhaih
Contributor Author

zhaih commented Oct 31, 2023

FYI I pushed a commit 4576ae0 to fix the behavior where the vector writer closed the executor that was passed in.

@jpountz
Contributor

jpountz commented Oct 31, 2023

For my understanding, how does it play out with merge throttling, could we end up sleeping within tasks that get passed to the configured executor service?

@zhaih
Contributor Author

zhaih commented Oct 31, 2023

For my understanding, how does it play out with merge throttling, could we end up sleeping within tasks that get passed to the configured executor service?

I think we rate-limit only the IndexOutput in CMS? If my understanding is correct, then this is not affected by throttling, as it's all in-memory operations and we write to disk only after all the tasks are completed.

javanna pushed a commit that referenced this pull request Nov 21, 2023
…graph build (#12799)

Make the TaskExecutor public which is currently pkg-private. At indexing time we concurrently create the hnsw graph (Concurrent HNSW Merge #12660). We could use the TaskExecutor implementation to do this for us.
Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
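The fan-out pattern this commit describes can be illustrated with the JDK's ExecutorService (an analogy, not Lucene's TaskExecutor code): each worker builds its slice of the graph, and invokeAll blocks until all workers finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of the worker fan-out in HnswConcurrentMergeBuilder#build:
// spawn one task per merge worker and wait for all of them to complete.
class WorkerFanOut {
  static int runWorkers(int numWorkers) throws Exception {
    ExecutorService exec = Executors.newFixedThreadPool(numWorkers);
    try {
      List<Callable<Integer>> workers = new ArrayList<>();
      for (int i = 0; i < numWorkers; i++) {
        final int id = i;
        workers.add(() -> id); // stand-in for building a slice of the graph
      }
      int sum = 0;
      for (Future<Integer> f : exec.invokeAll(workers)) {
        sum += f.get(); // waits for completion and propagates failures
      }
      return sum;
    } finally {
      exec.shutdown();
    }
  }
}
```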
javanna pushed a commit to javanna/lucene that referenced this pull request Nov 21, 2023
…graph build (apache#12799)

Make the TaskExecutor public which is currently pkg-private. At indexing time we concurrently create the hnsw graph (Concurrent HNSW Merge apache#12660). We could use the TaskExecutor implementation to do this for us.
Use TaskExecutor#invokeAll in HnswConcurrentMergeBuilder#build to run the workers concurrently.
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this pull request May 6, 2024
Lucene PR  apache/lucene#12660 introduced
concurrent HNSW merge and PR
 apache/lucene#13124 made ConcurrentMergeScheduler
to provide an intra merge executor.

But to enable concurrent intra merge for HNSW graphs we still need to
provide numMergeWorker to the codec definition.
This PR does the following:

- provides numMergeWorker to the HNSW vectors codec definitions, where
numMergeWorker is set from the node's property index.merge.scheduler.max_thread_count
- enables concurrent merge only for force merge operations.