Initial implementation of vector similarity index #63675
rschu1ze merged 31 commits into ClickHouse:master
@rschu1ze Sorry to jump into the draft, I was just wondering if you had an idea of what kind of search speed improvement we might see in ClickHouse Cloud (if any) due to the SimSIMD hardware acceleration (e.g. negligible, 2x, 10x)?
Dear @antaljanosbenjamin, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.
First, the index type name "vector_similarity" is more descriptive and user-friendly than "usearch". Second, we should not expose the name of the library doing the job (usearch). Of course, the docs will continue to mention usearch (credit where credit is due). The existing setting `allow_experimental_usearch_index` was marked obsolete. A new setting `allow_experimental_vector_similarity_index` was added.
Index types 'annoy' and 'usearch' were removed and replaced by 'vector_similarity' indexes in an earlier commit. Unfortunately, this means that if customers have tables with these indexes and upgrade, their database might not start anymore: the system loads the metadata at startup, thinks something is wrong with such tables, and halts immediately. This commit adds support for loading and attaching such indexes again. Data insert or use (search) returns an error which recommends a migration to 'vector_similarity' indexes. The implementation is generally similar to what has recently been implemented for 'full_text' indexes [1, 2].
[1] ClickHouse#64656
[2] ClickHouse#64846
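A migration of the kind the error message recommends might look like the sketch below. The table name `tab`, index name `idx`, and column `vec` are hypothetical, and the exact steps suggested by the error message are not spelled out in this conversation. The statements themselves are ordinary ClickHouse `ALTER TABLE` index operations.

```sql
-- Hypothetical names: table `tab`, legacy index `idx` (type annoy or usearch), column `vec`.
SET allow_experimental_vector_similarity_index = 1;

-- Drop the legacy index definition ...
ALTER TABLE tab DROP INDEX idx;

-- ... re-create it with the new index type ...
ALTER TABLE tab ADD INDEX idx vec TYPE vector_similarity('hnsw', 'L2Distance');

-- ... and build it for the data parts that already exist.
ALTER TABLE tab MATERIALIZE INDEX idx;
```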
USearch (similar to FAISS) allows specifying the distance function, quantization, and various HNSW meta-parameters for index creation and search. Some users wished for greater configurability, so let's expose them. Index creation now requires either
- 2 parameters (with the other 4 parameters taking on default values), or
- 6 parameters for full control
This commit also removes quantization `f64` (that would be upsampling).
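As a rough illustration of the six-parameter form, the sketch below assumes the four additional arguments are, in order, the quantization followed by three HNSW parameters (graph connectivity, then the expansion factors used at index build and at search time). That ordering and the numeric values are assumptions chosen for illustration, not defaults taken from this PR. The table and column names are hypothetical.

```sql
SET allow_experimental_vector_similarity_index = 1;

CREATE TABLE tab_full_control
(
    id Int64,
    vec Array(Float32),
    -- method, distance function, quantization, then three HNSW parameters
    -- (illustrative values; argument order is an assumption, see above)
    INDEX idx vec TYPE vector_similarity('hnsw', 'cosineDistance', 'f16', 32, 256, 128)
)
ENGINE = MergeTree
ORDER BY id;
```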
Previously, only this syntax to create a skip index worked:
INDEX index_name column_name TYPE vector_similarity('hnsw', 'L2Distance')
Now, this syntax will work as well:
INDEX index_name column_name TYPE vector_similarity(hnsw, L2Distance)
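For context, a minimal end-to-end sketch using the unquoted form could look as follows. The table name, column names, and data are hypothetical. The final query follows the `ORDER BY <distance>(...) LIMIT N` shape that this kind of index is meant to accelerate; whether the index is actually used for a given query depends on the server version and settings.

```sql
SET allow_experimental_vector_similarity_index = 1;

CREATE TABLE tab_unquoted
(
    id Int64,
    vec Array(Float32),
    -- Unquoted arguments, as enabled by the commit above
    INDEX idx vec TYPE vector_similarity(hnsw, L2Distance)
)
ENGINE = MergeTree
ORDER BY id;

-- All vectors must have the same number of dimensions (2 here)
INSERT INTO tab_unquoted VALUES (1, [0.0, 1.0]), (2, [1.0, 1.0]), (3, [5.0, 4.0]);

-- Nearest neighbours of the reference vector [0.0, 2.0]
SELECT id
FROM tab_unquoted
ORDER BY L2Distance(vec, [0.0, 2.0])
LIMIT 2;
```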
ClickHouse Stateful tests (tsan, ParallelReplicas): #66683
antaljanosbenjamin
left a comment
Only minor things. Feel free to merge the PR; the issues can easily be addressed in another PR, none of them are critical.
Thanks @antaljanosbenjamin for checking. I am already working on the successor PR and will include all your suggestions there.
This PR changed too many things to enumerate here. I tried to keep individual commits small, self-contained, and properly described (commit msg). Reviewers may go through the individual commits.
In sum:
- `annoy` and `usearch` are gone (*), replaced by `vector_similarity`,
- the `vector_similarity` index is under setting `allow_experimental_vector_similarity_index`
(*) It will still be possible to load old tables with `annoy`/`usearch` indexes, but data insert / search will throw a message that recommends migration to `vector_similarity`.
Fixes: #49669
Changelog category (leave one):