docs/en/engines/table-engines/mergetree-family/annindexes.md (+42 −1)
@@ -8,7 +8,7 @@ title: 'Exact and Approximate Vector Search'
# Exact and approximate vector search
The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search) or, in short: vector search.
Two general approaches exist for solving vector search:
- Exact vector search calculates the distance between the given point and all points in the vector space. This ensures the best possible accuracy, i.e. the returned points are guaranteed to be the actual nearest neighbors. Since the vector space is explored exhaustively, exact vector search can be too slow for real-world use.
- Approximate vector search refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.
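In ClickHouse, an exact (brute-force) vector search can be written as an ordinary `ORDER BY ... LIMIT` query over a distance function; the table `tab`, its `Array(Float32)` column `vec`, and the short reference vector below are hypothetical placeholders:

```sql
-- Exact search: compute the distance from every stored vector to the
-- reference vector and keep the 10 closest rows.
SELECT id
FROM tab
ORDER BY L2Distance(vec, [0.17, 0.33, 0.52])
LIMIT 10;
```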
@@ -147,6 +147,47 @@ Further restrictions apply:
- Vector similarity indexes require that all arrays in the underlying column have `<dimension>`-many elements - this is checked during index creation. To detect violations of this requirement as early as possible, users can add a [constraint](/sql-reference/statements/create/table.md#constraints) for the vector column, e.g., `CONSTRAINT same_length CHECK length(vectors) = 256`.
- Likewise, array values in the underlying column must not be empty (`[]`) or have a default value (also `[]`).
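A minimal table definition illustrating the restrictions above (the table, column, and index names are hypothetical; the `vector_similarity` index parameters follow the syntax described earlier on this page):

```sql
CREATE TABLE tab
(
    id UInt64,
    vectors Array(Float32),
    -- Reject arrays of the wrong length already at INSERT time,
    -- before index creation can fail on them.
    CONSTRAINT same_length CHECK length(vectors) = 256,
    INDEX idx vectors TYPE vector_similarity('hnsw', 'L2Distance', 256)
)
ENGINE = MergeTree
ORDER BY id;
```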
**Estimating storage and memory consumption**
A vector generated for use with a typical AI model (e.g., a [Large Language Model](https://en.wikipedia.org/wiki/Large_language_model)) consists of hundreds or thousands of floating-point values.
Thus, a single vector value can consume multiple kilobytes of memory.
Users who would like to estimate the storage required for the underlying vector column in the table, as well as the main memory needed for the vector similarity index, can use the two formulas below:
Storage consumption of the vector column in the table (uncompressed):
```text
Storage consumption = Number of vectors * Dimension * Size of column data type
```
Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):

The vector similarity index must be fully loaded from disk into main memory to perform searches.
Similarly, the vector index is also constructed fully in memory and then saved to disk.
Memory consumption required to load a vector index:
```text
Memory for vectors in the index (mv) = Number of vectors * Dimension * Size of quantized data type
Memory for in-memory graph (mg) = Number of vectors * hnsw_max_connections_per_layer * 2 * 4

Memory consumption: mv + mg
```
Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):
```text
Memory for vectors in the index (mv) = 1 million * 1536 * 2 (for BFloat16) = 3072 MB
Memory for in-memory graph (mg) = 1 million * 64 * 2 * 4 = 512 MB

Memory consumption = 3072 + 512 = 3584 MB
```
The formulas above do not account for additional memory required by vector similarity indexes to allocate runtime data structures such as pre-allocated buffers and caches.
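To compare such estimates with the actual on-disk size of an index, the `system.data_skipping_indices` system table can be queried; `tab` and `idx` below are hypothetical table and index names:

```sql
-- Inspect the stored size of a skipping index (vector similarity
-- indexes are a kind of skipping index).
SELECT
    name,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed
FROM system.data_skipping_indices
WHERE table = 'tab' AND name = 'idx';
```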
### Using a Vector Similarity Index {#using-a-vector-similarity-index}
docs/en/sql-reference/data-types/newjson.md (+59 −2)
@@ -673,7 +673,7 @@ By default, this limit is `1024`, but you can change it in the type declaration
When the limit is reached, all new paths inserted to a `JSON` column will be stored in a single shared data structure.
It's still possible to read such paths as sub-columns,
but it might be less efficient ([see section about shared data](#shared-data-structure)).
This limit is needed to avoid having an enormous number of different sub-columns that can make the table unusable.
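As a sketch, the limit is set in the type declaration of a hypothetical table like this:

```sql
-- Track up to 2048 dynamic paths instead of the default 1024.
CREATE TABLE test
(
    json JSON(max_dynamic_paths = 2048)
)
ENGINE = MergeTree
ORDER BY tuple();
```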
Let's see what happens when the limit is reached in a few different scenarios.
@@ -771,6 +771,63 @@ ORDER BY _part ASC
As we can see, ClickHouse kept the most frequent paths `a`, `b` and `c` and moved paths `d` and `e` to a shared data structure.
## Shared data structure {#shared-data-structure}
As described in the previous section, when the `max_dynamic_paths` limit is reached, all new paths are stored in a single shared data structure.
In this section, we look at the details of the shared data structure and how path sub-columns are read from it.
### Shared data structure in memory {#shared-data-structure-in-memory}
In memory, the shared data structure is simply a sub-column of type `Map(String, String)` that stores a mapping from flattened JSON paths to binary-encoded values.
To extract a path sub-column from it, ClickHouse iterates over all rows in this `Map` column and looks for the requested path and its values.
### Shared data structure in MergeTree parts {#shared-data-structure-in-merge-tree-parts}
In [MergeTree](../../engines/table-engines/mergetree-family/mergetree.md) tables, data is stored in data parts that keep everything on disk (local or remote), and the on-disk representation can differ from the in-memory one.
Currently, there are three different shared data structure serializations in MergeTree data parts: `map`, `map_with_buckets`, and `advanced`.
The serialization version is controlled by the MergeTree settings [object_shared_data_serialization_version](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version) and [object_shared_data_serialization_version_for_zero_level_parts](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version_for_zero_level_parts) (a zero-level part is a part created by inserting data into the table; parts produced by merges have higher levels).
Note: changing the shared data structure serialization is supported only for the `v3` [object serialization version](../../operations/settings/merge-tree-settings.md#object_serialization_version).
797
+
798
+
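Assuming the serialization names from this section are valid string values for these settings, selecting a serialization could look like the following sketch (the table definition is hypothetical):

```sql
CREATE TABLE test
(
    json JSON
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS object_serialization_version = 'v3',
         object_shared_data_serialization_version = 'advanced',
         object_shared_data_serialization_version_for_zero_level_parts = 'map';
```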
#### Map {#shared-data-map}
In the `map` serialization version, shared data is serialized as a single column of type `Map(String, String)`, the same way it is stored in memory. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column and extracts the requested path in memory.
This serialization is efficient for writing data and reading the whole `JSON` column, but it's not efficient for reading path sub-columns.
#### Map with buckets {#shared-data-map-with-buckets}
In the `map_with_buckets` serialization version, shared data is serialized as `N` columns ("buckets") of type `Map(String, String)`. Each bucket contains only a subset of the paths. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column from a single bucket and extracts the requested path in memory.
+
812
+
This serialization is less efficient for writing data and reading the whole `JSON` column, but it's more efficient for reading path sub-columns because it reads data only from the required buckets.
+
815
+
The number of buckets `N` is controlled by the MergeTree settings [object_shared_data_buckets_for_compact_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_compact_part) (8 by default) and [object_shared_data_buckets_for_wide_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_wide_part) (32 by default).
+
820
+
#### Advanced {#shared-data-advanced}
In the `advanced` serialization version, shared data is serialized in a special data structure that maximizes the performance of reading path sub-columns by storing additional information that allows reading only the data of the requested paths. This serialization also supports buckets, so each bucket contains only a subset of the paths.
This serialization is quite inefficient for writing data (so it's not recommended for zero-level parts). Reading the whole `JSON` column is slightly less efficient compared to the `map` serialization, but it's very efficient for reading path sub-columns.
Note: because of the additional information stored inside the data structure, the disk storage size is higher with this serialization compared to the `map` and `map_with_buckets` serializations.
There are several functions that can help to inspect the content of the JSON column:
@@ -975,7 +1032,7 @@ Before creating `JSON` column and loading data into it, consider the following t
- Investigate your data and specify as many path hints with types as you can. It will make storage and reading much more efficient.
- Think about what paths you will need and what paths you will never need. Specify paths that you won't need in the `SKIP` section, and `SKIP REGEXP` section if needed. This will improve the storage.
- Don't set the `max_dynamic_paths` parameter to very high values, as it can make storage and reading less efficient.
While highly dependent on system parameters such as memory and CPU, a general rule of thumb is to not set `max_dynamic_paths` greater than 10,000 for local filesystem storage and 1024 for remote filesystem storage.
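The tips above can be combined into a single type declaration; the table name, the typed paths, and the skipped paths below are hypothetical:

```sql
CREATE TABLE logs
(
    data JSON(
        max_dynamic_paths = 1024,   -- keep the limit moderate
        timestamp DateTime64(3),    -- typed path hint
        level String,               -- typed path hint
        SKIP debug.trace,           -- a path that is never read
        SKIP REGEXP '^tmp\\.'       -- never-read paths by pattern
    )
)
ENGINE = MergeTree
ORDER BY tuple();
```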