Commit 0c8faef

Merge branch 'master' into divanik/fix_wrong_column_mapper

2 parents: 282bac0 + 524fe1d
File tree: 509 files changed, +29302 / -4997 lines


ci/jobs/scripts/check_style/aspell-ignore/en/aspell-dict.txt

Lines changed: 7 additions & 1 deletion
```diff
@@ -569,6 +569,8 @@ LGPL
 LIMITs
 LINEITEM
 LLDB
+LLM
+LLMs
 LLVM's
 LOCALTIME
 LOCALTIMESTAMP
@@ -1148,6 +1150,7 @@ ThreadsActive
 ThreadsInOvercommitTracker
 TimeSeries
 TimescaleDB's
+TimestampType
 Timeunit
 TinyLog
 Tkachenko
@@ -1252,13 +1255,13 @@ XORs
 Xeon
 YAML
 YAMLRegExpTree
+YTsaurus
 YYYY
 YYYYMMDD
 YYYYMMDDToDate
 YYYYMMDDhhmmssToDateTime
 Yandex
 Yasm
-YTsaurus
 ZCurve
 ZSTDQAT
 Zabbix
@@ -2682,6 +2685,8 @@ profiler
 projectio
 proleptic
 prometheus
+prometheusQuery
+prometheusQueryRange
 proportionsZTest
 proto
 protobuf
@@ -3137,6 +3142,7 @@ timeSeriesPredictLinearToGrid
 timeSeriesRange
 timeSeriesRateToGrid
 timeSeriesResampleToGridWithStaleness
+timeSeriesSelector
 timeSeriesTags
 timeSlot
 timeSlots
```

ci/jobs/scripts/workflow_hooks/feature_docs.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -27,6 +27,7 @@ def belongs_to_autogenerated_category(filename):
         "Conditional",
         "Distance",
         "DateAndTime",
+        "Null"
     ]

     try:
```

contrib/thrift

docs/en/engines/table-engines/mergetree-family/annindexes.md

Lines changed: 42 additions & 1 deletion
````diff
@@ -8,7 +8,7 @@ title: 'Exact and Approximate Vector Search'
 
 # Exact and approximate vector search
 
-The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search) or, shorter, vector search.
+The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search) or, in short: vector search.
 Two general approaches exist for solving vector search:
 - Exact vector search calculates the distance between the given point and all points in the vector space. This ensures the best possible accuracy, i.e. the returned points are guaranteed to be the actual nearest neighbors. Since the vector space is explored exhaustively, exact vector search can be too slow for real-world use.
 - Approximate vector search refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.
@@ -147,6 +147,47 @@ Further restrictions apply:
 - Vector similarity indexes require that all arrays in the underlying column have `<dimension>`-many elements - this is checked during index creation. To detect violations of this requirement as early as possible, users can add a [constraint](/sql-reference/statements/create/table.md#constraints) for the vector column, e.g., `CONSTRAINT same_length CHECK length(vectors) = 256`.
 - Likewise, array values in the underlying column must not be empty (`[]`) or have a default value (also `[]`).
 
+**Estimating storage and memory consumption**
+
+A vector generated for use with a typical AI model (e.g., a [large language model](https://en.wikipedia.org/wiki/Large_language_model)) consists of hundreds or thousands of floating-point values.
+Thus, a single vector value can consume multiple kilobytes of memory.
+Users who would like to estimate the storage required for the underlying vector column in the table, as well as the main memory needed for the vector similarity index, can use the two formulas below.
+
+Storage consumption of the vector column in the table (uncompressed):
+
+```text
+Storage consumption = Number of vectors * Dimension * Size of column data type
+```
+
+Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):
+
+```text
+Storage consumption = 1 million * 1536 * 4 (for Float32) = 6.1 GB
+```
+
+The vector similarity index must be fully loaded from disk into main memory to perform searches.
+Similarly, the vector index is also constructed fully in memory and then saved to disk.
+
+Memory consumption required to load a vector index:
+
+```text
+Memory for vectors in the index (mv) = Number of vectors * Dimension * Size of quantized data type
+Memory for in-memory graph (mg) = Number of vectors * hnsw_max_connections_per_layer * 2 * 4
+
+Memory consumption: mv + mg
+```
+
+Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):
+
+```text
+Memory for vectors in the index (mv) = 1 million * 1536 * 2 (for BFloat16) = 3072 MB
+Memory for in-memory graph (mg) = 1 million * 64 * 2 * 4 = 512 MB
+
+Memory consumption = 3072 + 512 = 3584 MB
+```
+
+The above formulas do not account for additional memory required by vector similarity indexes for runtime data structures such as pre-allocated buffers and caches.
+
 ### Using a Vector Similarity Index {#using-a-vector-similarity-index}
 
 :::note
````
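The storage and memory estimates in the documentation change above can be turned into a small helper. This is a minimal sketch of the documented formulas (the function names are my own, not part of ClickHouse), using decimal GB/MB as in the dbpedia examples:

```python
def vector_column_storage_bytes(num_vectors: int, dimension: int,
                                value_size: int) -> int:
    # Storage consumption = Number of vectors * Dimension * Size of column data type
    return num_vectors * dimension * value_size

def vector_index_memory_bytes(num_vectors: int, dimension: int,
                              quantized_size: int,
                              hnsw_max_connections_per_layer: int) -> int:
    mv = num_vectors * dimension * quantized_size                # quantized vectors
    mg = num_vectors * hnsw_max_connections_per_layer * 2 * 4    # in-memory graph
    return mv + mg

# dbpedia example: 1 million vectors, 1536 dimensions
storage = vector_column_storage_bytes(1_000_000, 1536, 4)    # Float32 = 4 bytes
memory = vector_index_memory_bytes(1_000_000, 1536, 2, 64)   # BFloat16 = 2 bytes
print(round(storage / 1e9, 1), "GB")  # 6.1 GB
print(memory // 1_000_000, "MB")      # 3584 MB
```

As the documentation notes, treat the result as a lower bound: runtime buffers and caches of the index are not included.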

docs/en/sql-reference/data-types/newjson.md

Lines changed: 59 additions & 2 deletions
```diff
@@ -673,7 +673,7 @@ By default, this limit is `1024`, but you can change it in the type declaration
 
 When the limit is reached, all new paths inserted to a `JSON` column will be stored in a single shared data structure.
 It's still possible to read such paths as sub-columns,
-but it will require reading the entire shared data structure to extract the values of this path.
+but it might be less efficient ([see the section about shared data](#shared-data-structure)).
 This limit is needed to avoid having an enormous number of different sub-columns that can make the table unusable.
 
 Let's see what happens when the limit is reached in a few different scenarios.
@@ -771,6 +771,63 @@ ORDER BY _part ASC
 
 As we can see, ClickHouse kept the most frequent paths `a`, `b` and `c` and moved paths `d` and `e` to a shared data structure.
 
+## Shared data structure {#shared-data-structure}
+
+As described in the previous section, when the `max_dynamic_paths` limit is reached, all new paths are stored in a single shared data structure.
+In this section we look into the details of the shared data structure and how path sub-columns are read from it.
+
+### Shared data structure in memory {#shared-data-structure-in-memory}
+
+In memory, the shared data structure is simply a sub-column with type `Map(String, String)` that stores a mapping from a flattened JSON path to a binary encoded value.
+To extract a path sub-column from it, ClickHouse iterates over all rows in this `Map` column and looks for the requested path and its values.
+
+### Shared data structure in MergeTree parts {#shared-data-structure-in-merge-tree-parts}
+
+In [MergeTree](../../engines/table-engines/mergetree-family/mergetree.md) tables, data is stored in data parts that keep everything on disk (local or remote), and data on disk can be stored differently than data in memory.
+Currently, there are 3 different shared data structure serializations in MergeTree data parts: `map`, `map_with_buckets` and `advanced`.
+
+The serialization version is controlled by the MergeTree settings [object_shared_data_serialization_version](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version)
+and [object_shared_data_serialization_version_for_zero_level_parts](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version_for_zero_level_parts)
+(a zero-level part is a part created by inserting data into the table; parts created during merges have a higher level).
+
+Note: changing the shared data structure serialization is supported only for the `v3` [object serialization version](../../operations/settings/merge-tree-settings.md#object_serialization_version).
+
+#### Map {#shared-data-map}
+
+In the `map` serialization version, shared data is serialized as a single column with type `Map(String, String)`, the same way it is stored in memory. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column and extracts the requested path in memory.
+
+This serialization is efficient for writing data and for reading the whole `JSON` column, but it is not efficient for reading path sub-columns.
+
+#### Map with buckets {#shared-data-map-with-buckets}
+
+In the `map_with_buckets` serialization version, shared data is serialized as `N` columns ("buckets") with type `Map(String, String)`.
+Each bucket contains only a subset of paths. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column from a single bucket and extracts the requested path in memory.
+
+This serialization is less efficient for writing data and for reading the whole `JSON` column, but it is more efficient for reading path sub-columns because it reads data only from the required buckets.
+
+The number of buckets `N` is controlled by the MergeTree settings [object_shared_data_buckets_for_compact_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_compact_part) (8 by default)
+and [object_shared_data_buckets_for_wide_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_wide_part) (32 by default).
+
+#### Advanced {#shared-data-advanced}
+
+In the `advanced` serialization version, shared data is serialized in a special data structure that maximizes the performance of reading path sub-columns by storing additional information that allows reading only the data of the requested paths.
+This serialization also supports buckets, so each bucket contains only a subset of paths.
+
+This serialization is quite inefficient for writing data (so it is not recommended for zero-level parts), and reading the whole `JSON` column is slightly less efficient compared to the `map` serialization, but it is very efficient for reading path sub-columns.
+
+Note: because additional information is stored inside the data structure, the disk storage size is higher with this serialization than with the `map` and `map_with_buckets` serializations.
+
 ## Introspection functions {#introspection-functions}
 
 There are several functions that can help to inspect the content of the JSON column:
@@ -975,7 +1032,7 @@ Before creating `JSON` column and loading data into it, consider the following t
 - Investigate your data and specify as many path hints with types as you can. It will make storage and reading much more efficient.
 - Think about what paths you will need and what paths you will never need. Specify paths that you won't need in the `SKIP` section, and `SKIP REGEXP` section if needed. This will improve the storage.
 - Don't set the `max_dynamic_paths` parameter to very high values, as it can make storage and reading less efficient.
-  While highly dependent on system parameters such as memory, CPU, etc., a general rule of thumb would be to not set `max_dynamic_paths` > 10 000.
+  While highly dependent on system parameters such as memory, CPU, etc., a general rule of thumb is not to set `max_dynamic_paths` greater than 10 000 for local filesystem storage and 1024 for remote filesystem storage.
 
 ## Further Reading {#further-reading}
 
```
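The in-memory lookup described in the shared-data section of the change above can be modeled in a few lines. This is a hedged sketch, with plain Python dicts standing in for ClickHouse's per-row `Map(String, String)` entries; it is not actual ClickHouse code:

```python
def extract_path(shared_data_rows, path):
    """Return the value of `path` for every row, None where it is absent.

    Mirrors the documented lookup: iterate over all rows of the Map
    column and pick out the requested path."""
    return [row.get(path) for row in shared_data_rows]

# Paths "d" and "e" overflowed max_dynamic_paths and live in shared data.
rows = [
    {"d": "2023-01-01", "e": "str"},
    {"d": "2023-02-02"},
    {},
]
print(extract_path(rows, "d"))  # ['2023-01-01', '2023-02-02', None]
```

The `map_with_buckets` and `advanced` serializations refine this idea on disk: paths are sharded across `N` such maps (or augmented with lookup metadata), so a read touches only the bucket that can contain the requested path.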
