Commit 0c8faef

Merge branch 'master' into divanik/fix_wrong_column_mapper

2 parents: 282bac0 + 524fe1d
File tree: 509 files changed, +29302 / -4997 lines


ci/jobs/scripts/check_style/aspell-ignore/en/aspell-dict.txt

Lines changed: 7 additions & 1 deletion
```diff
@@ -569,6 +569,8 @@ LGPL
 LIMITs
 LINEITEM
 LLDB
+LLM
+LLMs
 LLVM's
 LOCALTIME
 LOCALTIMESTAMP
@@ -1148,6 +1150,7 @@ ThreadsActive
 ThreadsInOvercommitTracker
 TimeSeries
 TimescaleDB's
+TimestampType
 Timeunit
 TinyLog
 Tkachenko
@@ -1252,13 +1255,13 @@ XORs
 Xeon
 YAML
 YAMLRegExpTree
+YTsaurus
 YYYY
 YYYYMMDD
 YYYYMMDDToDate
 YYYYMMDDhhmmssToDateTime
 Yandex
 Yasm
-YTsaurus
 ZCurve
 ZSTDQAT
 Zabbix
@@ -2682,6 +2685,8 @@ profiler
 projectio
 proleptic
 prometheus
+prometheusQuery
+prometheusQueryRange
 proportionsZTest
 proto
 protobuf
@@ -3137,6 +3142,7 @@ timeSeriesPredictLinearToGrid
 timeSeriesRange
 timeSeriesRateToGrid
 timeSeriesResampleToGridWithStaleness
+timeSeriesSelector
 timeSeriesTags
 timeSlot
 timeSlots
```

ci/jobs/scripts/workflow_hooks/feature_docs.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -27,6 +27,7 @@ def belongs_to_autogenerated_category(filename):
         "Conditional",
         "Distance",
         "DateAndTime",
+        "Null"
     ]

     try:
```

contrib/thrift

docs/en/engines/table-engines/mergetree-family/annindexes.md

Lines changed: 42 additions & 1 deletion
````diff
@@ -8,7 +8,7 @@ title: 'Exact and Approximate Vector Search'
 
 # Exact and approximate vector search
 
-The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search) or, shorter, vector search.
+The problem of finding the N closest points in a multi-dimensional (vector) space for a given point is known as [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search) or, in short: vector search.
 Two general approaches exist for solving vector search:
 - Exact vector search calculates the distance between the given point and all points in the vector space. This ensures the best possible accuracy, i.e. the returned points are guaranteed to be the actual nearest neighbors. Since the vector space is explored exhaustively, exact vector search can be too slow for real-world use.
 - Approximate vector search refers to a group of techniques (e.g., special data structures like graphs and random forests) which compute results much faster than exact vector search. The result accuracy is typically "good enough" for practical use. Many approximate techniques provide parameters to tune the trade-off between the result accuracy and the search time.
@@ -147,6 +147,47 @@ Further restrictions apply:
 - Vector similarity indexes require that all arrays in the underlying column have `<dimension>`-many elements - this is checked during index creation. To detect violations of this requirement as early as possible, users can add a [constraint](/sql-reference/statements/create/table.md#constraints) for the vector column, e.g., `CONSTRAINT same_length CHECK length(vectors) = 256`.
 - Likewise, array values in the underlying column must not be empty (`[]`) or have a default value (also `[]`).
 
+**Estimating storage and memory consumption**
+
+A vector generated for use with a typical AI model (e.g., a [large language model](https://en.wikipedia.org/wiki/Large_language_model)) consists of hundreds or thousands of floating-point values.
+Thus, a single vector value can consume multiple kilobytes of memory.
+Users who would like to estimate the storage required for the underlying vector column in the table, as well as the main memory needed for the vector similarity index, can use the two formulas below.
+
+Storage consumption of the vector column in the table (uncompressed):
+
+```text
+Storage consumption = Number of vectors * Dimension * Size of column data type
+```
+
+Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):
+
+```text
+Storage consumption = 1 million * 1536 * 4 (for Float32) = 6.1 GB
+```
+
+The vector similarity index must be fully loaded from disk into main memory to perform searches.
+Similarly, the vector index is also constructed fully in memory and then saved to disk.
+
+Memory consumption required to load a vector index:
+
+```text
+Memory for vectors in the index (mv) = Number of vectors * Dimension * Size of quantized data type
+Memory for in-memory graph (mg) = Number of vectors * hnsw_max_connections_per_layer * 2 * 4
+
+Memory consumption: mv + mg
+```
+
+Example for the [dbpedia dataset](https://huggingface.co/datasets/KShivendu/dbpedia-entities-openai-1M):
+
+```text
+Memory for vectors in the index (mv) = 1 million * 1536 * 2 (for BFloat16) = 3072 MB
+Memory for in-memory graph (mg) = 1 million * 64 * 2 * 4 = 512 MB
+
+Memory consumption = 3072 + 512 = 3584 MB
+```
+
+The above formulas do not account for additional memory required by vector similarity indexes for runtime data structures such as pre-allocated buffers and caches.
+
 ### Using a Vector Similarity Index {#using-a-vector-similarity-index}
 
 :::note
````
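The storage and memory estimates in the documentation change above can be turned into a small helper. This is a minimal sketch of the documented formulas (the function names are my own, not part of ClickHouse), using decimal GB/MB as in the dbpedia examples:

```python
def vector_column_storage_bytes(num_vectors: int, dimension: int,
                                value_size: int) -> int:
    # Storage consumption = Number of vectors * Dimension * Size of column data type
    return num_vectors * dimension * value_size

def vector_index_memory_bytes(num_vectors: int, dimension: int,
                              quantized_size: int,
                              hnsw_max_connections_per_layer: int) -> int:
    mv = num_vectors * dimension * quantized_size                # quantized vectors
    mg = num_vectors * hnsw_max_connections_per_layer * 2 * 4    # in-memory graph
    return mv + mg

# dbpedia example: 1 million vectors, 1536 dimensions
storage = vector_column_storage_bytes(1_000_000, 1536, 4)    # Float32 = 4 bytes
memory = vector_index_memory_bytes(1_000_000, 1536, 2, 64)   # BFloat16 = 2 bytes
print(round(storage / 1e9, 1), "GB")  # 6.1 GB
print(memory // 1_000_000, "MB")      # 3584 MB
```

As the documentation notes, treat the result as a lower bound: runtime buffers and caches of the index are not included.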

docs/en/sql-reference/data-types/newjson.md

Lines changed: 59 additions & 2 deletions
```diff
@@ -673,7 +673,7 @@ By default, this limit is `1024`, but you can change it in the type declaration
 
 When the limit is reached, all new paths inserted to a `JSON` column will be stored in a single shared data structure.
 It's still possible to read such paths as sub-columns,
-but it will require reading the entire shared data structure to extract the values of this path.
+but it might be less efficient ([see the section about shared data](#shared-data-structure)).
 This limit is needed to avoid having an enormous number of different sub-columns that can make the table unusable.
 
 Let's see what happens when the limit is reached in a few different scenarios.
@@ -771,6 +771,63 @@ ORDER BY _part ASC
 
 As we can see, ClickHouse kept the most frequent paths `a`, `b` and `c` and moved paths `d` and `e` to a shared data structure.
 
+## Shared data structure {#shared-data-structure}
+
+As described in the previous section, when the `max_dynamic_paths` limit is reached, all new paths are stored in a single shared data structure.
+In this section we look into the details of the shared data structure and how path sub-columns are read from it.
+
+### Shared data structure in memory {#shared-data-structure-in-memory}
+
+In memory, the shared data structure is simply a sub-column with type `Map(String, String)` that stores a mapping from a flattened JSON path to a binary encoded value.
+To extract a path sub-column from it, ClickHouse iterates over all rows in this `Map` column and looks for the requested path and its values.
+
+### Shared data structure in MergeTree parts {#shared-data-structure-in-merge-tree-parts}
+
+In [MergeTree](../../engines/table-engines/mergetree-family/mergetree.md) tables, data is stored in data parts that keep everything on disk (local or remote), and data on disk can be stored differently than data in memory.
+Currently, there are 3 different shared data structure serializations in MergeTree data parts: `map`, `map_with_buckets` and `advanced`.
+
+The serialization version is controlled by the MergeTree settings [object_shared_data_serialization_version](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version)
+and [object_shared_data_serialization_version_for_zero_level_parts](../../operations/settings/merge-tree-settings.md#object_shared_data_serialization_version_for_zero_level_parts)
+(a zero-level part is a part created by inserting data into the table; parts created during merges have a higher level).
+
+Note: changing the shared data structure serialization is supported only for the `v3` [object serialization version](../../operations/settings/merge-tree-settings.md#object_serialization_version).
+
+#### Map {#shared-data-map}
+
+In the `map` serialization version, shared data is serialized as a single column with type `Map(String, String)`, the same way it is stored in memory. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column and extracts the requested path in memory.
+
+This serialization is efficient for writing data and for reading the whole `JSON` column, but it is not efficient for reading path sub-columns.
+
+#### Map with buckets {#shared-data-map-with-buckets}
+
+In the `map_with_buckets` serialization version, shared data is serialized as `N` columns ("buckets") with type `Map(String, String)`.
+Each bucket contains only a subset of paths. To read a path sub-column from this serialization, ClickHouse reads the whole `Map` column from a single bucket and extracts the requested path in memory.
+
+This serialization is less efficient for writing data and for reading the whole `JSON` column, but it is more efficient for reading path sub-columns because it reads data only from the required buckets.
+
+The number of buckets `N` is controlled by the MergeTree settings [object_shared_data_buckets_for_compact_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_compact_part) (8 by default)
+and [object_shared_data_buckets_for_wide_part](../../operations/settings/merge-tree-settings.md#object_shared_data_buckets_for_wide_part) (32 by default).
+
+#### Advanced {#shared-data-advanced}
+
+In the `advanced` serialization version, shared data is serialized in a special data structure that maximizes the performance of reading path sub-columns by storing additional information that allows reading only the data of the requested paths.
+This serialization also supports buckets, so each bucket contains only a subset of paths.
+
+This serialization is quite inefficient for writing data (so it is not recommended for zero-level parts), and reading the whole `JSON` column is slightly less efficient compared to the `map` serialization, but it is very efficient for reading path sub-columns.
+
+Note: because additional information is stored inside the data structure, the disk storage size is higher with this serialization than with the `map` and `map_with_buckets` serializations.
+
 ## Introspection functions {#introspection-functions}
 
 There are several functions that can help to inspect the content of the JSON column:
@@ -975,7 +1032,7 @@ Before creating `JSON` column and loading data into it, consider the following t
 - Investigate your data and specify as many path hints with types as you can. It will make storage and reading much more efficient.
 - Think about what paths you will need and what paths you will never need. Specify paths that you won't need in the `SKIP` section, and `SKIP REGEXP` section if needed. This will improve the storage.
 - Don't set the `max_dynamic_paths` parameter to very high values, as it can make storage and reading less efficient.
-  While highly dependent on system parameters such as memory, CPU, etc., a general rule of thumb would be to not set `max_dynamic_paths` > 10 000.
+  While highly dependent on system parameters such as memory, CPU, etc., a general rule of thumb is not to set `max_dynamic_paths` greater than 10 000 for local filesystem storage and 1024 for remote filesystem storage.
 
 ## Further Reading {#further-reading}
 
```
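The in-memory lookup described in the shared-data section of the change above can be modeled in a few lines. This is a hedged sketch, with plain Python dicts standing in for ClickHouse's per-row `Map(String, String)` entries; it is not actual ClickHouse code:

```python
def extract_path(shared_data_rows, path):
    """Return the value of `path` for every row, None where it is absent.

    Mirrors the documented lookup: iterate over all rows of the Map
    column and pick out the requested path."""
    return [row.get(path) for row in shared_data_rows]

# Paths "d" and "e" overflowed max_dynamic_paths and live in shared data.
rows = [
    {"d": "2023-01-01", "e": "str"},
    {"d": "2023-02-02"},
    {},
]
print(extract_path(rows, "d"))  # ['2023-01-01', '2023-02-02', None]
```

The `map_with_buckets` and `advanced` serializations refine this idea on disk: paths are sharded across `N` such maps (or augmented with lookup metadata), so a read touches only the bucket that can contain the requested path.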
