Introduce statistics of type "number of distinct values"#59357
Introduce statistics of type "number of distinct values"#59357hanfei1991 merged 36 commits intoClickHouse:masterfrom
Conversation
|
This is an automated comment for commit c04e7e6 with description of existing statuses. It's updated for the latest CI running ⏳ Click here to open a full report in a separate page
Successful checks
|
3d422d2 to
3b798b5
Compare
|
|
| The following operations are available: | ||
|
|
||
| - `ALTER TABLE [db].table ADD STATISTIC (columns list) TYPE type` - Adds statistic description to tables metadata. | ||
| - `ALTER TABLE [db].table ADD STATISTIC (columns list) TYPE (type list)` - Adds statistic description to tables metadata. |
There was a problem hiding this comment.
I know it has been like that before this PR already but still: This page does not make it clear which statistics types exist and how they are different (e.g. with respect to size, usefulness, update cost, usefulness, etc.).
Also, I did not find syntax
CREATE TABLE t1
(
a Float64 STATISTIC(tdigest),
b Int64,
pk String,
) Engine = MergeTree() ORDER BY pk;documented anywhere (I only found it used in tests). EDIT: Sorry, that is documented (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree).
It would also be great if this page could include a few examples.
Oh, and is there a system table which shows statistics objects on existing tables?
|
|
||
| Stores distribution of values from numeric columns in [TDigest](https://github.com/tdunning/t-digest) sketch. | ||
|
|
||
| - `uniq` |
There was a problem hiding this comment.
No need for too much brevity. For readability, we can afford unique or unique_count as a statistic name during CREATE/ALTER TABLE.
(for comparison, consider secondary index type bloom_filter which is also not abbreviated as bf)
There was a problem hiding this comment.
Re l. 1071: As a user, I'd like to understand more here. What is actually stored?
- a single value representing the exact distinct value count per column?
- a single value representing an approximate distinct value count per column? ("Estimate", l. 1071 makes me think something is not exact)
- a data structure (which one?) that allows to get the number of distinct values for a given value, either the exact count, or an approximate count, or some mixed form (such as top-k statistics which allow to get the exact count for the k contained values and approximate counts for all other values).
Please also mention here if the statistics exist per part, per partition, per shard, or per table.
|
@alexey-milovidov maybe I need your approval, otherwise it cannot be merged |
| @@ -0,0 +1,45 @@ | |||
| DROP TABLE IF EXISTS t1; | |||
There was a problem hiding this comment.
This test looks similar to 02864_statistic_operate.sql. What if we combine it with the other file? That will reduce the risk to change stuff in one file but then to forget changing the other file.
There was a problem hiding this comment.
well, I want to avoid a huge test file, so I made different statistics type in a different file
This PR introduces the "number of distinct values" (NDV) as statistics type. To add such statistics:
Or to add NDV statistics plus other statistics kinds:
NDV statistics are currently used to estimate the selectivity of equal (
=) filters if the NDV is small (< 2048).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Introduce statistics of type "number of distinct values".
Documentation entry for user-facing changes