Introduce statistics of type "number of distinct values" by hanfei1991 · Pull Request #59357 · ClickHouse/ClickHouse

hanfei1991 · 2024-01-29T21:49:55Z

This PR introduces the "number of distinct values" (NDV) as a statistics type. To add such statistics:

ALTER TABLE tab ADD STATISTICS col1 TYPE Uniq;

Or to add NDV statistics plus other statistics kinds:

ALTER TABLE tab ADD STATISTICS col1 TYPE TDigest, Uniq;

NDV statistics are currently used to estimate the selectivity of equality (=) filters, but only if the NDV is small (< 2048).
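As a rough illustration of how an NDV value can feed a selectivity estimate, here is a minimal sketch. It assumes the textbook uniformity rule (each distinct value is equally frequent, so an equality filter keeps about 1/NDV of the rows). The PR description only states that NDV is used when it is below 2048; the 1/NDV formula, the function names, and the fallback value are assumptions for illustration, not the ClickHouse implementation.

```python
# Threshold taken from the PR description: NDV is only used when it is small.
NDV_THRESHOLD = 2048

# Hypothetical default used when NDV statistics are missing or too large.
# This number is an illustration only and is not taken from ClickHouse.
FALLBACK_SELECTIVITY = 0.1


def estimate_equal_selectivity(ndv):
    """Estimate the fraction of rows matching `col = const`.

    Assumes values are uniformly distributed over the distinct values,
    so one value accounts for roughly 1 / NDV of the rows.
    """
    if ndv is not None and 0 < ndv < NDV_THRESHOLD:
        return 1.0 / ndv
    return FALLBACK_SELECTIVITY


def estimate_matching_rows(row_count, ndv):
    """Estimated number of rows that survive the equality filter."""
    return row_count * estimate_equal_selectivity(ndv)
```

For example, with 1,000,000 rows and an NDV of 100, this sketch estimates about 10,000 matching rows; with an NDV of 5000 it falls back to the hypothetical default.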

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Introduce statistics of type "number of distinct values".

Documentation entry for user-facing changes

[ X ] Documentation is written (mandatory for new features)

robot-ch-test-poll4 · 2024-01-29T21:52:54Z

This is an automated comment for commit c04e7e6 with a description of the existing statuses. It is updated for the latest CI run.


| Check name | Description | Status |
|---|---|---|
| A Sync | If it fails, ask a maintainer for help | ⏳ pending |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ⏳ pending |
| Mergeable Check | Checks if all other necessary checks are successful | ⏳ pending |

Successful checks

| Check name | Description | Status |
|---|---|---|
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log by grepping for cmake. Use these options and follow the general build process | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of the stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks whether newly added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer and additional randomization of thread scheduling. Integration tests are run up to 10 times. If a new test fails at least once, or takes too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| PR Check | Checks correctness of the PR's body | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of the tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

src/Storages/Statistics/Statistics.cpp

docs/en/engines/table-engines/mergetree-family/mergetree.md

src/Storages/Statistics/UniqStatistic.h

src/Storages/Statistics/Statistics.cpp

alexey-milovidov · 2024-03-20T01:15:36Z

uniq uses adaptive sampling - it is fast and precise, at the expense of memory usage.
For column statistics, even 2.5 KB HLL12 (with small object optimization) will work.
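The 2.5 KB figure follows from the HyperLogLog layout with precision 12: 2^12 = 4096 registers of 5 bits each is 20480 bits, or 2560 bytes. The sketch below is a generic textbook HyperLogLog with that layout, written only to make the size/accuracy trade-off concrete. It is not ClickHouse's `uniq` or `uniqHLL12` code; the class name, the SHA-1 based hash, and the list-based register storage are assumptions for illustration.

```python
import hashlib
import math

PRECISION = 12                                     # the "12" in HLL12
NUM_REGISTERS = 1 << PRECISION                     # 4096 registers
REGISTER_BITS = 5                                  # bits per register
STATE_BYTES = NUM_REGISTERS * REGISTER_BITS // 8   # 2560 bytes = 2.5 KiB
MAX_RANK = (1 << REGISTER_BITS) - 1                # largest value a 5-bit register holds


def hash64(value):
    """Deterministic 64-bit hash (illustrative choice, not ClickHouse's hash)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog12:
    """Textbook HyperLogLog. Registers are kept in a plain list for clarity;
    a packed 5-bit encoding would occupy STATE_BYTES bytes."""

    def __init__(self):
        self.registers = [0] * NUM_REGISTERS

    def add(self, value):
        h = hash64(value)
        tail_bits = 64 - PRECISION
        index = h >> tail_bits                     # top 12 bits pick the register
        tail = h & ((1 << tail_bits) - 1)          # remaining 52 bits
        # Rank = 1-based position of the leftmost 1-bit in the tail.
        rank = min(tail_bits - tail.bit_length() + 1, MAX_RANK)
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        m = NUM_REGISTERS
        alpha = 0.7213 / (1.0 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)
        return raw
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096), about 1.6%, which is usually sufficient for a planner that only needs a coarse distinct-value count.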

rschu1ze · 2024-04-08T11:42:14Z

docs/en/sql-reference/statements/alter/statistic.md

 The following operations are available:

-   `ALTER TABLE [db].table ADD STATISTIC (columns list) TYPE type` - Adds statistic description to tables metadata.
+-   `ALTER TABLE [db].table ADD STATISTIC (columns list) TYPE (type list)` - Adds statistic description to tables metadata.


I know it was already like that before this PR, but still: this page does not make it clear which statistics types exist and how they differ (e.g. with respect to size, usefulness, update cost, etc.).

Also, I did not find syntax

CREATE TABLE t1
(
    a Float64 STATISTIC(tdigest),
    b Int64,
    pk String
)
ENGINE = MergeTree()
ORDER BY pk;

documented anywhere (I only found it used in tests). EDIT: Sorry, that is documented (https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree).

It would also be great if this page could include a few examples.

Oh, and is there a system table which shows statistics objects on existing tables?

rschu1ze · 2024-04-08T12:03:33Z

docs/en/engines/table-engines/mergetree-family/mergetree.md


    Stores distribution of values from numeric columns in [TDigest](https://github.com/tdunning/t-digest) sketch.

+- `uniq`


No need for too much brevity. For readability, we can afford unique or unique_count as a statistic name during CREATE/ALTER TABLE.

(for comparison, consider secondary index type bloom_filter which is also not abbreviated as bf)

Re l. 1071: As a user, I'd like to understand more here. What is actually stored?

- A single value representing the exact distinct value count per column?
- A single value representing an approximate distinct value count per column? ("Estimate" in l. 1071 makes me think something is not exact.)
- A data structure (which one?) that allows one to get the number of distinct values for a given value, either the exact count, or an approximate count, or some mixed form (such as top-k statistics, which give the exact count for the k contained values and approximate counts for all other values)?

Please also mention here whether the statistics exist per part, per partition, per shard, or per table.
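To make the "mixed form" in the last question concrete, here is a minimal sketch of a top-k statistic: exact counts for the k most frequent values, and an averaged approximation for everything else. This only illustrates the concept the reviewer mentions; it is not something this PR implements, and the class and attribute names are hypothetical.

```python
from collections import Counter


class TopKStatistic:
    """Exact counts for the k most frequent values, averaged counts for the rest."""

    def __init__(self, values, k):
        counts = Counter(values)
        # Exact row counts for the k most frequent values.
        self.top = dict(counts.most_common(k))
        # Everything outside the top-k is summarized by two numbers.
        self.rest_rows = sum(counts.values()) - sum(self.top.values())
        self.rest_distinct = len(counts) - len(self.top)

    def estimate_count(self, value):
        """Exact count for a top-k value, uniform approximation otherwise."""
        if value in self.top:
            return float(self.top[value])
        if self.rest_distinct == 0:
            return 0.0
        return self.rest_rows / self.rest_distinct
```

A value outside the top-k list, including one that never occurred, receives the same averaged estimate, which is the approximate part of the mixed form.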

src/Storages/StatisticsDescription.cpp

src/Storages/Statistics/TDigestStatistic.h

src/Storages/Statistics/UniqStatistic.h

tests/queries/0_stateless/02864_statistic_uniq.sql

hanfei1991 · 2024-05-21T15:01:55Z

@alexey-milovidov maybe I need your approval, otherwise it cannot be merged

tests/integration/test_manipulate_statistics/test.py

tests/queries/0_stateless/02864_statistic_exception.sql

rschu1ze · 2024-05-21T20:26:24Z

tests/queries/0_stateless/02864_statistic_uniq.sql

@@ -0,0 +1,45 @@
+DROP TABLE IF EXISTS t1;


This test looks similar to 02864_statistic_operate.sql. What if we combine it with the other file? That would reduce the risk of changing something in one file and then forgetting to change the other.

Well, I want to avoid a huge test file, so I put each statistics type in its own file.

fixed

hanfei1991 added 2 commits January 29, 2024 19:49

support uniq for statistics

552e1ac

Merge branch 'master' into hanfei/stats_uniq

f46065b

robot-ch-test-poll4 added the pr-feature label ("Pull request with new product feature") Jan 29, 2024

fix style

b755db6

alexey-milovidov previously requested changes Jan 29, 2024

src/Storages/Statistics/Statistics.cpp (outdated, resolved)

alexey-milovidov changed the title ~~Support NDV Statistics~~ Support "number of distinct values" Statistics Jan 30, 2024

ucasfl reviewed Jan 30, 2024

docs/en/engines/table-engines/mergetree-family/mergetree.md (outdated, resolved)

hanfei1991 added 2 commits January 30, 2024 10:30

address comments

95abcaf

try to fix tests

3b798b5

hanfei1991 force-pushed the hanfei/stats_uniq branch from 3d422d2 to 3b798b5 Compare January 30, 2024 16:38

make tests greate again

2b5b958

UnamedRus reviewed Feb 13, 2024

src/Storages/Statistics/UniqStatistic.h (outdated, resolved)

hanfei1991 mentioned this pull request Mar 11, 2024

Fix statistic merge #61143

Closed

1 task

JackyWoo reviewed Mar 12, 2024

src/Storages/Statistics/Statistics.cpp (resolved)

src/Storages/Statistics/Statistics.cpp (resolved)

Merge branch 'master' into hanfei/stats_uniq

b2ceeba

hanfei1991 added 9 commits March 21, 2024 17:28

fix tests

7ec3c48

Merge branch 'master' into hanfei/stats_uniq

a367e07

Merge branch 'master' into hanfei/stats_uniq

61052e1

Merge branch 'master' into hanfei/stats_uniq

11a4ae5

fix tests

4775259

Merge branch 'master' into hanfei/stats_uniq

5df9152

try to fix tests

547f993

fix tests

e38ab18

fix clang tidy

1979ea5

rschu1ze self-assigned this Apr 5, 2024

rschu1ze reviewed Apr 8, 2024


hanfei1991 added 2 commits April 9, 2024 12:15

Merge branch 'master' into hanfei/stats_uniq

e5aa439

address comments

3a38064

hanfei1991 added 4 commits May 14, 2024 18:16

Merge branch 'master' into hanfei/stats_uniq

79bbe0b

Merge branch 'master' into hanfei/stats_uniq

a19fb0a

address comments

e9cfdc9

refine docs

2e2d207

rschu1ze approved these changes May 19, 2024


hanfei1991 added 2 commits May 21, 2024 01:28

Merge branch 'master' into hanfei/stats_uniq

b8e7e99

fix tests

93a6c1e

rschu1ze reviewed May 21, 2024


address comments

6e15d6b

rschu1ze approved these changes May 22, 2024


hanfei1991 added 6 commits May 23, 2024 15:28

fix fuzzer

76eae62

Merge branch 'master' into hanfei/stats_uniq

ee7ad46

refind docs

dc30cee

fix

25d9741

Merge branch 'master' into hanfei/stats_uniq

7b9ee52

fix build

e939e0a

hanfei1991 enabled auto-merge May 26, 2024 14:50

hanfei1991 requested a review from alexey-milovidov May 26, 2024 14:51

Merge branch 'master' into hanfei/stats_uniq

f7ca338

Merge branch 'master' into hanfei/stats_uniq

c04e7e6

hanfei1991 added this pull request to the merge queue Jun 5, 2024

Merged via the queue into ClickHouse:master with commit ac430bb Jun 5, 2024

hanfei1991 deleted the hanfei/stats_uniq branch June 5, 2024 13:23

robot-clickhouse-ci-1 added the pr-synced-to-cloud label ("The PR is synced to the cloud repo") Jun 5, 2024

nikitamikhaylov mentioned this pull request Jun 5, 2024

Add new settings to changes history. #64860

Merged

22 tasks

Algunenano mentioned this pull request Jun 25, 2024

Flaky / stuck: 02864_statistics_uniq #65655

Closed

alexey-milovidov mentioned this pull request Aug 3, 2024

Logical error: 'Statistics 'tdigest' does not support estimating value of type String'. #67742

Closed



8 participants
