-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Inconsistent query results #135
Description
I was running some queries to verify the consistency of data after a load operation. Below are some observations
Clickhouse version details
ClickHouse client version 1.1.54022.
Connecting to database smsr at localhost:9000.
Connected to ClickHouse server version 1.1.54022.
TOTAL number of records loaded. This is consistent with the raw CSV data.
:) select count() from cdr;
SELECT count()
FROM cdr
┌───count()─┐
│ 527670245 │
└───────────┘
1 rows in set. Elapsed: 0.138 sec. Processed 527.67 million rows, 1.06 GB (3.83 billion rows/s., 7.65 GB/s.)
The number of records for each date. This too is consistent with the raw data.
:) select count(), merge_date from cdr group by merge_date;
SELECT
count(),
merge_date
FROM cdr
GROUP BY merge_date
┌───count()─┬─merge_date─┐
│ 205080000 │ 2015-01-21 │
│ 322590245 │ 2015-01-22 │
└───────────┴────────────┘
2 rows in set. Elapsed: 0.835 sec. Processed 527.67 million rows, 1.06 GB (631.70 million rows/s., 1.26 GB/s.)
How can the total number of distinct rows when grouped by the merge date field be more than the total records in the table?
:) select count(distinct id), merge_date from cdr group by merge_date;
SELECT
countDistinct(id),
merge_date
FROM cdr
GROUP BY merge_date
┌──uniq(id)─┬─merge_date─┐
│ 206247254 │ 2015-01-21 │
│ 324102175 │ 2015-01-22 │
└───────────┴────────────┘
2 rows in set. Elapsed: 21.991 sec. Processed 527.67 million rows, 3.17 GB (23.99 million rows/s., 143.97 MB/s.)
:) select 206247254 + 324102175, 206247254 + 324102175 - 527670245;
SELECT
206247254 + 324102175,
(206247254 + 324102175) - 527670245
┌─plus(206247254, 324102175)─┬─minus(plus(206247254, 324102175), 527670245)─┐
│ 530349429 │ 2679184 │
└────────────────────────────┴──────────────────────────────────────────────┘
1 rows in set. Elapsed: 0.005 sec.
How can the total number of distinct column values be more than the total records in the table?
:) select count(distinct id), count(), count(distinct id) - count() from cdr;
SELECT
countDistinct(id),
count(),
countDistinct(id) - count()
FROM cdr
┌──uniq(id)─┬───count()─┬─minus(uniq(id), count())─┐
│ 531654647 │ 527670245 │ 3984402 │
└───────────┴───────────┴──────────────────────────┘
1 rows in set. Elapsed: 1.169 sec. Processed 527.67 million rows, 2.11 GB (451.55 million rows/s., 1.81 GB/s.)
Are there any approximations involved here? How can we get precision if it is so?
Attached a text file with the query outputs for readability.