Inconsistent query results

I was running some queries to verify the consistency of data after a load operation.  Below are some observations
# Clickhouse version details

ClickHouse client version 1.1.54022.
Connecting to database smsr at localhost:9000.
Connected to ClickHouse server version 1.1.54022.
# TOTAL number of records loaded. This is consistent with the raw CSV data.

:) select count() from cdr;

SELECT count()
FROM cdr

┌───count()─┐
│ 527670245 │
└───────────┘

1 rows in set. Elapsed: 0.138 sec. Processed 527.67 million rows, 1.06 GB (3.83 billion rows/s., 7.65 GB/s.)
# The number of records for each date.  This too is consistent with the raw data.

:) select count(), merge_date from cdr group by merge_date;

SELECT
    count(),
    merge_date
FROM cdr
GROUP BY merge_date

┌───count()─┬─merge_date─┐
│ 205080000 │ 2015-01-21 │
│ 322590245 │ 2015-01-22 │
└───────────┴────────────┘

2 rows in set. Elapsed: 0.835 sec. Processed 527.67 million rows, 1.06 GB (631.70 million rows/s., 1.26 GB/s.)
# How can the total number of distinct rows when grouped by the merge date field be more than the total records in the table?

:) select count(distinct id), merge_date from cdr group by merge_date;

SELECT
    countDistinct(id),
    merge_date
FROM cdr
GROUP BY merge_date

┌──uniq(id)─┬─merge_date─┐
│ 206247254 │ 2015-01-21 │
│ 324102175 │ 2015-01-22 │
└───────────┴────────────┘

2 rows in set. Elapsed: 21.991 sec. Processed 527.67 million rows, 3.17 GB (23.99 million rows/s., 143.97 MB/s.)

:) select 206247254  + 324102175, 206247254 + 324102175 - 527670245;

SELECT
    206247254 + 324102175,
    (206247254 + 324102175) - 527670245

┌─plus(206247254, 324102175)─┬─minus(plus(206247254, 324102175), 527670245)─┐
│                  530349429 │                                      2679184 │
└────────────────────────────┴──────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.005 sec.
# How can the total number of distinct column values be more than the total records in the table?

:) select count(distinct id), count(), count(distinct id) - count() from cdr;

SELECT
    countDistinct(id),
    count(),
    countDistinct(id) - count()
FROM cdr

┌──uniq(id)─┬───count()─┬─minus(uniq(id), count())─┐
│ 531654647 │ 527670245 │                  3984402 │
└───────────┴───────────┴──────────────────────────┘

1 rows in set. Elapsed: 1.169 sec. Processed 527.67 million rows, 2.11 GB (451.55 million rows/s., 1.81 GB/s.)

Are there any approximations involved here?  How can we get precision if it is so?

Attached a text file with the query outputs for readability.

[clickhouse_inconsistent.txt](https://github.com/yandex/ClickHouse/files/528670/clickhouse_inconsistent.txt)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent query results #135

Clickhouse version details

TOTAL number of records loaded. This is consistent with the raw CSV data.

The number of records for each date. This too is consistent with the raw data.

How can the total number of distinct rows when grouped by the merge date field be more than the total records in the table?

How can the total number of distinct column values be more than the total records in the table?

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Inconsistent query results #135

Description

Clickhouse version details

TOTAL number of records loaded. This is consistent with the raw CSV data.

The number of records for each date. This too is consistent with the raw data.

How can the total number of distinct rows when grouped by the merge date field be more than the total records in the table?

How can the total number of distinct column values be more than the total records in the table?

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions