FINAL on table with is_deleted works 10-30 times slower than expected #65180
Problem
I have two ReplacingMergeTree tables with identical schemas and identical data, except that one table uses the is_deleted engine parameter and the other does not. Queries against the table with is_deleted are much slower than queries against the table without is_deleted that filter out deleted rows explicitly.
In other words, `SELECT count(*) FROM table_with_is_deleted FINAL` runs much slower than `SELECT count(*) FROM table_without_is_deleted FINAL WHERE deleted = 0`.
Observations
- The output of EXPLAIN is the same for both queries.
- For CH 24.5, in system.query_log I see that the slow queries use the following functions: `['and', 'not', 'notIn', 'equals', 'greater', 'tuple', 'CAST']`, while the fast queries use `['notIn', 'and', 'equals']`. For CH 24.2 it is `['equals', 'not', 'and', 'greater', 'tuple', 'CAST']` and `['equals', 'and']` respectively.
- In the query profile, the slow queries read much more data.
- Queries on tables with the `is_deleted` option are about 20% slower on CH 24.5 than the same queries on CH 24.4.
- It does not matter whether the tables are optimized (i.e. have just one part) or not, as long as both have the same number of active parts.
- While generating sample data for this ticket, I noticed that I could reproduce the issue only with a fairly large amount of data; on small data sets the queries are fast (`used_functions` is just `['notIn', 'and', 'equals']`). I also noticed that when the cardinality of the data is high, both queries are slow (`used_functions` is `['and', 'not', 'notIn', 'equals', 'greater', 'tuple', 'CAST']`).
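The `used_functions` lists above can be pulled from `system.query_log`. A minimal sketch (assuming the query log is enabled; the text filter on `query` is illustrative only):

```sql
-- Inspect which functions each count(*) query used and how much it read
-- (assumes system.query_log is enabled on the server).
SELECT
    query,
    used_functions,
    read_rows,
    read_bytes,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%FINAL%'
ORDER BY event_time DESC
LIMIT 10;
```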
How to reproduce
- Which ClickHouse server version to use: CH 24.2, 24.4, 24.5
- `CREATE TABLE` statements for all tables involved:
CREATE TABLE table_with_is_deleted
(
`school_year_id` UInt32,
`district_id` UInt32,
`student_id` UInt32,
`assessment_id` UInt32,
`value` Int32,
`etl_timestamp` Int64,
`deleted` UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(etl_timestamp, deleted)
ORDER BY (school_year_id, district_id, student_id, assessment_id);
CREATE TABLE table_without_is_deleted
(
`school_year_id` UInt32,
`district_id` UInt32,
`student_id` UInt32,
`assessment_id` UInt32,
`value` Int32,
`etl_timestamp` Int64,
`deleted` UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(etl_timestamp)
ORDER BY (school_year_id, district_id, student_id, assessment_id);
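For context: the only schema difference is the second ReplacingMergeTree parameter. With `is_deleted`, a newer row version carrying `deleted = 1` is dropped during merges and by FINAL instead of being returned as the latest version. A minimal sketch with hypothetical values:

```sql
-- Hypothetical rows: the second insert is a newer version (higher
-- etl_timestamp) with deleted = 1, so on table_with_is_deleted
-- FINAL drops the row entirely rather than returning it.
INSERT INTO table_with_is_deleted VALUES (1, 1, 42, 7, 100, 1000, 0);
INSERT INTO table_with_is_deleted VALUES (1, 1, 42, 7, 100, 1001, 1);
-- SELECT count(*) FROM table_with_is_deleted FINAL
-- WHERE student_id = 42 AND assessment_id = 7;   -- expected: 0
```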
- Sample data for all these tables:
TRUNCATE TABLE table_with_is_deleted;
INSERT INTO table_with_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
1 as school_year_id,
1 as district_id,
students.number AS student_id,
assessments.number AS assessment_id,
rand32() as value,
now() as etl_timestamp,
0 as deleted
FROM numbers(10000) as students
CROSS JOIN numbers(200) as assessments;
INSERT INTO table_with_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
6 as school_year_id,
6 as district_id,
students.number AS student_id,
assessments.number AS assessment_id,
rand32() as value,
now() as etl_timestamp,
0 as deleted
FROM numbers(30000) as students
CROSS JOIN numbers(100) as assessments;
TRUNCATE TABLE table_without_is_deleted;
INSERT INTO table_without_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted
FROM table_with_is_deleted;
OPTIMIZE TABLE table_with_is_deleted FINAL;
OPTIMIZE TABLE table_without_is_deleted FINAL;
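To confirm the observation that both tables end up with the same number of active parts, `system.parts` can be checked (a sketch, not part of the original report):

```sql
-- Verify both tables have the same number of active parts and rows
SELECT
    table,
    count() AS active_parts,
    sum(rows) AS total_rows
FROM system.parts
WHERE active
  AND table IN ('table_with_is_deleted', 'table_without_is_deleted')
GROUP BY table;
```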
- Queries to run that lead to slow performance:
SELECT count(*) FROM table_without_is_deleted FINAL WHERE school_year_id = 6 AND district_id = 6 AND deleted = 0;
SELECT count(*) FROM table_with_is_deleted FINAL WHERE school_year_id = 6 AND district_id = 6;
SELECT count(*) FROM table_without_is_deleted FINAL WHERE deleted = 0;
SELECT count(*) FROM table_with_is_deleted FINAL;
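Since the plain EXPLAIN output is identical for both tables, one way to dig into where the extra reads come from (a sketch, not from the original report) is to compare the index analysis of the slow and fast variants:

```sql
-- Compare primary-key index analysis for the slow (is_deleted) and
-- fast (explicit filter) variants of the same query.
EXPLAIN indexes = 1
SELECT count(*) FROM table_with_is_deleted FINAL
WHERE school_year_id = 6 AND district_id = 6;

EXPLAIN indexes = 1
SELECT count(*) FROM table_without_is_deleted FINAL
WHERE school_year_id = 6 AND district_id = 6 AND deleted = 0;
```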