Skip to content

FINAL on table with is_deleted works 10-30 times slower than expected #65180

@AlexKichkailo

Description

@AlexKichkailo

Problem
I have two identical ReplacingMergeTree tables with identical data except that one table uses the is_deleted table option and the other does not. The queries that use the table with is_deleted option work much slower than queries that use table without is_deleted option but filter deleted rows explicitly.

In other words, SELECT count(*) FROM table_with_is_deleted FINAL work much slower than SELECT count(*) FROM table_without_is_deleted WHERE deleted = 0

Observations

  1. The output of EXPLAIN is the same
  2. For CH 24.5, in system.query_log, I see the slow queries use the following functions: ['and', 'not', 'notIn', 'equals', 'greater', 'tuple', 'CAST'], fast queries use ['notIn', 'and', 'equals']. For CH 24.2 it is ['equals', 'not', 'and', 'greater', 'tuple', 'CAST'] and ['equals', 'and'] respectively.
  3. In query profile, slow queries read much more data
  4. Queries on tables with is_deleted option on ClickHouse CH 24.5 are 20% slower than the same queries in CH 24.4.
  5. It does not matter if tables are optimized i.e. have just one part or not optimized, but have the same number of active parts.
  6. When I was working on generating sample data for this ticket, I noticed that I could reproduce the issue only when I generate quite a bit of data, on small data sets the queries are fast (used_functions are just ['notIn', 'and', 'equals']), I also noticed that with the cardinality of data is high, then both queries are slow (used_functions are ['and', 'not', 'notIn', 'equals', 'greater', 'tuple', 'CAST'])

How to reproduce

  • Which ClickHouse server version to use
    CH 24.2, 24.4, 24.5
  • CREATE TABLE statements for all tables involved
CREATE TABLE table_with_is_deleted
(
    `school_year_id` UInt32,
    `district_id` UInt32,
    `student_id` UInt32,
    `assessment_id` UInt32,
    `value` Int32,
    `etl_timestamp` Int64,
    `deleted` UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(etl_timestamp, deleted)
ORDER BY (school_year_id, district_id, student_id, assessment_id);

CREATE TABLE table_without_is_deleted
(
    `school_year_id` UInt32,
    `district_id` UInt32,
    `student_id` UInt32,
    `assessment_id` UInt32,
    `value` Int32,
    `etl_timestamp` Int64,
    `deleted` UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(etl_timestamp)
ORDER BY (school_year_id, district_id, student_id, assessment_id);
  • Sample data for all these tables, use [clickhouse-obfuscator]
truncate table table_with_is_deleted;

INSERT INTO table_with_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
    1 as school_year_id,
    1 as district_id,
    students.number AS student_id,
    assessments.number AS assessment_id,
    rand32() as value,
    now() as etl_timestamp,
    0 as deleted
FROM numbers(10000) as students
CROSS JOIN numbers(200) as assessments;


INSERT INTO table_with_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
    6 as school_year_id,
    6 as district_id,
    students.number AS student_id,
    assessments.number AS assessment_id,
    rand32() as value,
    now() as etl_timestamp,
    0 as deleted
FROM numbers(30000) as students
CROSS JOIN numbers(100) as assessments;

truncate table table_without_is_deleted;
INSERT INTO table_without_is_deleted (school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted)
SELECT
    school_year_id, district_id, student_id, assessment_id, value, etl_timestamp, deleted
FROM table_with_is_deleted;

OPTIMIZE TABLE table_with_is_deleted FINAL;
OPTIMIZE TABLE table_without_is_deleted FINAL;

  • Queries to run that lead to slow performance
    select count(*) from table_without_is_deleted FINAL where school_year_id = 6 and district_id = 6 and deleted = 0;
    select count(*) from table_with_is_deleted FINAL where school_year_id = 6 and district_id = 6;
    select count(*) from table_without_is_deleted FINAL where deleted = 0;
    select count(*) from table_with_is_deleted FINAL;

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions