Skip to content

Optimize inverse dictionary lookup#88971

Merged
nihalzp merged 65 commits intoClickHouse:masterfrom
nihalzp:rev-dict-lookup-optimize
Nov 29, 2025
Merged

Optimize inverse dictionary lookup#88971
nihalzp merged 65 commits intoClickHouse:masterfrom
nihalzp:rev-dict-lookup-optimize

Conversation

@nihalzp
Copy link
Copy Markdown
Member

@nihalzp nihalzp commented Oct 24, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Optimize repeated inverse dictionary lookups by doing faster lookups into a precomputed set of possible key values. Closes #7968.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Directly edit documentation source files in the "docs" folder with the same pull-request as code changes

or

Add a user-readable short description of the changes that should be added to docs.clickhouse.com below.

At a minimum, the following information should be added (but add more as needed).

  • Motivation: Why is this function, table engine, etc. useful to ClickHouse users?

  • Parameters: If the feature being added takes arguments, options or is influenced by settings, please list them below with a brief explanation.

  • Example use: A query or command.

Details

Optimize WHERE dictGetFamily(..., ATTR_COL, COL) = CONSTEXPR into WHERE COL IN (SELECT ... FROM dictionary WHERE ATTR_COL = CONSTEXPR)

Supports:

  • dictGet families like dictGetString, dictGetInt32, etc.
  • Dictionaries with composite keys.
  • Comparison operators: =, !=, <, <=, >, >=, match, LIKE, ILIKE and their negations.

Example:

SELECT col FROM tab WHERE dictGet(DICT_NAME, DICT_ATTRIBUTE_COL, col) = CONSTEXPR;

Result:

SELECT col FROM t WHERE col IN (SELECT DICT_KEY_COL FROM dictionary(DICT_NAME) WHERE DICT_ATTRIBUTE_COL = CONSTEXPR);

Performance:

DROP TABLE IF EXISTS ref_colors; CREATE TABLE ref_colors (id UInt32, name String) ENGINE = MergeTree ORDER BY id;

INSERT INTO ref_colors SELECT number, if((number) % 10000 = 0, 'red', 'blue') FROM numbers(1000000); DROP DICTIONARY IF EXISTS colors; CREATE DICTIONARY colors (id UInt32, name String) PRIMARY KEY id SOURCE(CLICKHOUSE(TABLE 'ref_colors')) LAYOUT(HASHED()) LIFETIME(0);

DROP TABLE IF EXISTS t; CREATE TABLE t (color_id UInt32, payload String) ENGINE = MergeTree ORDER BY color_id; INSERT INTO t SELECT cityHash64(number) % 1000000 AS color_id, toString(number) FROM numbers_mt(50000000);

Before:

SET optimize_inverse_dictionary_lookup = 0;
SELECT count() FROM t WHERE dictGet('colors','name', color_id) = 'red';

Elapsed: 0.198 sec.

Now:

SET optimize_inverse_dictionary_lookup = 1;
SELECT count() FROM t WHERE dictGet('colors','name', color_id) = 'red';

Elapsed: 0.056 sec.

Resolves #7968

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Oct 24, 2025

Workflow [PR], commit [4369634]

Summary:

job_name test_name status info comment
Stateless tests (amd_msan, parallel) failure
00816_long_concurrent_alter_column FAIL cidb, issue
00906_low_cardinality_cache FAIL cidb, issue
02992_all_columns_should_have_comment FAIL cidb
BuzzHouse (amd_debug) failure
Logical error: 'Inconsistent AST formatting: the query: FAIL cidb, issue

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Oct 24, 2025
@UnamedRus
Copy link
Copy Markdown
Contributor

UnamedRus commented Oct 24, 2025

SET optimize_inverse_dictionary_lookup = 0;

It is OK, that it's under setting, because there is certain problem that a lot of default dictionary layouts quite bad on access not by key column.

When people manually do this kind of optimization, sometimes they do it not against dictionary, but "SOURCE" table of it.
(But this raise other questions about non consistency with data in dictionary and etc)

Another, more complex alternative, is to build cache of color_id -> result of expression: dictGet('colors','name', xxx) = 'red'

@nihalzp
Copy link
Copy Markdown
Member Author

nihalzp commented Oct 25, 2025

Another, more complex alternative, is to build cache of color_id -> result of expression: dictGet('colors','name', xxx) = 'red'

Actually, I think the complex alternative that you mentioned is happening under the hood with this optimization. For example, for this query:

EXPLAIN PLAN SELECT color_id, payload FROM t WHERE dictGetString('colors', 'name', color_id) = 'red' ORDER BY color_id, payload;

The output is:

CreatingSets (Create sets before main query execution)
  Expression (Project names)
    Sorting (Sorting for ORDER BY)
      Expression ((Before ORDER BY + Projection))
        Expression ((WHERE + Change column names to column identifiers))
          ReadFromMergeTree (default.t)

My understanding is that it means dictGetString('colors', 'name', color_id) = 'red' is converted to a set in which later we are doing repeated lookups via IN, completely skipping dictionary lookup per row which could be slow.

@nihalzp nihalzp requested a review from novikd November 12, 2025 07:07
Copy link
Copy Markdown
Member

@novikd novikd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nihalzp
Copy link
Copy Markdown
Member Author

nihalzp commented Nov 29, 2025

@nihalzp nihalzp added this pull request to the merge queue Nov 29, 2025
Merged via the queue into ClickHouse:master with commit c607c00 Nov 29, 2025
247 of 256 checks passed
@nihalzp nihalzp deleted the rev-dict-lookup-optimize branch November 29, 2025 03:14
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Nov 29, 2025
@nihalzp nihalzp restored the rev-dict-lookup-optimize branch November 29, 2025 04:40
@fm4v
Copy link
Copy Markdown
Member

fm4v commented Dec 1, 2025

@nihalzp
Copy link
Copy Markdown
Member Author

nihalzp commented Dec 1, 2025

@fm4v Most likely not caused by this. The optimization in this PR is only applied for dictGet(....) = 'const' form which is not the case in the slowed down queries.

https://fiddle.clickhouse.com/8ba04806-58a1-4a09-b458-6b8e0ad6432e

It is confirmed by EXPLAIN SYNTAX. We can see that optimization is not applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimise query where with injective dictionary

5 participants