Skip to content

Optimize Analyzer with repeat aliases#88043

Open
maxjustus wants to merge 44 commits intoClickHouse:masterfrom
maxjustus:optimize-query-analysis-4
Open

Optimize Analyzer with repeat aliases#88043
maxjustus wants to merge 44 commits intoClickHouse:masterfrom
maxjustus:optimize-query-analysis-4

Conversation

@maxjustus
Copy link
Copy Markdown
Contributor

@maxjustus maxjustus commented Oct 2, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add identifier resolution caching to prevent duplicate identifier resolution during query analysis.

Resolves #86019.
Resolves #72721.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Details

Optimizes the new analyzer when handling multiple instances of the same alias expression in a query by caching the resolved alias expression node and reusing it. This resolves #86019

test query 1:

/*
-- Master
Executed in    7.49 secs    fish           external
   usr time    7.19 secs    0.29 millis    7.19 secs
   sys time    0.80 secs    1.54 millis    0.80 secs

-- This branch
Executed in    1.99 secs    fish           external
   usr time    1.78 secs    0.22 millis    1.78 secs
   sys time    0.62 secs    1.20 millis    0.62 secs
*/

SELECT
    multiIf(number = 0, 0, number = 1, 2, number = 2, 4, number = 3, 6, number = 4, 8, number = 5, 10, number = 6, 12, number = 7, 14, number = 8, 16, number = 9, 18, number = 10, 20, number = 11, 22, number = 12, 24, number = 13, 26, number = 14, 28, number = 15, 30, number = 16, 32, number = 17, 34, number = 18, 36, number = 19, 38, number = 20, 40, number = 21, 42, number = 22, 44, number = 23, 46, number = 24, 48, number = 25, 50, number = 26, 52, number = 27, 54, number = 28, 56, number = 29, 58, number = 30, 60, number = 31, 62, number = 32, 64, number = 33, 66, number = 34, 68, number = 35, 70, number = 36, 72, number = 37, 74, number = 38, 76, number = 39, 78, number = 40, 80, number = 41, 82, number = 42, 84, number = 43, 86, number = 44, 88, number = 45, 90, number = 46, 92, number = 47, 94, number = 48, 96, number = 49, 98, number = 50, 100, number = 51, 102, number = 52, 104, number = 53, 106, number = 54, 108, number = 55, 110, number = 56, 112, number = 57, 114, number = 58, 116, number = 59, 118, number = 60, 120, number = 61, 122, number = 62, 124, number = 63, 126, number = 64, 128, number = 65, 130, number = 66, 132, number = 67, 134, number = 68, 136, number = 69, 138, number = 70, 140, number = 71, 142, number = 72, 144, number = 73, 146, number = 74, 148, number = 75, 150, number = 76, 152, number = 77, 154, number = 78, 156, number = 79, 158, number = 80, 160, number = 81, 162, number = 82, 164, number = 83, 166, number = 84, 168, number = 85, 170, number = 86, 172, number = 87, 174, number = 88, 176, number = 89, 178, number = 90, 180, number = 91, 182, number = 92, 184, number = 93, 186, number = 94, 188, number = 95, 190, number = 96, 192, number = 97, 194, number = 98, 196, number = 99, 198, number) AS number_revised,
    multiIf(number_revised = 0, 0, number_revised = 1, 2, number_revised = 2, 4, number_revised = 3, 6, number_revised = 4, 8, number_revised = 5, 10, number_revised = 6, 12, number_revised = 7, 14, number_revised = 8, 16, number_revised = 9, 18, number_revised = 10, 20, number_revised = 11, 22, number_revised = 12, 24, number_revised = 13, 26, number_revised = 14, 28, number_revised = 15, 30, number_revised = 16, 32, number_revised = 17, 34, number_revised = 18, 36, number_revised = 19, 38, number_revised = 20, 40, number_revised = 21, 42, number_revised = 22, 44, number_revised = 23, 46, number_revised = 24, 48, number_revised = 25, 50, number_revised = 26, 52, number_revised = 27, 54, number_revised = 28, 56, number_revised = 29, 58, number_revised = 30, 60, number_revised = 31, 62, number_revised = 32, 64, number_revised = 33, 66, number_revised = 34, 68, number_revised = 35, 70, number_revised = 36, 72, number_revised = 37, 74, number_revised = 38, 76, number_revised = 39, 78, number_revised = 40, 80, number_revised = 41, 82, number_revised = 42, 84, number_revised = 43, 86, number_revised = 44, 88, number_revised = 45, 90, number_revised = 46, 92, number_revised = 47, 94, number_revised = 48, 96, number_revised = 49, 98, number_revised = 50, 100, number_revised = 51, 102, number_revised = 52, 104, number_revised = 53, 106, number_revised = 54, 108, number_revised = 55, 110, number_revised = 56, 112, number_revised = 57, 114, number_revised = 58, 116, number_revised = 59, 118, number_revised = 60, 120, number_revised = 61, 122, number_revised = 62, 124, number_revised = 63, 126, number_revised = 64, 128, number_revised = 65, 130, number_revised = 66, 132, number_revised = 67, 134, number_revised = 68, 136, number_revised = 69, 138, number_revised = 70, 140, number_revised = 71, 142, number_revised = 72, 144, number_revised = 73, 146, number_revised = 74, 148, number_revised = 75, 150, number_revised = 76, 152, number_revised = 77, 154, number_revised = 78, 156, number_revised = 79, 158, number_revised = 80, 160, number_revised = 81, 162, number_revised = 82, 164, number_revised = 83, 166, number_revised = 84, 168, number_revised = 85, 170, number_revised = 86, 172, number_revised = 87, 174, number_revised = 88, 176, number_revised = 89, 178, number_revised = 90, 180, number_revised = 91, 182, number_revised = 92, 184, number_revised = 93, 186, number_revised = 94, 188, number_revised = 95, 190, number_revised = 96, 192, number_revised = 97, 194, number_revised = 98, 196, number_revised = 99, 198, number_revised) AS number_revised_2,
    multiIf(number_revised_2 = 0, 0, number_revised_2 = 1, 2, number_revised_2 = 2, 4, number_revised_2 = 3, 6, number_revised_2 = 4, 8, number_revised_2 = 5, 10, number_revised_2) AS number_revised_3,
    multiIf(number_revised_3 = 0, 0, number_revised_3 = 1, 2, number_revised_3 = 2, 4, number_revised_3) AS number_revised_4
FROM system.numbers
LIMIT 100
SETTINGS enable_analyzer = 1, max_expanded_ast_elements = 5000000000;

test query 2

/*
Master:
  enable_analyzer=1 (peak memory usage 15GB)
  ________________________________________________________
  Executed in  100.74 secs    fish           external
     usr time   97.57 secs    0.26 millis   97.57 secs
     sys time    3.62 secs    1.83 millis    3.61 secs

  enable_analyzer=0 (peak memory usage 10GB)
  ________________________________________________________
  Executed in   14.10 secs    fish           external
     usr time   12.82 secs    0.35 millis   12.82 secs
     sys time    1.75 secs    2.32 millis    1.75 secs

This branch:
  enable_analyzer=1 (peak memory usage 5.7GB)
  ________________________________________________________
  Executed in   15.41 secs    fish           external
     usr time   13.91 secs    0.28 millis   13.91 secs
     sys time    1.97 secs    2.27 millis    1.97 secs

  enable_analyzer=0 (peak memory usage 10GB)
  ________________________________________________________
  Executed in   14.32 secs    fish           external
     usr time   12.99 secs    0.24 millis   12.99 secs
     sys time    1.80 secs    2.04 millis    1.79 secs
*/

set enable_analyzer = 1;

CREATE TABLE sample_table (
    id UInt64,
    value UInt64
) 
ENGINE = MergeTree()
ORDER BY id;

-- Insert sample data
INSERT INTO sample_table VALUES (1, 10);


WITH
    -- Deeply nested IF statements for col1
    if(value = 10, 'A',
        if(value = 20, 'B',
            if(value = 30, 'C',
                if(value = 40, 'D',
                    if(value = 50, 'E', 'F')
                )
            )
        )
    ) AS col1,

    -- Another deeply nested IF statements relying on col1
    if(col1 = 'A', 'Alpha',
        if(col1 = 'B', 'Beta',
            if(col1 = 'C', 'Gamma',
                if(col1 = 'D', 'Delta',
                    if(col1 = 'E', 'Epsilon', 'Other')
                )
            )
        )
    ) AS col2,

    -- Deeply nested IF statements relying on col2
    if(col2 = 'Alpha', 1,
        if(col2 = 'Beta', 2,
            if(col2 = 'Gamma', 3,
                if(col2 = 'Delta', 4,
                    if(col2 = 'Epsilon', 5, 0)
                )
            )
        )
    ) AS col3,

    -- Deeply nested IF statements relying on col3
    if(col3 = 1, 'One',
        if(col3 = 2, 'Two',
            if(col3 = 3, 'Three',
                if(col3 = 4, 'Four',
                    if(col3 = 5, 'Five', 'Zero')
                )
            )
        )
    ) AS col4,

    -- Deeply nested IF statements relying on col4
    if(col4 = 'One', 'Uno',
        if(col4 = 'Two', 'Dos',
            if(col4 = 'Three', 'Tres',
                if(col4 = 'Four', 'Cuatro',
                    if(col4 = 'Five', 'Cinco', 'Cero')
                )
            )
        )
    ) AS col5,

    -- Deeply nested IF statements relying on col5
    if(col5 = 'Uno', 'I',
        if(col5 = 'Dos', 'II',
            if(col5 = 'Tres', 'III',
                if(col5 = 'Cuatro', 'IV',
                    if(col5 = 'Cinco', 'V', 'Other')
                )
            )
        )
    ) AS col6,

    -- Deeply nested IF statements relying on col6
    if(col6 = 'I', 'Primero',
        if(col6 = 'II', 'Segundo',
            if(col6 = 'III', 'Tercero',
                if(col6 = 'IV', 'Cuarto',
                    if(col6 = 'V', 'Quinto', 'Otro')
                )
            )
        )
    ) AS col7,

    -- Deeply nested IF statements relying on col7
    if(col7 = 'Primero', 'First',
        if(col7 = 'Segundo', 'Second',
            if(col7 = 'Tercero', 'Third',
                if(col7 = 'Cuarto', 'Fourth',
                    if(col7 = 'Quinto', 'Fifth', 'Other')
                )
            )
        )
    ) AS col8,
    if(col8 = 'First', 1, 0) AS col9,
    if(col9 = 1, 'One', 'Zero') AS col10,
    if(col10 = 'One', if(col1 = 'A', 'A-One', 'Other'), 'Zero') AS col11

SELECT 
  --  id, 
    col1, 
    col2, 
    col3, 
    col4, 
    col5, 
    col6, 
    col7, 
    col8,
    col9,
    col10,
    col11
FROM sample_table
settings max_expanded_ast_elements  = 5000000000;

Note

Medium Risk
Changes core analyzer identifier-resolution behavior by introducing per-scope caching with several conditional bypasses; incorrect cache eligibility could cause subtle mis-resolutions in queries with aliases/JOIN nullability/PREWHERE.

Overview
Improves analyzer performance by adding a per-IdentifierResolveScope cache for resolved identifiers and wiring it into QueryAnalyzer::tryResolveIdentifier to reuse results for repeated lookups.

Adds guardrails to avoid incorrect reuse by disabling the cache during JOIN-tree initialization/resolution and PREWHERE, skipping caching for non-initial alias-resolution contexts, active self-referential alias resolution, and nullable_group_by_keys.

Fixes shared-node mutation hazards introduced by caching by cloning reused subtrees before mutation in createUniqueAliasesIfNecessary (IN-subquery revisit) and when rewriting IN to EXISTS in resolveFunction.cpp. Adds a new performance test for deep alias chains and updates multiple query-tree reference outputs to reflect the new node reuse/IDs and ORDER BY normalization.

Written by Cursor Bugbot for commit 3b35a50. This will update automatically on new commits. Configure here.

@novikd novikd self-assigned this Oct 2, 2025
@novikd novikd added the can be tested Allows running workflows for external contributors label Oct 2, 2025
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Oct 2, 2025

Workflow [PR], commit [7d6f965]

Summary:

job_name test_name status info comment
Integration tests (amd_asan_ubsan, targeted) failure
OOM in dmesg FAIL cidb
Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel) failure
01323_redundant_functions_in_order_by FAIL cidb
Integration tests (amd_llvm_coverage, 2/5) failure
test_overcommit_tracker/test.py::test_user_overcommit FAIL cidb
Stress test (amd_debug) failure
Logical error: '(isConst() || isSparse() || isReplicated()) ? getDataType() == rhs.getDataType() : typeid(*this) == typeid(rhs)' (STID: 2508-35cc) FAIL cidb
Stress test (amd_tsan) failure
Logical error: Shard number is greater than shard count: shard_num=A shard_count=B cluster=C (STID: 5066-457d) FAIL cidb
Stress test (amd_msan) failure
Logical error: 'std::exception. Code: 1001, type: std::length_error, e.what() = basic_string (version 26.4.1.426), Stack trace: (STID: 2508-34fb) FAIL cidb
Stress test (arm_asan_ubsan) failure
UndefinedBehaviorSanitizer: undefined behavior (STID: 2527-3af3) FAIL cidb
Stress test (arm_asan_ubsan, s3) failure
Hung check failed, possible deadlock found FAIL cidb
Stress test (arm_ubsan) failure
Logical error: 'No available columns' (STID: 3938-33a6) FAIL cidb
Performance Comparison (amd_release, master_head, 3/6) failure
Check Results failure

AI Review

Summary

This PR adds identifier-resolution caching in the analyzer to reduce repeated alias resolution and AST growth, with targeted cache disabling for unsafe phases (JOIN tree init, PREWHERE) and cloning safeguards for mutable IN/EXISTS rewrite paths. I reviewed the changed core analyzer files and tests for correctness, safety, and performance risks, and also checked existing inline comments from clickhouse-gh[bot] to avoid duplication. I did not find any additional high-confidence issues beyond already-posted bot comments.

Missing context
  • ⚠️ No CI logs/results were provided in this task context, so runtime validation status could not be independently confirmed here.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Oct 2, 2025
@maxjustus
Copy link
Copy Markdown
Contributor Author

Just noticed this in the fast tests:

2025-10-02 21:41:47 00933_alter_ttl:                                                        [ FAIL ] 0.09 sec.
2025-10-02 21:41:47 Reason: return code:  51
2025-10-02 21:41:47 Received exception from server (version 25.10.1):
2025-10-02 21:41:47 Code: 51. DB::Exception: Received from localhost:9000. DB::Exception: Empty list of columns in projection. In scope SELECT  FROM dummy.dummy: default expression and column type are incompatible.. (EMPTY_LIST_OF_COLUMNS_QUERIED)
2025-10-02 21:41:47 (query: alter table ttl modify ttl d + interval 1 day;)

I'll fix that really quick.

@maxjustus
Copy link
Copy Markdown
Contributor Author

ok fixed that issue

}
}

std::optional<Block> validateDefaultsWithAnalyzer(ASTPtr default_expr_list, const NamesAndTypesList & all_columns, ContextPtr context, bool get_sample_block)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be better to move this change into a separate pull request because these changes touch absolutely different logic.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Just rebased and force pushed so this just has the cached alias changes. I'll PR the new analyzer CREATE TABLE changes shortly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok here's the other PR: #88087

@maxjustus maxjustus force-pushed the optimize-query-analysis-4 branch from 859336a to 7810f1a Compare October 3, 2025 16:22
@maxjustus maxjustus changed the title Optimize Analyzer with repeat aliases and enable for CREATE/ALTER TABLE Optimize Analyzer with repeat aliases Oct 3, 2025
@maxjustus
Copy link
Copy Markdown
Contributor Author

hmm, server died during fast test. I ran each of the non-distributed test queries locally and they succeeded.. Not seeing any useful details in the test logs explaining why it crashed

@maxjustus maxjustus force-pushed the optimize-query-analysis-4 branch 5 times, most recently from e6b0e44 to e2eb747 Compare October 3, 2025 23:13
@maxjustus
Copy link
Copy Markdown
Contributor Author

Found the segfault.. Working on it.

@maxjustus maxjustus force-pushed the optimize-query-analysis-4 branch 12 times, most recently from 3e7a409 to d29dad6 Compare October 7, 2025 17:30
@maxjustus
Copy link
Copy Markdown
Contributor Author

Ok @novikd I think this is good to go

@maxjustus maxjustus force-pushed the optimize-query-analysis-4 branch from 5889074 to 7c61494 Compare October 15, 2025 18:36
novikd and others added 29 commits March 30, 2026 10:08
* clone on retrieval from alias cache if alias query tree contains
  query nodes or context contains aggregate function so createUniqueAliasesIfNecessary
  has distinct query tree nodes to rewrite based on context
* reset alias cache before resolving prewhere
* update reference files
This fixes the issue with using alias expression cache with prewhere.
the code sets context.allow_resolve_from_using to false but the alias
expression cache did not honor that before, so it would resolve any
cached alias expression resolved from using.
…he conditions

the only condition where we need to clone from cache now is if
rewrite_in_to_join is enabled and an alias expression contains an
IN/exists function with a subquery as args, since that will get
rewritten to a join requiring unique table aliases for each instance of
the alias expression in the query if there are multiple.
query times out after 15 seconds when run without alias expression cache.
The interpolate_list variable was used but never defined, causing a
build failure. This builds the NameSet from query interpolate nodes
and adds the parameter to the function signature.
The `__interpolate` function was added in PR ClickHouse#93197 and subsequently
reverted. This branch had code wrapping INTERPOLATE columns in
`__interpolate` function calls, but the function implementation no
longer exists after rebase.

Changes:
- Remove `interpolate_list` parameter from `resolveProjectionExpressionNodeList`
- Remove `__interpolate` wrapping logic from projection resolution
- Remove construction of `interpolate_list` set in `resolveQuery`
…ing rebase

Upstream added niladic function resolution code using the old variable name
`identifier_resolve_settings`, which was renamed to `identifier_resolve_context`
by a branch commit. Two references were missed during rebase conflict resolution.
…e_nulls`

When `group_by_use_nulls` is true, `resolved_expressions` is cleared to
force re-resolution with nullable wrapping. The identifier cache was not
cleared, leaving stale entries cached during WHERE (before
`nullable_group_by_keys` is populated). Not a correctness bug today
since nullable wrapping happens after cache lookup, but the asymmetry
is fragile.
…ution inside IN functions

Commit ed9ed33 replaced the per-IN-function cache scoping with a
`hasSubqueryNode` exclusion at cache insertion time. That only prevented
caching the top-level expression containing a subquery, but identifiers
resolved *inside* different IN function instances (like `number`) were
still cached and shared, causing `createUniqueAliasesIfNecessary` to
produce inconsistent table alias numbering.

Restore the `in_function_instance_id` approach: each IN function gets a
unique ID in the cache key so identifiers resolved within different IN
contexts get independent cache entries. Remove `hasSubqueryNode` (now
dead code).

Fixes test `03316_analyzer_unique_table_aliases_dist`.
Move the clone that prevents `createUniqueAliasesIfNecessary` from
mutating shared query tree nodes from the identifier cache retrieval
path in `tryResolveIdentifier` to the actual mutation site in
`createUniqueAliasesIfNecessary.cpp`.

- Remove 12-line conditional clone block from `tryResolveIdentifier`
  that checked `rewrite_in_to_join`, `isNameOfLocalInFunction`, etc.
- Add `arg = arg->clone()` before `CreateUniqueTableAliasesVisitor`
  runs on IN subquery arguments, so each IN function gets its own
  independent copy regardless of how it was resolved.
Clone `node` (the IN function) rather than just its subquery argument.
When the IN function is a child of an alias expression shared via the
alias cache, cloning only the arg mutates the shared function node
in-place — the second visitor pass overwrites the first's work, leaving
both references pointing to the same aliased subquery.

Cloning the whole node via `node = node->clone()` replaces the parent's
child pointer, so each shared reference gets its own independent copy.
…ed are co-located

also
- re-adds clone on retrieve for in to join transform
- limits cache use to when default flags are used
…r cache clone check

`createUniqueAliasesIfNecessary` mutates all IN variants (including
`globalIn`, `globalNotIn`, etc.) via `isNameOfInFunction`.
`containsInOrExistsFunction` was only checking local variants, so
cached `globalIn` expressions wouldn't get cloned on retrieval.
The old analyzer raises CYCLIC_ALIASES for ambiguous self-referencing
aliases; the new analyzer resolves them to the alias reference. Scope
the setting override to just that query so the remaining queries in the
test still exercise the new analyzer code path.
…ion with subquery arguments

Replace `isNameOfInFunction` name-based check with structural
`isSubqueryNodeType` check on function arguments. This covers `EXISTS`
and any future function taking subquery arguments, not just the 16 IN
variants.

- Rename `in_function_instance_*` to `subquery_function_instance_*`
- Add `functionHasSubqueryArgument` helper using `isSubqueryNodeType`
The structural `functionHasSubqueryArgument` check was unstable across
push/pop because resolution can mutate function arguments between the
two calls. A subquery argument present at push time could be
resolved/replaced before pop, causing the pop to skip decrementing
`subquery_function_instance_stack` and eventually crashing.

Use `isNameOfInFunction` plus `"exists"` instead — function names are
stable across the resolution lifecycle.
It's less optimal and the approach this removes should work and passes
tests, but disabling cache inside subqueries is more conservative which
makes some sense for the initial introduction of this cache.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 30, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.70% 76.70% +0.00%

Changed lines: 95.76% (226/236) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

2 participants