Optimize Analyzer with repeat aliases#88043
Optimize Analyzer with repeat aliases#88043maxjustus wants to merge 44 commits intoClickHouse:masterfrom
Conversation
|
Workflow [PR], commit [7d6f965] Summary: ❌
AI ReviewSummaryThis PR adds identifier-resolution caching in the analyzer to reduce repeated alias resolution and AST growth, with targeted cache disabling for unsafe phases (JOIN tree init, Missing context
ClickHouse Rules
Final Verdict
|
|
Just noticed this in the fast tests: I'll fix that really quick. |
|
ok fixed that issue |
| } | ||
| } | ||
|
|
||
| std::optional<Block> validateDefaultsWithAnalyzer(ASTPtr default_expr_list, const NamesAndTypesList & all_columns, ContextPtr context, bool get_sample_block) |
There was a problem hiding this comment.
It'd be better to move this change into a separate pull request because these changes touch absolutely different logic.
There was a problem hiding this comment.
Agreed. Just rebased and force pushed so this just has the cached alias changes. I'll PR the new analyzer CREATE TABLE changes shortly.
859336a to
7810f1a
Compare
|
hmm, server died during fast test. I ran each of the non-distributed test queries locally and they succeeded.. Not seeing any useful details in the test logs explaining why it crashed |
e6b0e44 to
e2eb747
Compare
|
Found the segfault.. Working on it. |
3e7a409 to
d29dad6
Compare
|
Ok @novikd I think this is good to go |
5889074 to
7c61494
Compare
* clone on retrieval from alias cache if alias query tree contains query nodes or context contains aggregate function so createUniqueAliasesIfNecessary has distinct query tree nodes to rewrite based on context * reset alias cache before resolving prewhere * update reference files
This fixes the issue with using alias expression cache with prewhere. the code sets context.allow_resolve_from_using to false but the alias expression cache did not honor that before, so it would resolve any cached alias expression resolved from using.
…he conditions the only condition where we need to clone from cache now is if rewrite_in_to_join is enabled and an alias expression contains an IN/exists function with a subquery as args, since that will get rewritten to a join requiring unique table aliases for each instance of the alias expression in the query if there are multiple.
query times out after 15 seconds when run without alias expression cache.
…ng allow_resolve_from_using to cache key
… using in function cache scope
The interpolate_list variable was used but never defined, causing a build failure. This builds the NameSet from query interpolate nodes and adds the parameter to the function signature.
The `__interpolate` function was added in PR ClickHouse#93197 and subsequently reverted. This branch had code wrapping INTERPOLATE columns in `__interpolate` function calls, but the function implementation no longer exists after rebase. Changes: - Remove `interpolate_list` parameter from `resolveProjectionExpressionNodeList` - Remove `__interpolate` wrapping logic from projection resolution - Remove construction of `interpolate_list` set in `resolveQuery`
…ing rebase Upstream added niladic function resolution code using the old variable name `identifier_resolve_settings`, which was renamed to `identifier_resolve_context` by a branch commit. Two references were missed during rebase conflict resolution.
…e_nulls` When `group_by_use_nulls` is true, `resolved_expressions` is cleared to force re-resolution with nullable wrapping. The identifier cache was not cleared, leaving stale entries cached during WHERE (before `nullable_group_by_keys` is populated). Not a correctness bug today since nullable wrapping happens after cache lookup, but the asymmetry is fragile.
…ution inside IN functions Commit ed9ed33 replaced the per-IN-function cache scoping with a `hasSubqueryNode` exclusion at cache insertion time. That only prevented caching the top-level expression containing a subquery, but identifiers resolved *inside* different IN function instances (like `number`) were still cached and shared, causing `createUniqueAliasesIfNecessary` to produce inconsistent table alias numbering. Restore the `in_function_instance_id` approach: each IN function gets a unique ID in the cache key so identifiers resolved within different IN contexts get independent cache entries. Remove `hasSubqueryNode` (now dead code). Fixes test `03316_analyzer_unique_table_aliases_dist`.
Move the clone that prevents `createUniqueAliasesIfNecessary` from mutating shared query tree nodes from the identifier cache retrieval path in `tryResolveIdentifier` to the actual mutation site in `createUniqueAliasesIfNecessary.cpp`. - Remove 12-line conditional clone block from `tryResolveIdentifier` that checked `rewrite_in_to_join`, `isNameOfLocalInFunction`, etc. - Add `arg = arg->clone()` before `CreateUniqueTableAliasesVisitor` runs on IN subquery arguments, so each IN function gets its own independent copy regardless of how it was resolved.
Clone `node` (the IN function) rather than just its subquery argument. When the IN function is a child of an alias expression shared via the alias cache, cloning only the arg mutates the shared function node in-place — the second visitor pass overwrites the first's work, leaving both references pointing to the same aliased subquery. Cloning the whole node via `node = node->clone()` replaces the parent's child pointer, so each shared reference gets its own independent copy.
…ed are co-located also - re-adds clone on retrieve for in to join transform - limits cache use to when default flags are used
…r cache clone check `createUniqueAliasesIfNecessary` mutates all IN variants (including `globalIn`, `globalNotIn`, etc.) via `isNameOfInFunction`. `containsInOrExistsFunction` was only checking local variants, so cached `globalIn` expressions wouldn't get cloned on retrieval.
The old analyzer raises CYCLIC_ALIASES for ambiguous self-referencing aliases; the new analyzer resolves them to the alias reference. Scope the setting override to just that query so the remaining queries in the test still exercise the new analyzer code path.
Required for compatibility tracking.
…ion with subquery arguments Replace `isNameOfInFunction` name-based check with structural `isSubqueryNodeType` check on function arguments. This covers `EXISTS` and any future function taking subquery arguments, not just the 16 IN variants. - Rename `in_function_instance_*` to `subquery_function_instance_*` - Add `functionHasSubqueryArgument` helper using `isSubqueryNodeType`
The structural `functionHasSubqueryArgument` check was unstable across push/pop because resolution can mutate function arguments between the two calls. A subquery argument present at push time could be resolved/replaced before pop, causing the pop to skip decrementing `subquery_function_instance_stack` and eventually crashing. Use `isNameOfInFunction` plus `"exists"` instead — function names are stable across the resolution lifecycle.
It's less optimal and the approach this removes should work and passes tests, but disabling cache inside subqueries is more conservative which makes some sense for the initial introduction of this cache.
LLVM Coverage Report
Changed lines: 95.76% (226/236) · Uncovered code |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add identifier resolution caching to prevent duplicate identifier resolution during query analysis.
Resolves #86019.
Resolves #72721.
Documentation entry for user-facing changes
Details
Optimizes the new analyzer when handling multiple instances of the same alias expression in a query by caching the resolved alias expression node and reusing it. This resolves #86019
test query 1:
test query 2
Note
Medium Risk
Changes core analyzer identifier-resolution behavior by introducing per-scope caching with several conditional bypasses; incorrect cache eligibility could cause subtle mis-resolutions in queries with aliases/JOIN nullability/PREWHERE.
Overview
Improves analyzer performance by adding a per-
IdentifierResolveScopecache for resolved identifiers and wiring it intoQueryAnalyzer::tryResolveIdentifierto reuse results for repeated lookups.Adds guardrails to avoid incorrect reuse by disabling the cache during JOIN-tree initialization/resolution and PREWHERE, skipping caching for non-initial alias-resolution contexts, active self-referential alias resolution, and
nullable_group_by_keys.Fixes shared-node mutation hazards introduced by caching by cloning reused subtrees before mutation in
createUniqueAliasesIfNecessary(IN-subquery revisit) and when rewritingINtoEXISTSinresolveFunction.cpp. Adds a new performance test for deep alias chains and updates multiple query-tree reference outputs to reflect the new node reuse/IDs and ORDER BY normalization.Written by Cursor Bugbot for commit 3b35a50. This will update automatically on new commits. Configure here.