Skip to content

Hypothesis index in the analyzer#48381

Closed
antonio2368 wants to merge 16 commits intomasterfrom
analyzer-hypothesis-index
Closed

Hypothesis index in the analyzer#48381
antonio2368 wants to merge 16 commits intomasterfrom
analyzer-hypothesis-index

Conversation

@antonio2368
Copy link
Copy Markdown
Member

@antonio2368 antonio2368 commented Apr 4, 2023

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Apr 4, 2023
@novikd novikd self-assigned this Apr 4, 2023
/// transform active formulas
for (auto & formula : active_atomic_formulas)
{
formula = formula->clone(); /// do all operations with copy
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Query Tree it'll clone the whole subtree even when we don't need to modify the node. I think it's better to avoid it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, QueryTree is constructed inside this class + when we create CNF we clone the nodes again.
And throughout the class I think it's cloned almost always.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend fixing it but if you don't want to do it now you can merge it anyway.

@novikd
Copy link
Copy Markdown
Member

novikd commented Apr 21, 2023

Also, it seems we don't have a documentation entry for the Hypothesis index. It'd be nice if you write it because at least you now understand how it works.

@robot-clickhouse-ci-2
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-2 commented May 2, 2023

This is an automated comment for commit a397f18 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker server and keeper imagesThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Fast testsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
SQLancerFuzzing tests that detect logical bugs with SQLancer tool✅ success
SqllogicRun clickhouse on the sqllogic test set against sqlite and checks that all statements are passed✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Unit testsRuns the unit tests for different release types✅ success
Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts❌ failure

@robot-clickhouse-ci-1
Copy link
Copy Markdown
Contributor

This is an automated comment for commit 1c7e447 with description of existing statuses. It's updated for the latest CI running
The full report is available here
The overall status of the commit is 🟡 pending

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR🟡 pending
Mergeable CheckChecks if all other necessary checks are successful🟢 success

std::end(active_atomic_formulas),
std::begin(hypothesis),
std::end(hypothesis));
auto ast_filter_node = buildFilterNode(query_info.query);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that this can be wrong in very edge cases because this filter AST part in Analyzer is not valid.

Copy link
Copy Markdown
Member Author

@antonio2368 antonio2368 May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain a bit more? Why wouldn't it be valid?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand we use a query tree that can be different from AST in query_info.query because of optimizations. So in the case of allow_experimental_analyzer you want to use query_info.query_tree to build a filter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antonio2368 I spent some time and understood that the problem is more complex. The filters in the query tree are also not those that we use. For example, it misses row policy filters which are added in the Planner. In the case of a new analyzer, the correct filter is stored in the SelectQueryInfo::filter_actions_dag. So you need to use it.

@qoega qoega added the analyzer Issues and pull-requests related to new analyzer label Jan 9, 2024
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jun 25, 2024

Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@novikd novikd self-assigned this Jun 26, 2024
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jul 30, 2024

Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@alexey-milovidov alexey-milovidov changed the title Hypothesis index in new analyzer Hypothesis index in the analyzer Jul 27, 2025
@antonio2368 antonio2368 marked this pull request as draft November 20, 2025 15:11
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Nov 20, 2025

Workflow [PR], commit [c7ae560]

Summary:

job_name test_name status info comment
Docs check failure
Run markdown linter failure cidb
Build (amd_compat) failure
Cmake configuration failure cidb
Build (s390x) failure
Build ClickHouse failure cidb
Stateless tests (arm_asan, targeted) failure
01459_manual_write_to_replicas_quorum_detach_attach FAIL cidb, flaky
Stateless tests (amd_binary, ParallelReplicas, s3 storage, parallel) failure
01624_soft_constraints FAIL cidb
Stateless tests (amd_msan, parallel, 2/2) failure
Server died FAIL cidb
03009_storage_memory_circ_buffer_usage SERVER_DIED cidb
Sanitizer assert (in stderr.log) FAIL cidb
Killed by signal (in clickhouse-server.log or clickhouse-server.err.log) FAIL cidb
Stateless tests (amd_tsan, s3 storage, parallel) failure
03212_variant_dynamic_cast_or_default FAIL cidb, flaky
Integration tests (amd_tsan, 3/6) failure
test_ssh/test.py::test_paramiko_password FAIL cidb
Stress test (amd_msan) failure
Server died FAIL cidb
Hung check failed, possible deadlock found (see hung_check.log) FAIL cidb
Killed by signal (in clickhouse-server.log) FAIL cidb
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL cidb
Killed by signal (output files) FAIL cidb
Found signal in gdb.log FAIL cidb
BuzzHouse (amd_debug) failure
Logical error: 'Inconsistent AST formatting: the query: FAIL cidb

@antonio2368 antonio2368 force-pushed the analyzer-hypothesis-index branch from 0b15381 to dfc5c72 Compare November 20, 2025 15:23
@antonio2368 antonio2368 force-pushed the analyzer-hypothesis-index branch from dfc5c72 to c7ae560 Compare November 20, 2025 19:08
@antonio2368 antonio2368 requested a review from Copilot November 25, 2025 13:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements hypothesis index support for the analyzer (new query planning infrastructure), enabling it to work with both the legacy AST-based planner and the new analyzer.

Key changes:

  • Refactored hypothesis index implementation to support both AST and ActionsDAG (analyzer) paths using templates
  • Added ActionsDAG-based CNF (Conjunctive Normal Form) conversion infrastructure
  • Implemented semantic hashing and equality for ActionsDAG nodes to enable comparison across different DAG instances
  • Updated ComparisonGraph to work with ActionsDAG nodes in addition to AST nodes

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/queries/0_stateless/01624_soft_constraints.sh Enabled use_skip_indexes_on_data_read setting for test
tests/parallel_replicas_blacklist.txt Removed test from blacklist
tests/analyzer_tech_debt.txt Removed test from analyzer technical debt list
src/Storages/MergeTree/MergeTreeIndices.h Added context parameter to createIndexMergedCondition signature
src/Storages/MergeTree/MergeTreeIndexHypothesisMergedCondition.h Refactored to use PIMPL pattern with abstract interface
src/Storages/MergeTree/MergeTreeIndexHypothesisMergedCondition.cpp Implemented dual-path support (AST/ActionsDAG) using template
src/Storages/MergeTree/MergeTreeIndexHypothesis.h Added context parameter to method signature
src/Storages/MergeTree/MergeTreeIndexHypothesis.cpp Passed context to merged condition constructor
src/Storages/ConstraintsDescription.h Added ActionsDAGData class and method for constraints
src/Storages/ConstraintsDescription.cpp Implemented ActionsDAG constraint data conversion
src/Processors/QueryPlan/ReadFromMergeTree.cpp Passed context when creating index merged conditions
src/Interpreters/TreeCNFConverter.h Moved CNF helper functions to Common/CNFHelpers.h
src/Interpreters/ComparisonGraph.h Extended to support ActionsDAG nodes
src/Interpreters/ComparisonGraph.cpp Added ActionsDAG node semantic comparison and hashing
src/Interpreters/ActionsDAGHashUtils.h New file with ActionsDAG node hashing utilities
src/Interpreters/ActionsDAGCNF.h New file with ActionsDAG CNF class interface
src/Interpreters/ActionsDAGCNF.cpp New file with ActionsDAG CNF implementation
src/Common/CNFHelpers.h New file with shared CNF reduction algorithms
src/Common/CNFAtomicFormula.h New file with generic CNF atomic formula template
src/Analyzer/Passes/CNFAtomicFormula.h Changed to use shared template from Common
src/Analyzer/Passes/CNFAtomicFormula.cpp File removed (logic moved to template)
src/Analyzer/Passes/CNF.h Added iterateGroups method
src/Analyzer/Passes/CNF.cpp Updated include for CNF helpers
docs/en/engines/table-engines/mergetree-family/mergetree.md Added hypothesis index documentation

Comment on lines 1884 to +1885
{
skip_indexes.merged_indices.emplace_back();
skip_indexes.merged_indices.back().condition = index_helper->createIndexMergedCondition(query_info, metadata_snapshot);
if (inserted)
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate condition check: if (inserted) appears twice in a nested manner. The inner check at line 1886 is redundant and the code inside will always execute since it's already within an if (inserted) block. Remove the inner condition.

Copilot uses AI. Check for mistakes.
skip_indexes.merged_indices[it->second].addIndex(index_helper);
}

skip_indexes.merged_indices[it->second].addIndex(index_helper);
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate addIndex call: This line appears both inside the nested if (inserted) block (line 1891) and after it (line 1894), resulting in the index being added twice when inserted is true. Remove one of these duplicate calls.

Suggested change
skip_indexes.merged_indices[it->second].addIndex(index_helper);

Copilot uses AI. Check for mistakes.

We define the following index first
```sql
INDEX t (a < 10) TYPE hypothesis GRANULARITY 1`
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mismatched backtick: The code example has a closing backtick (`) instead of an opening backtick at the end. Should be a regular backtick for opening the code block or removed entirely if it's meant to be inline.

Suggested change
INDEX t (a < 10) TYPE hypothesis GRANULARITY 1`
INDEX t (a < 10) TYPE hypothesis GRANULARITY 1

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +31
if (node_with_hash.hash > rhs.node_with_hash.hash)
return false;

return node_with_hash.hash < rhs.node_with_hash.hash || negative < rhs.negative;
Copy link

Copilot AI Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect comparison logic: Line 28-29 returns false if lhs.hash > rhs.hash, but line 31 then checks lhs.hash < rhs.hash. This means when hashes are equal, only negative < rhs.negative is evaluated. However, the early return on line 29 makes the < rhs.hash check on line 31 redundant. The logic should be: return node_with_hash < rhs.node_with_hash || (node_with_hash == rhs.node_with_hash && negative < rhs.negative); assuming NodeWithHashType has proper comparison operators.

Suggested change
if (node_with_hash.hash > rhs.node_with_hash.hash)
return false;
return node_with_hash.hash < rhs.node_with_hash.hash || negative < rhs.negative;
return node_with_hash < rhs.node_with_hash
|| (node_with_hash == rhs.node_with_hash && negative < rhs.negative);

Copilot uses AI. Check for mistakes.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 27, 2026

Dear @antonio2368, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@alexey-milovidov alexey-milovidov added the close in a month if not active This will be closed in case of no information label Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

analyzer Issues and pull-requests related to new analyzer close in a month if not active This will be closed in case of no information pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants