Skip to content

Inverted Text Index v2#86485

Merged
rschu1ze merged 65 commits intoClickHouse:masterfrom
CurtizJ:make-text-index-great
Sep 17, 2025
Merged

Inverted Text Index v2#86485
rschu1ze merged 65 commits intoClickHouse:masterfrom
CurtizJ:make-text-index-great

Conversation

@CurtizJ
Copy link
Copy Markdown
Member

@CurtizJ CurtizJ commented Sep 1, 2025

Changelog category (leave one):

  • Backward Incompatible Change

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The inverted text index was reworked from scratch to be scalable for datasets that don't fit into RAM.

Resolves #87003.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Sep 1, 2025

Workflow [PR], commit [2ca6b0e]

Summary:

job_name test_name status info comment
Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel) failure
02676_analyzer_limit_offset FAIL
Stateless tests (amd_tsan, sequential, 2/2) failure
03141_fetches_errors_stress FAIL
Integration tests (arm_binary, distributed plan, 1/4) failure
test_storage_rabbitmq/test.py::test_rabbitmq_reject_broken_messages_dead_letter_queue FAIL
test_storage_rabbitmq/test.py::test_rabbitmq_reject_broken_messages_stream FAIL
Stress test (amd_tsan) failure
Server died FAIL
Hung check failed, possible deadlock found (see hung_check.log) FAIL
Killed by signal (in clickhouse-server.log) FAIL
Fatal message in clickhouse-server.log (see fatal_messages.txt) FAIL
Killed by signal (output files) FAIL
Found signal in gdb.log FAIL
Performance Comparison (amd_release, master_head, 1/3) failure
Check Results failure

@clickhouse-gh clickhouse-gh bot added the pr-backward-incompatible Pull request with backwards incompatible changes label Sep 1, 2025
Copy link
Copy Markdown
Member

@Ergus Ergus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out.query_string = std::make_unique<GinQueryString>();
const auto & value = const_value.safeGet<String>();
token_extractor->stringToGinFilter(value.data(), value.size(), *out.query_string);
// const auto & value = const_value.safeGet<String>();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are not supported, so initializing query_string is pointless for them I think.

}
}

if (!index_for_step.empty() && !non_index_column.empty())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC If this happens is already too late and far from the problem source . Could we prevent the issue earlier or at least add a comment on where to look?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do this before returning from getReadTaskColumns.

@CurtizJ

This comment was marked as resolved.

@CurtizJ CurtizJ force-pushed the make-text-index-great branch from 7197270 to 00bfa6f Compare September 11, 2025 21:55
static const String ARGUMENT_TOKENIZER = "tokenizer";
static const String ARGUMENT_NGRAM_SIZE = "ngram_size";
static const String ARGUMENT_SEPARATORS = "separators";
static const String ARGUMENT_DICTIONARY_BLOCK_SIZE = "dictionary_block_size";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once ready for merge, may I ask to update docs/en/engines/table-engines/mergetree-family/invertedindexes.md

  • remove mentions of deleted index parameter segment_digestion_threshold_bytes
  • add mentino of the two new index parameters (can be brief, so we at least don't forget).

Copy link
Copy Markdown
Member Author

@CurtizJ CurtizJ Sep 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to adjust the documentation, let's do it in a separate PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, I'll do that in #86896

@rschu1ze rschu1ze marked this pull request as ready for review September 17, 2025 08:52
@rschu1ze
Copy link
Copy Markdown
Member

Stateless tests (amd_debug, AsyncInsert, s3 storage, parallel):

  • 02676_analyzer_limit_offset: fails with the same error very sporadically in master too. We got unlucky.

Stateless tests (amd_tsan, sequential, 2/2):

Integration tests (arm_binary, distributed plan, 1/4):

  • test_storage_rabbitmq: timeout, infra issue

Stress test (amd_tsan):

Performance Comparison (amd_release, master_head, 1/3):

  • Error message in logs/text_index-err.log: DB::Exception: Unknown setting 'query_plan_direct_read_from_text_index'. Stack trace: ... this error is expected to happen on the previous version.

@rschu1ze rschu1ze added this pull request to the merge queue Sep 17, 2025
Merged via the queue into ClickHouse:master with commit 2fdffe4 Sep 17, 2025
117 of 123 checks passed
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 17, 2025
}
}

storage_snapshot = std::make_shared<StorageSnapshot>(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely break enable_shared_storage_snapshot_in_query = 1 #79471 . I ran into the same issue when developing projection index, and I also haven’t found a good solution yet.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backward-incompatible Pull request with backwards incompatible changes pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC] Disk friendly format for inverted text index

10 participants