Skip to content

Be more graceful with existing tables with inverted indexes#64656

Merged
rschu1ze merged 4 commits intoClickHouse:masterfrom
rschu1ze:deprecated-inverted-index
Jun 4, 2024
Merged

Be more graceful with existing tables with inverted indexes#64656
rschu1ze merged 4 commits intoClickHouse:masterfrom
rschu1ze:deprecated-inverted-index

Conversation

@rschu1ze
Copy link
Copy Markdown
Member

@rschu1ze rschu1ze commented May 30, 2024

#62884 renamed index type inverted to full_text. As a result, tables with existing indexes of type index could no longer be loaded. When the load happens during database startup (this is the default behavior), the database throws an exception Unknown Index type 'inverted' and immediately shut down due to corrupt metadata. One could argue that these indexes are experimental, but the behavior is still painful.

This PR re-enables loading of tables with inverted indexes but these indexes will throw an exception (Please drop the index and re-create it) as soon as they are used by a SELECT query.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@rschu1ze rschu1ze added the pr-must-backport Pull request should be backported intentionally. Use this label with great care! label May 30, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-not-for-changelog This PR should not be mentioned in the changelog label May 30, 2024
@robot-ch-test-poll
Copy link
Copy Markdown
Contributor

robot-ch-test-poll commented May 30, 2024

This is an automated comment for commit ef0bff6 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR❌ failure
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Successful checks
Check nameDescriptionStatus
A SyncIf it fails, ask a maintainer for help✅ success
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
PR CheckThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@rschu1ze rschu1ze force-pushed the deprecated-inverted-index branch from eef96bb to 344314d Compare May 31, 2024 15:40
@rschu1ze rschu1ze force-pushed the deprecated-inverted-index branch from 344314d to d5ec72d Compare May 31, 2024 15:50
@antaljanosbenjamin antaljanosbenjamin self-assigned this Jun 3, 2024
/// Index type 'inverted' was renamed to 'full_text' in May 2024.
/// Tables with old indexes can be loaded during a transition period. We still want let users know that they should drop existing
/// indexes and re-create them. Function `createIndexGranule` is called whenever the index is used by queries. Reject the query if we
/// are an old index.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// are an old index.
/// have an old index.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +15 to +16
SET allow_experimental_inverted_index = 0;
CREATE TABLE tab(k UInt64, s String, INDEX idx(s) TYPE inverted(2)) ENGINE = MergeTree() ORDER BY k;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks bad in my opinion. It is not documented, but if somebody is sloppy, they can create a table with inverted indices, even though they should use full_text. Cannot we have the backdoor only for attach queries? Then if I am not mistaken we should be able to attach tables with inverted indices, but won't be able to create new ones.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antaljanosbenjamin I fully agree with you but only in a perfect world.

Consider this: Inverted indexes are experimental, enabling the corresponding setting is only fine for testing. Even while testing, anything can - in theory - happen, data loss, corruption, etc. From a product perspective, breaking experimental features is acceptable (it is not nice though).

My expectation is that nobody in their right mind uses inverted indexes in production. The number of affected users should therefore converge towards zero. This PR accommodates for the handful of users where this assumption does not hold. Consider this PR more of a quickfix in a situation where a real fix (i.e. what you propose) isn't worth the trouble as the addressed problems happens very rarely right now and ceases to happen in future. It will cease to happen in future because the backdoor isn't documented and it will be disabled after a transition period at the end of the year.

So in some sense, this PR isn't perfect but it improves over simply breaking old persistences ...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this PR isn't perfect but it improves over simply breaking old persistences

Okay, this is true, so I am okay merging this as is.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschu1ze I came across this PR and I saw that you already discussed this, but I have one suggestion, not sure have you considered it or not, so what do you think about the following:

  • make allow_experimental_inverted_index an alias for allow_experimental_full_text_index (that way clickhouse-client will report it as a warning, though not a lot of users look into them, but if we are talking about development env, then maybe the situation is different)
  • keep throwing exception on SELECT that uses "inverted" indexes (so that users have to migrate to "fulltext")
  • allow attaching "inverted" indexes (to make server start)
  • forbid creating new tables with "inverted" indexes (to avoid such tables in future)

Since right now it is still possible to create this deprecated index (and only with allow_experimental_inverted_index=0) which at least looks odd (even though SELECT is forbidden).

Plus if your workaround will be kept till the end of the year, this time should be enough for all users of this experimental feature to migrate to new index (for whose who do update ClickHouse). And also that way it will be done in the same fashion as for other futures.

Thoughts?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@azat I pushed a follow-up to this yesterday. The situation based on this PR and the follow-up is actually similar to what you describe:

  • creating new inverted indexes is still possible (assuming allow_experimental_inverted_index = 1)
  • inverted indexes can be attached (again assuming allow_experimental_inverted_index = 1)
  • to encourage migration: 1. an exception is thrown during usage (making newly created and attached indexes useless) and 2. the docs no longer mention allow_experimental_inverted_index.

About warnings shown by clickhouse-client: Yes, developers and CLI people may check. Any GUI or driver that connects to ClickHouse via HTTP will not see the warning and therefore not bother 😢

I really appreciate your thoughts but in sum, I guess we think too much about gracefully migrating an experimental feature. The number of production use cases is expected to converge to zero. The main reason for pushing this PR and the follow-up is that ClickHouse server refused to start when it encounters old inverted indexes (see the PR message) - as long as this blocker is removed I'd be good.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

creating new inverted indexes is still possible (assuming allow_experimental_inverted_index = 1)

This still looks odd, what is the reason to allow this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No good reason except that I made this PR in a bit of a hurry and did not find out quickly where to distinguish table "creation" vs. "attaching" in the code. For a non-experimental feature, that's imho okay.

@rschu1ze
Copy link
Copy Markdown
Member Author

rschu1ze commented Jun 4, 2024

Integration tests (aarch64) [5/6]: #64451

A sync is green.

Lgtm, forcing MergeableCheck green now.

@rschu1ze rschu1ze added this pull request to the merge queue Jun 4, 2024
Merged via the queue into ClickHouse:master with commit 9cf8bdc Jun 4, 2024
@rschu1ze rschu1ze deleted the deprecated-inverted-index branch June 4, 2024 19:24
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 4, 2024
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 4, 2024
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created-cloud deprecated label, NOOP label Jun 4, 2024
@nikitamikhaylov nikitamikhaylov removed the pr-backports-created-cloud deprecated label, NOOP label Jun 5, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created-cloud deprecated label, NOOP label Jun 5, 2024
@nikitamikhaylov nikitamikhaylov removed pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP labels Jun 5, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created-cloud deprecated label, NOOP label Jun 5, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 5, 2024
azat added a commit to azat/ClickHouse that referenced this pull request Jun 6, 2024
@Felixoid Felixoid removed pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP labels Jun 7, 2024
robot-clickhouse added a commit that referenced this pull request Jun 7, 2024
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 7, 2024
robot-clickhouse-ci-1 added a commit that referenced this pull request Jun 7, 2024
Backport #64656 to 24.5: Be more graceful with existing tables with `inverted` indexes
@robot-ch-test-poll robot-ch-test-poll added the pr-backports-created-cloud deprecated label, NOOP label Jun 8, 2024
@maxknv maxknv removed pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP labels Jun 8, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created-cloud deprecated label, NOOP label Jun 8, 2024
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 8, 2024
rschu1ze added a commit to rschu1ze/ClickHouse that referenced this pull request Jun 10, 2024
Index types 'annoy' and 'usearch' were removed and replaced by
'vector_similarity' indexes in an earlier commit.

This means unfortuantely, that if customers have tables with these
indexes and upgrade, their database might not start anymore - the
system loads the metadata at startup, thinks something is wrong with
such tables, and halts immediately.

This commit adds support for loading and attaching such indexes back.
Data insert or use (search) return an error which recommends a migration
to 'vector_similarity' indexes. The implementation is generally similar
to what has recently been implemented for 'full_text' indexes [1, 2].

[1] ClickHouse#64656
[2] ClickHouse#64846
rschu1ze added a commit to rschu1ze/ClickHouse that referenced this pull request Aug 12, 2024
Index types 'annoy' and 'usearch' were removed and replaced by
'vector_similarity' indexes in an earlier commit.

This means unfortuantely, that if customers have tables with these
indexes and upgrade, their database might not start anymore - the
system loads the metadata at startup, thinks something is wrong with
such tables, and halts immediately.

This commit adds support for loading and attaching such indexes back.
Data insert or use (search) return an error which recommends a migration
to 'vector_similarity' indexes. The implementation is generally similar
to what has recently been implemented for 'full_text' indexes [1, 2].

[1] ClickHouse#64656
[2] ClickHouse#64846
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-backports-created-cloud deprecated label, NOOP pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.