Skip to content

Projection Index Step 1: Support _part_offset in normal projections#78429

Merged
KochetovNicolai merged 25 commits intoClickHouse:masterfrom
amosbird:projection-index-1
Apr 21, 2025
Merged

Projection Index Step 1: Support _part_offset in normal projections#78429
KochetovNicolai merged 25 commits intoClickHouse:masterfrom
amosbird:projection-index-1

Conversation

@amosbird
Copy link
Copy Markdown
Collaborator

@amosbird amosbird commented Mar 30, 2025

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow to specify _part_offset in normal projection. This is the first step to build projection index. It can be used with #58224 and can help improve #63207

This PR also adds some part-level constant virtual columns to projection metadata such as _part, _partition_id, etc.

This PR also adds mergeTreeProjection table function to read from projection parts directly. It is used for introspection.

This PR also uses a better way to construct projection metadata without relying on TemporaryTableHolder. This closes #74853. (Issue is too minor to warrant a dedicated bugfix PR on its own, but worth addressing while we're modifying related code)

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 30, 2025

Workflow [PR], commit [055406d]

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Mar 30, 2025
@amosbird amosbird force-pushed the projection-index-1 branch from a15e784 to c4592e6 Compare March 30, 2025 13:25
@UnamedRus
Copy link
Copy Markdown
Contributor

UnamedRus commented Mar 30, 2025

This PR also adds mergeTreeProjection table function to read from projection parts directly. It is used for introspection.

TBH, there is actually requests/cases when people want to read projection data directly.
#35017

@amosbird amosbird force-pushed the projection-index-1 branch from ca83047 to 1b23e03 Compare March 31, 2025 06:51
@KochetovNicolai KochetovNicolai self-assigned this Mar 31, 2025
@amosbird amosbird force-pushed the projection-index-1 branch from 1b23e03 to b0fab7f Compare April 1, 2025 13:14
@EmeraldShift
Copy link
Copy Markdown
Contributor

In the example query:

SELECT
    count()
FROM events
WHERE (_part, _part_offset) IN (
    SELECT _part, _part_offset
    FROM events
    WHERE user_id = 42
)

what happens if there's a merge between executing the inner SELECT and the outer one? Could it return a _part that is not in the new merged table, and then the results are incomplete? I thought if the two queries used different snapshots than this wouldn't work. I'd be very happy to be wrong! If it means that this query will just work once this is merged, then I'm super excited!

@amosbird
Copy link
Copy Markdown
Collaborator Author

amosbird commented Apr 6, 2025

what happens if there's a merge between executing the inner SELECT and the outer one?

It currently uses different snapshots, which can lead to the expected issues. This will be addressed in the next phase of the projection index, where we will avoid relying on subqueries to index rows during queries.

@amosbird amosbird requested a review from Copilot April 6, 2025 15:48
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 11 out of 29 changed files in this pull request and generated no comments.

Files not reviewed (18)
  • ci/jobs/scripts/check_style/aspell-ignore/en/aspell-dict.txt: Language not supported
  • src/Interpreters/MutationsInterpreter.cpp: Language not supported
  • src/Processors/QueryPlan/Optimizations/optimizeUseNormalProjection.cpp: Language not supported
  • src/Storages/MergeTree/MergeTask.cpp: Language not supported
  • src/Storages/MergeTree/MergeTask.h: Language not supported
  • src/Storages/MergeTree/MergeTreeData.cpp: Language not supported
  • src/Storages/MergeTree/MergeTreeData.h: Language not supported
  • src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp: Language not supported
  • src/Storages/MergeTree/MergeTreeDataWriter.cpp: Language not supported
  • src/Storages/MergeTree/MergeTreeReadTask.h: Language not supported
  • src/Storages/MergeTree/MergeTreeSequentialSource.cpp: Language not supported
  • src/Storages/MergeTree/MergeTreeSequentialSource.h: Language not supported
  • src/Storages/MergeTree/MergeTreeVirtualColumns.cpp: Language not supported
  • src/Storages/MergeTree/MergeTreeVirtualColumns.h: Language not supported
  • src/Storages/MergeTree/MergedPartOffsets.h: Language not supported
  • src/Storages/MergeTree/StorageFromMergeTreeProjection.cpp: Language not supported
  • src/Storages/MergeTree/StorageFromMergeTreeProjection.h: Language not supported
  • src/Storages/ProjectionsDescription.cpp: Language not supported

@amosbird amosbird force-pushed the projection-index-1 branch from 3bf8bd9 to 26a0192 Compare April 11, 2025 15:37
@amosbird
Copy link
Copy Markdown
Collaborator Author

test_keeper_auth/test.py::test_partial_auth[get_genuine_zk] #78981

test_storage_rabbitmq/test.py::test_rabbitmq_tsv_with_delimiter #71049

I cannot find report for TSan crash in 01154_move_partition_long, but it exists in many other PR's CI runs too.

@amosbird amosbird force-pushed the projection-index-1 branch 2 times, most recently from 2a56cf1 to 416b01f Compare April 16, 2025 05:20
@amosbird amosbird force-pushed the projection-index-1 branch from a010da8 to 782b09b Compare April 19, 2025 04:31
@KochetovNicolai
Copy link
Copy Markdown
Member

Fuzzer:

SELECT (1 OR (isNullable(1) OR ((1 OR (8 OR (toNullable(8) IS NULL) OR 1 OR 8 OR isZeroOrNull(materialize(8))) OR 1 OR (NOT 19)) < 0) OR 8)) < 0, json.a.r.:`Array(JSON)`.^b, *, NOT isNullable(materialize(76)) FROM test PREWHERE (NOT 19) OR 1 OR ((1 OR (8 OR (0 <= ((toLowCardinality(8) OR 1 OR 8 OR (8 IS NULL)) OR 1 OR (NOT 19) OR toNullable(1))))) >= 0) QUALIFY materialize(19) ORDER BY 4 ASC, id DESC NULLS LAST FORMAT `Null`

not related.

@EmeraldShift
Copy link
Copy Markdown
Contributor

I'm seeing some strange behavior with a projection containing _part_offset when parts are merged. It looks like the _parent_part_offset column stays equal to the original offset before the merge, while _part has the expected behavior of changing to the new higher-level part. See fiddle:

CREATE TABLE t (id UInt32, projection by_id (select _part_offset order by id)) ENGINE=MergeTree ORDER BY tuple();

-- Insert a lot of rows so background merges happen automatically
INSERT INTO t WITH 100000000 AS n SELECT n - number FROM numbers(n);

CREATE OR REPLACE VIEW TableUsingProjection AS
SELECT *, _part, _part_offset FROM t WHERE (_part, _part_offset) IN (SELECT (_part, _part_offset) FROM t WHERE id = 1);
CREATE OR REPLACE VIEW IntrospectProjection AS
SELECT *, _part FROM mergeTreeProjection(currentDatabase(), t, by_id) WHERE id = 1;

-- Before some automatic merges
SELECT * FROM TableUsingProjection;
SELECT * FROM IntrospectProjection;

-- Give some time for merges (non-determinstic)
SELECT sleep(3);

-- After some automatic merges
SELECT * FROM TableUsingProjection;
SELECT * FROM IntrospectProjection;

it's not guaranteed due to timing of merges, but I saw a result like this:

   +-id-+-_part-------+-_part_offset-+
1. |  1 | all_90_90_0 |      1036182 |
   +----+-------------+--------------+
   +-id-+-_parent_part_offset-+-_part-------+
1. |  1 |             1036182 | all_90_90_0 |
   +----+---------------------+-------------+
   +-sleep(3)-+
1. |        0 |
   +----------+
   +------id-+-_part-------+-_part_offset-+
1. | 5559766 | all_85_90_1 |      1036182 |
   +---------+-------------+--------------+
   +-id-+-_parent_part_offset-+-_part-------+
1. |  1 |             1036182 | all_85_90_1 |
   +----+---------------------+-------------+

Notably, I cannot reproduce the problem using OPTIMIZE TABLE t FINAL to force merges. When I do this, the _parent_part_offset column correctly updates to the new offset.

Is this expected given the current in-progress state of projection indexes?

@amosbird
Copy link
Copy Markdown
Collaborator Author

amosbird commented Jun 2, 2025

@EmeraldShift The issue stems from MergeTree tables that lack a sorting key. While optimizations are intended for this case, they haven't been fully implemented yet. I’ll address it later this week.

@EmeraldShift
Copy link
Copy Markdown
Contributor

Oh, interesting. I hadn't considered using tuple() for testing would change the behavior. For what it's worth, I'm not planning on using any unsorted tables in production. I'll keep playing around with tables with a sort key.

@amosbird
Copy link
Copy Markdown
Collaborator Author

amosbird commented Jun 3, 2025

@EmeraldShift Fixed in #81188.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ALTER column on temporary table with projection forever stuck

7 participants