Skip to content

insertion deduplication on retries for materialised views#61601

Merged
CheSema merged 72 commits intomasterfrom
chesema-dedup-matview
Jul 4, 2024
Merged

insertion deduplication on retries for materialised views#61601
CheSema merged 72 commits intomasterfrom
chesema-dedup-matview

Conversation

@CheSema
Copy link
Copy Markdown
Member

@CheSema CheSema commented Mar 19, 2024

Implements ideas from #60008
Docs in progress ClickHouse/clickhouse-docs#2394

I improved deduplication by enhancing annotation of chunks on a pipeline level.
Now, each chunk could have several attached structures with base class ChunkInfo which are differ by the derived type. That annotation is passing with the chunks through the Processors. See Chunk::ChunkInfoCollection, CollectionOfDerivedItems<ChunkInfo>.

The deduplication token for each chunk is written as TokenInfo (derived class from ChunkInfo) with SetInitialTokenTransform. After that token could be updated. See DeduplicationToken::TokenInfo::BuildingStage.

Initial value for TokenInfo is taken either from insert_deduplication_token setting or it is calculated as a hash from inserted data.
In order to distinguish equal blocks which should not be deduplicated, TokenInfo is update with more detailed information about the source of the data, like the names of MV on the way to the table.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PR changes how deduplication for MV works.
Fixed a lot of cases like:

  • on destination table: data is split for 2 or more blocks and that blocks is considered as duplicate when that block is inserted in parallel.
  • on MV destination table: the equal blocks are deduplicated, that happens when MV often produces equal data as a result for different input data due to performing aggregation.
  • on MV destination table: the equal blocks which comes from different MV are deduplicated

Settings update_insert_deduplication_token_in_dependent_materialized_views is depricated. The deduplicated token for inserted blocks in MV is calculated based on source data. Always.

@CheSema CheSema added the do not test disable testing on pull request label Mar 19, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-improvement Pull request with some product improvements label Mar 19, 2024
@robot-ch-test-poll
Copy link
Copy Markdown
Contributor

robot-ch-test-poll commented Mar 19, 2024

This is an automated comment for commit 438fd89 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc❌ failure
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests❌ failure
Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

@CheSema CheSema force-pushed the chesema-dedup-matview branch from eeb0b2a to 750cf61 Compare March 22, 2024 16:50
@CheSema CheSema changed the title deduplication: mt|rmt x sync|async x token|hash deduplication: mt|rmt x single-thread|multi-thread x token|hash Mar 25, 2024
@CheSema CheSema force-pushed the chesema-dedup-matview branch 4 times, most recently from 1f3ab82 to b2b4cbf Compare April 20, 2024 00:19
@CheSema CheSema removed the do not test disable testing on pull request label Apr 20, 2024
@CheSema CheSema force-pushed the chesema-dedup-matview branch from 6b57e35 to 3436beb Compare April 20, 2024 15:44
@CheSema CheSema force-pushed the chesema-dedup-matview branch 4 times, most recently from c828bea to 9cf75d9 Compare May 2, 2024 17:52
@CheSema CheSema force-pushed the chesema-dedup-matview branch from c31ad5f to 4fddb9a Compare May 7, 2024 10:20
@kssenii kssenii self-assigned this May 7, 2024
@CheSema CheSema force-pushed the chesema-dedup-matview branch from 4fddb9a to cb94ff8 Compare May 17, 2024 14:26
@CheSema
Copy link
Copy Markdown
Member Author

CheSema commented Jul 4, 2024

The only concern here is test_storage_s3_queue/test.py::test_multiple_tables_streaming_sync_distributed[ordered]
It failed in this PR for the first time on insignificant change.

That test flasks in other PR with no clear relation to the changes
https://play.clickhouse.com/play?user=play#c2VsZWN0IAp0b1N0YXJ0T2ZIb3VyKGNoZWNrX3N0YXJ0X3RpbWUpIGFzIGQsCmNvdW50KCksICBncm91cFVuaXFBcnJheShwdWxsX3JlcXVlc3RfbnVtYmVyKSwgIGFueShyZXBvcnRfdXJsKQpmcm9tIGNoZWNrcyB3aGVyZSAnMjAyNC0wMS0wMScgPD0gY2hlY2tfc3RhcnRfdGltZSBhbmQgdGVzdF9uYW1lIGxpa2UgJyV0ZXN0X3N0b3JhZ2VfczNfcXVldWUvdGVzdC5weTo6dGVzdF9tdWx0aXBsZV90YWJsZXNfc3RyZWFtaW5nX3N5bmNfZGlzdHJpYnV0ZWRbb3JkZXJlZF0lJyBhbmQgdGVzdF9zdGF0dXMgaW4gKCdGQUlMJywgJ0ZMQUtZJykgZ3JvdXAgYnkgZCBvcmRlciBieSBkIGRlc2M=

From the test logic, it requires that the both S3Queue from different nodes read some files. From my point of view it is a race condition, it is legal that some times only the one of the two nodes processed all the files.

@CheSema CheSema added this pull request to the merge queue Jul 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 4, 2024
@CheSema
Copy link
Copy Markdown
Member Author

CheSema commented Jul 4, 2024

03172_error_log_table_not_empty -- is flaky, fixing it in #66093

@CheSema CheSema added this pull request to the merge queue Jul 4, 2024
Merged via the queue into master with commit 1981640 Jul 4, 2024
@CheSema CheSema deleted the chesema-dedup-matview branch July 4, 2024 14:07
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 4, 2024
baibaichen added a commit to Kyligence/gluten that referenced this pull request Jul 5, 2024
baibaichen added a commit to Kyligence/gluten that referenced this pull request Jul 5, 2024
baibaichen added a commit to apache/gluten that referenced this pull request Jul 5, 2024
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20240705)

* Fix build due to ClickHouse/ClickHouse#61601

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang Chen <[email protected]>
@CheSema
Copy link
Copy Markdown
Member Author

CheSema commented Jul 5, 2024

It is interesting

chassert(isUniqTypes()); --
https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stress_test__tsan_.html
#66122

Upgrade Check --
Changed settings are not reflected in settings changes history (see changed_settings.txt):
update_insert_deduplication_token_in_dependent_materialized_views

01275_parallel_mv -- flaks
https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stateless_tests__aarch64_.html

I did not see it in CI here!

@Algunenano
Copy link
Copy Markdown
Member

It is interesting

You only run a partial CI, not full. 73 successful, 4 skipped, and 8 failing checks Test_3 was never run, because Test_2 was never fully successful.

@Algunenano
Copy link
Copy Markdown
Member

00002_log_and_exception_messages_formatting was also affected by the introduction of a new noisy log debug: {}, token: {}

@CheSema
Copy link
Copy Markdown
Member Author

CheSema commented Jul 5, 2024

It is interesting

You only run a partial CI, not full. 73 successful, 4 skipped, and 8 failing checks Test_3 was never run, because Test_2 was never fully successful.

That is sad. I'm reverting this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants