Skip to content

Forbid cleaning of tmp directories that can be used by an active mutation/merge.#28760

Merged
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
azat:mutator-forbid-cleaner
Oct 17, 2021
Merged

Forbid cleaning of tmp directories that can be used by an active mutation/merge.#28760
alexey-milovidov merged 2 commits intoClickHouse:masterfrom
azat:mutator-forbid-cleaner

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Sep 8, 2021

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Forbid cleaning of tmp directories that can be used by an active mutation/merge.

Detailed description / Documentation draft:
Mutator cannot be run in parallel with cleaner,
since cleaner may remove partial result of a merge,
and chances of this will be increase if:

  • temporary_directories_lifetime is pretty low,
  • someone adjust system-wide clock.

And these will lead to missing columns in these part,
and the server will figure out that something is wrong only on DETACH/ATTACH.

NOTE: tested manually

@robot-clickhouse robot-clickhouse added pr-bugfix Pull request with bugfix, not backported by default pr-improvement Pull request with some product improvements and removed pr-bugfix Pull request with bugfix, not backported by default labels Sep 8, 2021
@alexey-milovidov
Copy link
Copy Markdown
Member

cleaner may remove partial result of a merge

Only happens if no files were written during temporary_directories_lifetime (24 hours by default).

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@alexey-milovidov alexey-milovidov self-assigned this Sep 8, 2021
@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 8, 2021

Only happens if no files were written during temporary_directories_lifetime (24 hours by default).

Yes, but:

  • someone may set it to something low
  • someone may adjust system time (i.e. after hardware replacement at start time will be far away, but after a short period of time NTP will adjust it)

@azat azat changed the title Forbid cleaning of old tmp directories in parallel with mutations/merges. Forbid cleaning of tmp directories that can be used by an active mutation/merge. Sep 10, 2021
@azat azat force-pushed the mutator-forbid-cleaner branch 2 times, most recently from 5092e39 to 321e847 Compare September 11, 2021 06:20
@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 11, 2021

Functional stateless tests (release, DatabaseReplicated) — fail: 1, passed: 3238, skipped: 117

2021-09-11 03:55:27 [e010d80598a5] 2021.09.11 03:55:26.451731 [ 30617 ] {42848bab-ffaf-43b9-aacf-d71f8dfe972e} <Error> executeQuery: Code: 27. DB::ParsingException: Cannot parse input: expected 'version: ' at end of stream. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 21.11.1.8053) (from [::1]:46266) (comment: '/usr/share/clickhouse-test/queries/0_stateless/01161_all_system_tables.sh') (in query: SELECT * FROM system.distributed_ddl_queue LIMIT 10000 FORMAT Null), Stack trace (when copying this message, always include the lines below):

Unrelated, but interesting, sure right now there are no such nodes (for p in $(zk-shell 127.1:2182 --run-once 'ls /clickhouse/task_queue/ddl'); do zk-shell 127.1:2182 --run-once "json_cat /clickhouse/task_queue/ddl/$p" | grep -q version || echo $p; done), but what it was?
Create node and put the node's content should be atomic, right?
@alesapin maybe it is some issue in keeper?

@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 18, 2021

Functional stateless tests (release, DatabaseReplicated) — fail: 1, passed: 3265, skipped: 117

01161_all_system_tables

2021-09-17 13:35:11 [87453c98cfd3] 2021.09.17 13:35:10.931973 [ 25922 ] {af4987c4-a280-4211-b228-464f626afe98} <Error> executeQuery: Code: 27. DB::ParsingException: Cannot parse input: expected 'version: ' at end of stream. (CANNOT_PARSE_INPUT_ASSERTION_FAILED) (version 21.11.1.8109) (from [::1]:58870) (comment: '/usr/share/clickhouse-test/queries/0_stateless/01161_all_system_tables.sh') (in query: SELECT * FROM system.distributed_ddl_queue LIMIT 10000 FORMAT Null), Stack trace (when copying this message, always include the lines below):

Looks like should be fixed by #29061, in particular https://github.com/ClickHouse/ClickHouse/pull/29061/files#diff-27b2fdac05cd0b4e165bebc7509be60eaeb37a260ffb35cc69f5614e3f81c3d8R223-R228 (since asyncTryGet does not throw on ZNONODE``), @tavplubix please correct me if I'm wrong.

@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 23, 2021

@alexey-milovidov can you do a final look please?

@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 26, 2021

Functional stateless tests (release, DatabaseReplicated) — fail: 1, passed: 3265, skipped: 117

Fixed in #29061 (and it is already merged, anyway this PR should not be related to the failure of 01161_all_system_tables)

@azat azat force-pushed the mutator-forbid-cleaner branch from be8d84f to 20bdda6 Compare September 27, 2021 21:31
@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 27, 2021

Conflicting files
src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp

Rebased to fix conflicts.

@azat
Copy link
Copy Markdown
Member Author

azat commented Sep 28, 2021

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Sep 28, 2021

CLA assistant check
All committers have signed the CLA.

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 12, 2021

Conflicting files
src/Storages/MergeTree/MergeTask.cpp

Let me know if conflicts should be resolved for review.

@azat azat force-pushed the mutator-forbid-cleaner branch from 20bdda6 to 405d537 Compare October 15, 2021 08:20
azat added 2 commits October 16, 2021 00:43
…merge/mutation

v2: rebase against MergeTask
v3: rebase due to conflicts in src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp
v4:
- rebase due to conflicts in src/Storages/MergeTree/MergeTask.cpp
- drop common/scope_guard_safe.h (not used)
@azat azat force-pushed the mutator-forbid-cleaner branch from 405d537 to 07e8b2b Compare October 15, 2021 21:44
@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 16, 2021

Functional stateless tests (release, wide parts enabled) — fail: 1, passed: 3463, skipped: 2
Functional stateless tests (thread) — fail: 1, passed: 3438, skipped: 27

Stress test (thread) — Test script failed

Code: 241. DB::Exception: Received from localhost:9000. DB::Exception: Memory limit (total) exceeded: would use 120.69 GiB (attempt to allocate chunk of 2105374 bytes), maximum: 94.44 GiB. (MEMORY_LIMIT_EXCEEDED)

Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexey-milovidov alexey-milovidov merged commit 4f11cfa into ClickHouse:master Oct 17, 2021
@azat azat deleted the mutator-forbid-cleaner branch October 17, 2021 06:19
azat added a commit to azat/ClickHouse that referenced this pull request Mar 8, 2022
In ClickHouse#33291 final part commit had been defered, and now it can take
significantly more time, that may lead to "Part directory doesn't exist"
error during INSERT:

    2022.02.21 18:18:06.979881 [ 11329 ] {insert} <Debug> executeQuery: (from 127.1:24572, user: default) INSERT INTO db.table (...) VALUES
    2022.02.21 20:58:03.933593 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18044_18044_0 to 20220214_270654_270654_0.
    2022.02.21 21:16:50.961917 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18197_18197_0 to 20220214_270689_270689_0.
    ...
    2022.02.22 21:16:57.632221 [ 64878 ] {} <Warning> db.table: Removing temporary directory /clickhouse/data/db/table/tmp_insert_20220214_18232_18232_0/
    ...
    2022.02.23 12:23:56.277480 [ 11329 ] {insert} <Trace> db.table: Renaming temporary part tmp_insert_20220214_18232_18232_0 to 20220214_273459_273459_0.
    2022.02.23 12:23:56.299218 [ 11329 ] {insert} <Error> executeQuery: Code: 107. DB::Exception: Part directory /clickhouse/data/db/table/tmp_insert_20220214_18232_18232_0/ doesn't exist. Most likely it is a logical error. (FILE_DOESNT_EXIST) (version 22.2.1.1) (from 127.1:24572) (in query: INSERT INTO db.table (...) VALUES), Stack trace (when copying this message, always include the lines below):

Follow-up for: ClickHouse#28760
Refs: ClickHouse#33291

Signed-off-by: Azat Khuzhin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants