
Replace lost parts with empty parts instead of hacking replication queue #25820

Merged: alesapin merged 13 commits into master from better_remove_empty_parts on Jul 4, 2021
Conversation

@alesapin
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Better handling of lost parts for ReplicatedMergeTree tables. Fixes rare inconsistencies in ReplicationQueue. Nothing should be visible to the user. Fixes #10368.
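As an illustration only (not part of the PR description), the symptom from #10368 typically shows up as replication queue entries stuck with NO_REPLICA_HAS_PART errors; a query such as the following against `system.replication_queue` can surface them:

```sql
-- Illustrative sketch: queue entries stuck because no replica has the part or a
-- covering part; this PR resolves such cases by creating an empty replacement
-- part instead of patching the replication queue.
SELECT database, table, replica_name, type, new_part_name, num_tries, last_exception
FROM system.replication_queue
WHERE last_exception LIKE '%NO_REPLICA_HAS_PART%';
```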

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jun 29, 2021
@alesapin
Member Author

Will add only basic integration tests. Our stress/trash tests will catch other errors.

@tavplubix tavplubix self-assigned this Jun 29, 2021
@alesapin alesapin marked this pull request as ready for review June 30, 2021 08:22
@alesapin
Member Author

@Mergifyio update

@mergify
Contributor

mergify bot commented Jun 30, 2021

Command update: success

Branch has been successfully updated

@alesapin
Member Author

One more time

@alesapin
Member Author

alesapin commented Jul 1, 2021

@Mergifyio update
one more time

@mergify
Contributor

mergify bot commented Jul 1, 2021

Command update: success

Branch has been successfully updated

@alesapin
Member Author

alesapin commented Jul 2, 2021

Got it:
2021.07.02 14:41:29.809067 [ 250 ] {} <Fatal> : Logical error: 'Tried to create empty part 5_58_63_1, but it replaces existing parts 5_58_58_0, 5_59_59_0, 5_60_60_0, 5_61_61_0.'.

@alesapin
Member Author

alesapin commented Jul 2, 2021

01268_procfs_metrics?

@alesapin
Member Author

alesapin commented Jul 3, 2021

@Mergifyio update
last time

@mergify
Contributor

mergify bot commented Jul 3, 2021

Command update: success

Branch has been successfully updated

@alesapin
Member Author

alesapin commented Jul 4, 2021

test_replicated_mutations/test.py::test_mutations -- flaky in master.
01936_quantiles_cannot_return_null -- unrelated to changes.

@alesapin alesapin merged commit cf2fc94 into master Jul 4, 2021
@alesapin alesapin deleted the better_remove_empty_parts branch July 4, 2021 15:55
nvartolomei added a commit to nvartolomei/ClickHouse that referenced this pull request Jul 22, 2021
This was introduced in ClickHouse#8602.
The idea was to avoid data re-appearing in ClickHouse after DROP/DETACH
PARTITION. This problem was only present in the MergeTree engine, and I don't
understand why we need to do the same in ReplicatedMergeTree.

For ReplicatedMergeTree the source of truth is stored in ZK; deleting
things from the filesystem just introduces inconsistencies, and this is the
main source of errors like "No active replica has part X or covering
part".

The resulting problem is fixed by
ClickHouse#25820, but in my opinion
it would be better to avoid introducing the ZK/FS inconsistency in the first
place.

When does this inconsistency appear? Often the sequence is like this:

0. Write 2 parts to ZK [all_0_0_0, all_1_1_0]
1. A merge gets scheduled
2. New part replaces old parts [new: all_0_1_1, old: all_0_0_0, all_1_1_0]
3. Replica gets shut down and old parts are removed from the filesystem
4. Replica comes back online; metadata about all parts is still stored in ZK for this replica.
5. The other replica, after the cleanup thread runs, will have only [all_0_1_1] in ZK
6. User triggers a DROP_RANGE after a while (the drop range is for all_0_1_9999*)
7. Each replica deletes from ZK only [all_0_1_1]. The replica that got
   restarted uses its in-memory state to choose nodes to delete from ZK.
8. Restart the replica again. It will now think that there are 2 parts
   that it lost and needs to fetch them [all_0_0_0, all_1_1_0].
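
As a hedged illustration (not from the original commit; it uses the `/clickhouse/test` ZooKeeper path and `test` table from the reproduction below), the mismatch can be observed by comparing the parts a replica has registered in ZK with the parts it actually holds locally:

```sql
-- Illustrative sketch: parts registered for replica 'one' in ZooKeeper that no
-- longer exist as active local parts, i.e. the ZK/FS inconsistency described above.
SELECT name AS part_in_zk
FROM system.zookeeper
WHERE path = '/clickhouse/test/replicas/one/parts'
  AND name NOT IN (
      SELECT name FROM system.parts WHERE table = 'test' AND active
  );
```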

`clearOldPartsAndRemoveFromZK`, which is triggered from the cleanup thread,
runs the cleanup sequence correctly: it first removes things from ZK and
then from the filesystem. I don't see much benefit in triggering it on
shutdown and would rather have it called only from a single place.

---

This is a very, very edge-case situation, but it proves that the current
"fix" (ClickHouse#25820) isn't
complete.

```
create table test(
    v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'one')
order by v
settings old_parts_lifetime = 30;

create table test2(
    v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'two')
order by v
settings old_parts_lifetime = 30;

create table test3(
    v UInt64
)
engine=ReplicatedMergeTree('/clickhouse/test', 'three')
order by v
settings old_parts_lifetime = 30;

insert into table test values (1), (2), (3);
insert into table test values (4);

optimize table test final;

detach table test;
detach table test2;

alter table test3 drop partition tuple();

attach table test;
attach table test2;
```

```
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/one/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/two/parts
all_0_0_0
all_1_1_0
(CONNECTED [localhost:9181]) /> ls /clickhouse/test/replicas/three/parts
```

```
detach table test;
attach table test;
```

`test` will now figure out that parts exist only in ZK and will issue `GET_PART`
after first removing parts from ZK.

`test2` will receive fetch requests for unknown parts and will trigger part checks itself.
Because `test` no longer has the parts in ZK, `test2` will mark them as LostForever.
It will also not insert empty parts, because the partition is empty.

`test` is left with `GET_PART` entries in the queue and is stuck.

```
SELECT
    table,
    type,
    replica_name,
    new_part_name,
    last_exception
FROM system.replication_queue

Query id: 74c5aa00-048d-4bc1-a2ea-6f69501c11a0

Row 1:
──────
table:          test
type:           GET_PART
replica_name:   one
new_part_name:  all_0_0_0
last_exception: Code: 234. DB::Exception: No active replica has part all_0_0_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)

Row 2:
──────
table:          test
type:           GET_PART
replica_name:   one
new_part_name:  all_1_1_0
last_exception: Code: 234. DB::Exception: No active replica has part all_1_1_0 or covering part. (NO_REPLICA_HAS_PART) (version 21.9.1.1)
```
azat added a commit to azat/ClickHouse that referenced this pull request Aug 30, 2021
…y part

AFAICS the problem is that some parts may be replaced with empty parts
(after ClickHouse#25820) and then removed by the cleanup thread because they are empty
[1] (while they should not be deleted, since the source part can still be downloaded):

    <details>

    ```
    2021.08.18 20:11:22.687933 [ 341 ] {} <Trace> test_dpefxp.alter_table_1 (0758ca24-90e7-452c-8758-ca2490e7252c): Created log entry for mutation -1_115_115_0_146
    ...
    2021.08.18 20:11:22.707609 [ 22825 ] {} <Trace> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Executing log entry to mutate part -1_115_115_0 to -1_115_115_0_146
    2021.08.18 20:11:22.707643 [ 22825 ] {} <Debug> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Source part -1_115_115_0 for -1_115_115_0_146 is not ready; will try to fetch it instead
    ...
    2021.08.18 20:11:22.709397 [ 333 ] {} <Trace> test_dpefxp.alter_table_6 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000001579 for part -1_115_115_0 because it is covered by part -1_115_115_0_146 that is currently executing.
    2021.08.18 20:11:22.718861 [ 22825 ] {} <Information> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): DB::Exception: No active replica has part -1_115_115_0_146 or covering part
    ...
    2021.08.18 20:11:27.936829 [ 295 ] {} <Information> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Going to replace lost part -1_115_115_0_146 with empty part
    2021.08.18 20:11:27.957839 [ 295 ] {} <Information> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Created empty part -1_115_115_0_146 instead of lost part
    ...
    2021.08.18 20:11:28.731635 [ 257 ] {} <Trace> test_dpefxp.alter_table_6 (ReplicatedMergeTreeCleanupThread): Cleared 190 old blocks from ZooKeeper
    ...
    2021.08.18 20:11:28.734507 [ 257 ] {} <Trace> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Will try to insert a log entry to DROP_RANGE for part: -1_115_115_0_146
    2021.08.18 20:11:28.779373 [ 22837 ] {} <Debug> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Removed 1 parts inside -1_115_115_0_146.
    ...
    2021.08.18 20:11:28.792600 [ 273 ] {} <Trace> test_dpefxp.alter_table_6 (766ae414-e113-4965-b66a-e414e1137965): Created log entry /clickhouse/tables/00993_system_parts_race_condition_drop_zookeeper_test_dpefxp/alter_table/log/log-0000003459 for merge -1_111_118_1
    ...
    2021.08.18 20:11:28.910988 [ 354 ] {} <Error> test_dpefxp.alter_table_7 (ReplicatedMergeTreeQueue): Code: 49. DB::Exception: Part -1_111_118_1 intersects next part -1_115_115_0_146. It is a bug. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):
    2021.08.18 20:11:31.282160 [ 305 ] {} <Error> test_dpefxp.alter_table_2 (ReplicatedMergeTreeQueue): Code: 49. DB::Exception: Part -1_111_118_1 intersects next part -1_115_115_0_146. It is a bug. (LOGICAL_ERROR), Stack trace (when copying this message, always include the lines below):
    ```

    </details>

  [1]: https://clickhouse-test-reports.s3.yandex.net/27752/59e3cb18f4e53c453951267b5599afeb664290d8/functional_stateless_tests_(release,_wide_parts_enabled).html
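
As a hedged aside (not part of the commit message), empty replacement parts like `-1_115_115_0_146` above can be spotted in `system.parts` before the cleanup thread removes them, since they are active parts with zero rows:

```sql
-- Illustrative sketch: active parts containing zero rows, which may include the
-- empty parts created to replace lost parts and later removed by the cleanup thread.
SELECT database, table, name, rows
FROM system.parts
WHERE active AND rows = 0;
```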

Labels

pr-improvement Pull request with some product improvements


Development

Successfully merging this pull request may close these issues.

Some merges may stuck

3 participants