Skip to content

Remove useless code in ReplicatedMergeTreeRestartingThread#36113

Merged
tavplubix merged 9 commits intomasterfrom
remove-useless-code-2
Jun 9, 2022
Merged

Remove useless code in ReplicatedMergeTreeRestartingThread#36113
tavplubix merged 9 commits intomasterfrom
remove-useless-code-2

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

The code from 2015 is no longer relevant, because session restore is impossible in our ZooKeeper client.

@robot-ch-test-poll robot-ch-test-poll added the pr-not-for-changelog This PR should not be mentioned in the changelog label Apr 10, 2022
@tavplubix tavplubix self-assigned this Apr 11, 2022
@tavplubix
Copy link
Copy Markdown
Member

It's not really useless, it makes sense, for example, if zk client started new session due to operation timeout (default is 10 seconds) and ephemeral node from previous session still exists (default session timeout is 30 seconds). It's highly unlikely that some replica will remove is_active node of other replica, because:

  • replicas must have the same replica_name in engine arguments
  • replicas must have the same pid and "seed", where "seed" is generated from thread id and current time (btw, hash.update(&times) looks redundant)

Maybe we should add hostname to random seed (to avoid collisions when multiple servers start simultaneously).
Also we can avoid manual removal of is_active node and try to wait for it to disappear instead (like here). It will solve the issue with hard restart when all replicated tables fail to start replication because ephemeral nodes still exist.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

alexey-milovidov commented Apr 11, 2022

+ agree. Waiting ephemeral node to disappear looks to be more safe.
And let's add the host name to the seed.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

Could you please takeover this branch?

@tavplubix
Copy link
Copy Markdown
Member

Stateful tests (address, actions) - cannot copy extracted data for './usr/bin/clickhouse' to '/usr/bin/clickhouse.dpkg-new': failed to write (No space left on device) - something's wrong with CI infrastructure, cc: @Felixoid
Stateful tests (memory, actions) - same
Integration tests (asan, actions) [1/3] - test_polymorphic_parts - Ephemeral node /clickhouse/tables/test/shard5/sync_table/replicas/0/is_active still exists after 6s, it's because client's and server's configs are inconsistent, will fix configs and update exception message
Integration tests (asan, actions) [2/3] - Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Integration tests (release, actions) [1/2] - same
Stress test (memory, actions) - OOM

@tavplubix
Copy link
Copy Markdown
Member

tavplubix commented Apr 25, 2022

Many integration tests failed with
Cannot start service zoo3: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "--config=/etc/clickhouse-keeper/keeper_config3.xml": stat --config=/etc/clickhouse-keeper/keeper_config3.xml: no such file or directory: unknown
because keeper_cmd_prefix was not set, probably because docker_compose_keeper.yml appeared to be newer than cluster.py. Both files were changed in #31833. Maybe it's because we take compose-files from the integration-tests-runner image


and cluster.py from repository. So it's some race condition in CI infrastructure, сс: @Felixoid.
Btw, why do we put all compose-files into integration-tests-runner image?

Stress test - Backward compatibility check - Ephemeral node /clickhouse/tables/02232_allow_only_replicated_engine_test_1/root/replicas/1/is_active still exists after 60s, probably it's owned by someone else., it's interesting... Ephemeral nodes were not removed from Keeper after server restart, cc: @alesapin

2022.04.22 20:22:51.718676 [ 236578 ] {} <Information> test_1.mute_stylecheck: Became leader
2022.04.22 20:22:51.718895 [ 236703 ] {} <Debug> test_1.mute_stylecheck (ReplicatedMergeTreeRestartingThread): Activating replica.
2022.04.22 20:23:51.740708 [ 236703 ] {} <Error> test_1.mute_stylecheck (ReplicatedMergeTreeRestartingThread): void DB::ReplicatedMergeTreeRestartingThread::run(): Code: 49. DB::Exception: Ephemeral node /clickhouse/tables/02232_allow_only_replicated_engine_test_1/root/replicas/1/is_active still exists after 60s, probably it's owned by someone else. Either session_timeout_ms in client's config is different from server's config or it's a bug. Node data: 'pid: 62505, random: 8116781955302429524'. (LOGICAL_ERROR)

@Felixoid
Copy link
Copy Markdown
Member

Please, rebase it on / merge master into the branch, please. It's the easiest way to have the issue fixed for now.

Btw, why do we put all compose-files into integration-tests-runner image?

Can't really say, didn't dig into them yet.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Apr 30, 2022

update

✅ Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member

Tests will fail anyway because of the bug in Keeper described above

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented May 22, 2022

update

✅ Branch has been successfully updated

@tavplubix tavplubix added the do not test disable testing on pull request label May 22, 2022
@alesapin
Copy link
Copy Markdown
Member

alesapin commented Jun 3, 2022

Related: #37824

@tavplubix
Copy link
Copy Markdown
Member

And now it makes sense to merge with master and restart tests

@tavplubix tavplubix removed the do not test disable testing on pull request label Jun 3, 2022
@tavplubix
Copy link
Copy Markdown
Member

@Mergifyio update

@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Jun 3, 2022

update

✅ Branch has been successfully updated

@tavplubix
Copy link
Copy Markdown
Member

Integration tests:
test_keeper_zookeeper_converter - #37890
test_parts_delete_zookeeper/test.py::test_merge_doesnt_work_without_zookeeper - #37957
Stress test (memory, actions) - #37664

@tavplubix tavplubix merged commit 780f7c8 into master Jun 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants