Skip to content

Should fail ddl query as soon as possible if table is shutdown#19684

Merged
tavplubix merged 7 commits intoClickHouse:masterfrom
yiguolei:master
Feb 2, 2021
Merged

Should fail ddl query as soon as possible if table is shutdown#19684
tavplubix merged 7 commits intoClickHouse:masterfrom
yiguolei:master

Conversation

@yiguolei
Copy link
Copy Markdown
Contributor

@yiguolei yiguolei commented Jan 27, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Background thread which executes ON CLUSTER queries might hang waiting for dropped replicated table to do something. It's fixed.

Detailed description / Documentation draft:

Check table status during ddl execution. Set task status to exception if the table is shutdown by drop command or others.

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jan 27, 2021
@tavplubix tavplubix self-assigned this Jan 27, 2021
@tavplubix
Copy link
Copy Markdown
Member

Ok, but could you please explain which problem are you tying to solve? If table is partially shutdown, then status.is_leader must be false and everything should work fine, because DDLWorker takes it into account and tries to execute query on other replica. If it's not possible for some reason, then query fails with Task ... was not executed by anyone ... after several retries.

@yiguolei
Copy link
Copy Markdown
Contributor Author

yiguolei commented Jan 27, 2021

@tavplubix If all replica is dropped, status.is_leader is false on all replicas, so that the number of tries will not be updated by any replica. And it will have to wait for timeout currently it is MAX_EXECUTION_TIMEOUT_SEC = 3600s. It is too long and will blocking other DDL tasks.

@tavplubix
Copy link
Copy Markdown
Member

tavplubix commented Jan 27, 2021

You're right, there is a bug in DDLWorker. However, DROP TABLE is not the only thing that partially shutdowns replicated table (and that's why test_ddl_worker_non_leader/test.py::test_non_leader_replica failed), it also happens on lost ZooKeeper connection and SYSTEM RESTART REPLICA. I'm not sure if we should finish DDL query with Table is shutdown ... in these cases. Maybe check IStorage::is_dropped flag instead of partial_shutdown_called? On the other hand, is_dropped is false for detached tables, but DETACH/ATTACH are usually used to restart replica (and that's how SYSTEM RESTART REPLICA works). To avoid waiting on permanently detached tables we can use something like this:

bool replica_dropped = storage->is_dropped;
bool all_replicas_likely_detached = status.active_replicas == 0 && !DatabaseCatalog::instance().isTableExist(table_id, context);
if (replica_dropped || all_replicas_likely_detached) /// Table is shutdown ...

What do you think?

Btw, could you please add a test? It can be simple stateless test with test_shard_localhost cluster with one replica.

@robot-clickhouse robot-clickhouse added pr-bugfix Pull request with bugfix, not backported by default and removed pr-improvement Pull request with some product improvements labels Jan 27, 2021
@yiguolei
Copy link
Copy Markdown
Contributor Author

@tavplubix You are right. I have another commit. ^-^
But I find it is hard to test it, because the table has to be dropped during the ddl task's running stage. It is very difficult to make such a test. Could you please provide some idea?

@tavplubix
Copy link
Copy Markdown
Member

Take a look at *.sh tests (such as 01150_ddl_guard_rwr and 00993_system_parts_race_condition_drop_zookeeper). You can run two threads, one thread will execute CREATE and DROP and other thread will execute ALTER ... ON CLUSTER.

robot-clickhouse pushed a commit that referenced this pull request Feb 2, 2021
robot-clickhouse pushed a commit that referenced this pull request Feb 2, 2021
tavplubix added a commit that referenced this pull request Feb 3, 2021
Backport #19684 to 21.1: Should fail ddl query as soon as possible if table is shutdown
tavplubix added a commit that referenced this pull request Feb 3, 2021
Backport #19684 to 20.12: Should fail ddl query as soon as possible if table is shutdown
tavplubix added a commit that referenced this pull request Feb 3, 2021
Backport #19684 to 20.11: Should fail ddl query as soon as possible if table is shutdown
tavplubix added a commit that referenced this pull request Feb 3, 2021
Backport #19684 to 21.2: Should fail ddl query as soon as possible if table is shutdown
robot-clickhouse pushed a commit that referenced this pull request Feb 15, 2021
tavplubix added a commit that referenced this pull request Feb 16, 2021
Backport #19684 to 20.8: Should fail ddl query as soon as possible if table is shutdown
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants