Skip to content

Improvements in lost replica recovery#15701

Merged
tavplubix merged 4 commits intomasterfrom
better_clone_replica
Oct 8, 2020
Merged

Improvements in lost replica recovery#15701
tavplubix merged 4 commits intomasterfrom
better_clone_replica

Conversation

@tavplubix
Copy link
Copy Markdown
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Bug Fix

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed too low default value of max_replicated_logs_to_keep setting, which might cause replicas to become lost too often. Improve lost replica recovery process by choosing the most up-to-date replica to clone. Also do not remove old parts from lost replica, detach them instead.

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Oct 6, 2020
@tavplubix
Copy link
Copy Markdown
Member Author

Integration tests (release) - test_always_fetch_merged/test.py::test_replica_always_download - flacky
Performance - unrelated
Stress test (undefined) — Server failed to start - not a bug, seems like we need more retries in alive check

/// For compatibility with older ClickHouse versions
get_is_lost.stat.version = -1;
}
else if (get_is_lost.data != "0")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Write a log message about it?

if (!log_entries.empty())
{
String last_entry_num = std::max_element(log_entries.begin(), log_entries.end())->substr(4);
max_log_entry = std::stol(last_entry_num);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse<UInt64>(last_entry_num.substr(strlen("log-"))) looks more clear

if (source_replica.empty())
throw Exception("All replicas are lost", ErrorCodes::ALL_REPLICAS_LOST);

if (is_new_replica)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add info about source queue size and log pointer.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add log pointer here?

LOG_INFO(log, "Replica {} has approximate {} queue lag and {} queue size", source_replica_name, replica_queue_lag, replica_queue_size);

@tavplubix
Copy link
Copy Markdown
Member Author

AST fuzzer - known issue
Docs check - unrelated
Performance - unrelated

@tavplubix tavplubix merged commit 7b5f2b2 into master Oct 8, 2020
@tavplubix tavplubix deleted the better_clone_replica branch October 8, 2020 20:19
tavplubix added a commit that referenced this pull request Oct 9, 2020
Backport #15701 to 20.9: Improvements in lost replica recovery
tavplubix added a commit that referenced this pull request Oct 9, 2020
Backport #15701 to 20.8: Improvements in lost replica recovery
tavplubix added a commit that referenced this pull request Oct 9, 2020
Backport #15701 to 20.7: Improvements in lost replica recovery
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants