
Improve the time to recover keeper connections#42541

Merged
nikitamikhaylov merged 15 commits into ClickHouse:master from Algunenano:restaring-thread-investigation
Oct 25, 2022

Conversation

@Algunenano (Member) commented Oct 20, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Improve the time to recover lost keeper connections

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Based on #42323
Closes #42323
Closes #42251

Two main changes:

  • Ephemeral nodes are now created with the server UUID as their content. When we find such a node unexpectedly (because of a hard crash or a lost ZK connection) we delete it manually if it belonged to this server; otherwise we wait as before (up to 3x the session timeout).
  • The recovery task (ReplicatedMergeTreeRestartingThread) is now relaunched on failure. Previously it was re-scheduled sometimes immediately and sometimes after 1 second (10 seconds in older releases), but the logic was buggy and in some situations it failed to re-schedule at all (thus falling back to the automated checks). Now failures are re-scheduled after 100-10000 ms.
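The randomized 100-10000 ms reschedule delay can be sketched as follows (a minimal illustration under assumptions, not the actual ClickHouse scheduling code; the name pickRescheduleDelayMs is hypothetical):

```cpp
#include <cstdint>
#include <random>

/// Hypothetical helper mirroring "failures are re-scheduled after 100-10000 ms":
/// pick a uniformly random delay in that range for the next restart attempt.
uint64_t pickRescheduleDelayMs(std::mt19937 & rng)
{
    std::uniform_int_distribution<uint64_t> dist(100, 10000);
    return dist(rng);
}
```

Randomizing the delay also avoids many replicas hammering the keeper at the same instant after a shared outage.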

After this is discussed and reviewed, I think it would make sense for some queries to be able to wait for ZK recovery instead of failing immediately; it would be something similar to lock_acquire_timeout, but just for this. For example, with the new improvements an INSERT could avoid throwing an exception by waiting 100-300 ms for a new ZK connection to be ready.

static String generateActiveNodeIdentifier()
{
-    return "pid: " + toString(getpid()) + ", random: " + toString(randomSeed());
+    return Field(ServerUUID::get()).dump();
}
Member Author

I decided to change this to use only the server UUID so that recovery is also faster after unclean shutdowns. There might be better approaches.
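The ownership check this enables can be sketched as a pure decision helper (names like decideStaleNodeAction are illustrative, not from the ClickHouse code; the real logic reads the ephemeral node's content via keeper and compares it to the server UUID):

```cpp
#include <string>

enum class StaleNodeAction { RemoveNow, WaitForSessionExpiry };

/// Hypothetical sketch: if a leftover is_active node carries our own server
/// UUID, it was created by this very server (e.g. before a hard crash) and can
/// be deleted immediately; otherwise we wait for the old session to expire,
/// as before.
StaleNodeAction decideStaleNodeAction(const std::string & node_content, const std::string & server_uuid)
{
    return node_content == server_uuid ? StaleNodeAction::RemoveNow
                                       : StaleNodeAction::WaitForSessionExpiry;
}
```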

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 20, 2022
@devcrafter
Member

After this is discussed and reviewed, I'm thinking it would make sense for some queries to be able to wait for ZK recovery instead of immediately fail; it would be something similar to lock_acquire_timeout but just for this. For example, with the new improvements an INSERT could avoid throwing an exception if it waited for 100-300 ms for a new ZK connection to be ready.

INSERT retries should cover it, see #39764

@tavplubix tavplubix self-assigned this Oct 21, 2022
Comment on lines +70 to +75
            task->scheduleAfter(immediately_ms);
        }
    }
    catch (...)
    {
        task->scheduleAfter(immediately_ms);
Member

It will retry every 100-1000 ms even if the current replica is completely partitioned from the cluster. Should we increase this value a bit on each failure until it reaches 10 s (the original value of retry_period_ms)?

Member Author

Sure. I went with 1 s because that was the previous reduced value, but it would be better to have exponential backoff (or some other, simpler backoff) up to 10 seconds.
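The backoff discussed above could look roughly like this (a sketch under assumptions; backoffDelayMs and its constants are hypothetical, chosen to match the 100 ms starting delay and the original 10 s retry_period_ms cap):

```cpp
#include <algorithm>
#include <cstdint>

/// Hypothetical exponential backoff: start at 100 ms, double on each
/// consecutive failure, cap at 10000 ms (the original retry_period_ms).
uint64_t backoffDelayMs(unsigned consecutive_failures)
{
    const uint64_t initial_ms = 100;
    const uint64_t max_ms = 10000;
    /// Clamp the exponent so the shift cannot overflow for large failure counts.
    unsigned exp = std::min(consecutive_failures, 16u);
    return std::min(initial_ms << exp, max_ms);
}
```

With these constants the delay sequence is 100, 200, 400, ... ms, reaching the 10 s cap after 7 consecutive failures.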

storage.partial_shutdown_event.set();
storage.replica_is_active_node = nullptr;

LOG_TRACE(log, "Waiting for threads to finish");
Member

These messages were useful sometimes

Member Author

It was a mistake on my part. I had added a bunch of extra logs to help me understand what was going on and, when removing them, accidentally deleted some that I didn't add.

@nikitamikhaylov
Member

Stress tests (thread) - Azure thread leak. #42640

@nikitamikhaylov nikitamikhaylov merged commit 0016bc2 into ClickHouse:master Oct 25, 2022
@tavplubix
Member

Hm, there's an issue: quorum inserts rely on the content of the is_active node:

/// And what if it is possible that the current replica at this time has ceased to be active
/// and the quorum is marked as failed and deleted?
String value;
if (!zookeeper->tryGet(storage.replica_path + "/is_active", value, nullptr)
    || value != is_active_node_value)
    throw Exception("Replica become inactive while waiting for quorum", ErrorCodes::NO_ACTIVE_REPLICAS);

Maybe we should compare the node version instead.

@Algunenano
Member Author

Hm, there's an issue: quorum inserts rely on content of is_active node:

I'm not familiar with that part of the code. Do you mind explaining what the issue is? Before the change the value changed between restarts and was stable within the process (AFAICS); now the value is stable between restarts too. Is that a problem?

@tavplubix
Member

According to the code I posted above, the value is supposed to change between sessions (so probably this code did not work correctly even before this PR)

@tavplubix
Member

According to the code I posted above, the value is supposed to change between sessions (so probably this code did not work correctly even before this PR)

Fixed in #42878


Labels

pr-improvement Pull request with some product improvements


Development

Successfully merging this pull request may close these issues.

Replicated table takes too long to recover after ZK session expired

6 participants