Skip to content

ZK watch leak in DDLWorker #26036

@nvartolomei

Description

@nvartolomei

DDLWorker thread checks in a loop for new tasks by calling ZooKeeper GetChildren with a watch callback.

Strings queue_nodes = zookeeper->getChildren(queue_dir, nullptr, queue_updated_event);

Due to a recent change from #25373 instead of blocking on new events we retry the loop every 10 seconds instead of when watch callback is triggered.

/// FIXME It may hang for unknown reason. Timeout is just a hotfix.
constexpr int queue_wait_timeout_ms = 10000;
queue_updated_event->tryWait(queue_wait_timeout_ms);

This causes a zookeeper watch callback leak. It is probably not very important as all that happens is that a new std::function is added to the watches list for the ddl worker path every 10 seconds. What is 32 bytes (sizeof(std::function)?) every 10 seconds?

However, it is annoying to see that on metrics. The metric is somewhat useful for catching bugs and this isn't an important one.

I understand that we could probably fix this "temporary fix" in DDLWorker and remove the tryWait + retrying, but that pattern was seen in other places and is useful for implementing a non-blocking watch, also used in waitForTableReplicaToProcessLogEntry.

Is there a laconic fix for that? Maybe we could somehow remove the callback when not needed/before retry? Maybe we could change the behaviour slightly and add watchcallback only if it wasn't already added (this wouldn't work for getChildren which wraps a Poco Event in a watch)? Anything else beside just ignoring this?


I understand that all the callbacks will be cleared when the event is triggered.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions