
Async log improvements#83214

Merged
Algunenano merged 5 commits into ClickHouse:master from Algunenano:async_log_improvements
Jul 4, 2025

Conversation

@Algunenano
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Async logs: Limit the max number of entries that are held in the queue

Note that each channel has its own queue, so a message might be dropped in syslog or console (because they are the slowest ones) but still make it to text files or system.text_log. The limit of 10k is arbitrary.

I replaced Poco's NotificationQueue with our own simplified queue. I also tested several lock-free queues with bad results: they were either slower, had TSAN issues, or did not support blocking on pop(), which would have meant busy-waiting. I also tried replacing the condition_variable with an std::atomic_bool. It appeared to be slightly faster on the producer side, but it consumed far more CPU in the consumer, so I ended up discarding it.
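The approach described above (a bounded queue where the producer drops the message when full and the consumer blocks on a condition_variable) can be sketched roughly like this. This is a minimal illustrative version with hypothetical names, not the actual ClickHouse implementation:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

/// Sketch of a bounded logging queue: the producer drops the message
/// when the queue is full instead of blocking, while the consumer
/// blocks on a condition_variable until a message arrives.
class BoundedLogQueue
{
public:
    explicit BoundedLogQueue(size_t max_size_) : max_size(max_size_) {}

    /// Returns false if the message was dropped because the queue is full.
    bool push(std::string message)
    {
        {
            std::lock_guard lock(mutex);
            if (queue.size() >= max_size)
                return false;
            queue.push_back(std::move(message));
        }
        cv.notify_one();
        return true;
    }

    /// Blocks until a message is available, then returns it in FIFO order.
    std::string pop()
    {
        std::unique_lock lock(mutex);
        cv.wait(lock, [this] { return !queue.empty(); });
        std::string message = std::move(queue.front());
        queue.pop_front();
        return message;
    }

private:
    const size_t max_size;
    std::mutex mutex;
    std::condition_variable cv;
    std::deque<std::string> queue;
};
```

Dropping on push rather than blocking is the key design choice: a slow sink (e.g. syslog) must never stall the thread that is trying to log.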

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jul 3, 2025

Workflow [PR], commit [16c54a8]

Summary:

job_name: Stateless tests (arm_binary) (failure)
test_name: 00980_zookeeper_merge_tree_alter_settings
status: FAIL

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Jul 3, 2025
@alexey-milovidov alexey-milovidov self-assigned this Jul 3, 2025
@Algunenano
Member Author

@Algunenano Algunenano added this pull request to the merge queue Jul 4, 2025
Merged via the queue into ClickHouse:master with commit 21dea30 Jul 4, 2025
119 of 122 checks passed
@Algunenano Algunenano deleted the async_log_improvements branch July 4, 2025 10:47
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 4, 2025
baibaichen pushed a commit to Kyligence/gluten that referenced this pull request Jul 5, 2025
baibaichen pushed a commit to apache/gluten that referenced this pull request Jul 6, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250705)

* Fix benchmark build

* Fix Benchmark build due to ClickHouse/ClickHouse#79417

* Revert "Fix Build due to ClickHouse/ClickHouse#80931"

This reverts commit 02d12f6.

* Fix Build due to ClickHouse/ClickHouse#81886

* Fix Link issue due to ClickHouse/ClickHouse#83121

* Fix Build due to ClickHouse/ClickHouse#82604

* Fix Build due to ClickHouse/ClickHouse#82945

* Fix Build due to ClickHouse/ClickHouse#83214

---------

Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang chen <[email protected]>
class AsyncLogMessageQueue
{
/// Maximum size of the queue, to prevent memory overflow
static constexpr size_t max_size = 10'000;
Member

@azat azat Aug 4, 2025


Hm, I think this can be too small (not in terms of the number of rows, but in terms of memory: it is OK to use 100MiB for the logs queue, while 10K entries is likely only around 10MiB), so maybe we should make this configurable?

Also since this can eat memory, let's add some metrics for the number of messages in the queue?

Member


Also, we need an event for messages dropped due to queue overflow.

Member Author


Yes, all 3 make sense.

  • Make the limit configurable.
  • Add a metric with the number of pending messages in each queue (although naming might be hard; I need to think about how best to handle it).
  • Add a profile event for dropped messages, ideally also per queue.

I'll look into it.
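A minimal way to track both suggested counters (pending and dropped messages) is a pair of atomics updated on push and pop. This is only a sketch with hypothetical names; the real implementation would plug into ClickHouse's existing CurrentMetrics/ProfileEvents infrastructure instead:

```cpp
#include <atomic>
#include <cstddef>

/// Sketch: lightweight counters a bounded log queue could update.
/// `pending` mirrors the current queue size (gauge-like metric),
/// `dropped` only ever grows (profile-event-like counter).
struct LogQueueStats
{
    std::atomic<size_t> pending{0};
    std::atomic<size_t> dropped{0};

    /// Called after each push attempt; `accepted` is false when the
    /// message was discarded because the queue was full.
    void onPush(bool accepted)
    {
        if (accepted)
            pending.fetch_add(1, std::memory_order_relaxed);
        else
            dropped.fetch_add(1, std::memory_order_relaxed);
    }

    /// Called after each successful pop.
    void onPop() { pending.fetch_sub(1, std::memory_order_relaxed); }
};
```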

Aside from that, there is another improvement I didn't make because I didn't expect it to be necessary: the flushing thread could flush multiple messages at once (up to a certain limit). Instead of popping one message and flushing after each, we could pop up to 20 messages if available and flush the stream once. At least for the console and disk sinks this should be faster.
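The batched-flush idea could look roughly like this (a sketch with hypothetical names, assuming the same mutex/condition_variable queue described earlier): the consumer drains up to a batch limit under a single lock, then writes and flushes the sink once per batch.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

/// Sketch: a queue whose consumer can drain several messages per
/// wakeup, so the sink's stream is flushed once per batch instead of
/// once per message.
class BatchingLogQueue
{
public:
    void push(std::string message)
    {
        {
            std::lock_guard lock(mutex);
            queue.push_back(std::move(message));
        }
        cv.notify_one();
    }

    /// Blocks until at least one message is available, then returns
    /// up to max_batch messages in FIFO order.
    std::vector<std::string> popBatch(size_t max_batch)
    {
        std::unique_lock lock(mutex);
        cv.wait(lock, [this] { return !queue.empty(); });
        std::vector<std::string> batch;
        while (!queue.empty() && batch.size() < max_batch)
        {
            batch.push_back(std::move(queue.front()));
            queue.pop_front();
        }
        return batch;
    }

private:
    std::mutex mutex;
    std::condition_variable cv;
    std::deque<std::string> queue;
};
```

The consumer loop would then be: take a batch of up to 20 messages, write each one to the stream, and flush once at the end, amortizing the (relatively expensive) flush over the whole batch.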
