Balanced reading from JBOD #16423

Merged
KochetovNicolai merged 3 commits into ClickHouse:master from amosbird:jbodread on Oct 29, 2020

amosbird (Collaborator) commented Oct 27, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Better read task scheduling for JBOD architecture and MergeTree storage. Added a new setting, read_backoff_min_concurrency, which serves as the lower limit on the number of reading threads.

Detailed description / Documentation draft:

Disk-aware read scheduling helps avoid tail latency issues when reading huge amounts of data from a JBOD array. I've observed a lot of read clustering: all threads read concurrently from one disk for 20 seconds, and then switch to another disk for the next 20 seconds.

I've tested it in a production environment with a 12-disk JBOD array, and the results are very promising.

The baseline takes 573.039 sec. With JBOD task split it takes 429.389 sec, and with random read task stealing it takes 185.612 sec.
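The task split can be pictured with a small sketch. This is not the code from this PR; it is a simplified illustration that assumes each part knows the name of the disk it lives on. The `Part` struct and the `splitPartsByDisk` function are hypothetical names, and the real scheduler balances marks rather than whole parts.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a data part: only what this sketch needs.
struct Part
{
    std::string name;
    std::string disk;
};

// Deal parts to threads so that consecutive threads draw from different disks
// whenever possible. Returns one list of part names per thread.
std::vector<std::vector<std::string>> splitPartsByDisk(const std::vector<Part> & parts, size_t threads)
{
    assert(threads > 0);

    // Group parts by disk name; std::map keeps the disk order deterministic.
    std::map<std::string, std::vector<std::string>> by_disk;
    for (const auto & part : parts)
        by_disk[part.disk].push_back(part.name);

    // One queue per disk, plus a cursor into each queue.
    std::vector<std::vector<std::string>> queues;
    for (auto & entry : by_disk)
        queues.push_back(std::move(entry.second));
    std::vector<size_t> cursor(queues.size(), 0);

    std::vector<std::vector<std::string>> per_thread(threads);
    size_t remaining = parts.size();
    size_t disk = 0;
    size_t thread = 0;
    while (remaining > 0)
    {
        // Skip disks whose queue is already exhausted.
        while (cursor[disk] == queues[disk].size())
            disk = (disk + 1) % queues.size();

        per_thread[thread].push_back(queues[disk][cursor[disk]++]);
        --remaining;

        // Before moving on to the next thread, move on to the next disk.
        disk = (disk + 1) % queues.size();
        thread = (thread + 1) % threads;
    }
    return per_thread;
}
```

With two disks holding two parts each and two threads, this sketch gives each thread the parts of a single disk, so the threads do not cluster on one disk.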

It works well with the current read backoff mechanism.
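As a rough illustration of how a lower limit such as read_backoff_min_concurrency can interact with backoff, here is a minimal sketch. It assumes backoff drops one reading thread after a fixed number of slow reads; the `BackoffState` struct, the `onSlowRead` function, and the `max_events` parameter are hypothetical and are not taken from the ClickHouse sources.

```cpp
#include <cstddef>

// Hypothetical backoff state; field names are illustrative only.
struct BackoffState
{
    size_t current_threads;
    size_t min_concurrency; // plays the role of read_backoff_min_concurrency
    size_t slow_events = 0;
};

// Record one slow read. After max_events slow reads, drop one reading thread,
// but never go below min_concurrency. Returns the resulting thread count.
size_t onSlowRead(BackoffState & state, size_t max_events)
{
    if (state.current_threads <= state.min_concurrency)
        return state.current_threads; // already at the floor, nothing to reduce

    if (++state.slow_events < max_events)
        return state.current_threads; // not enough slow reads yet

    state.slow_events = 0;
    --state.current_threads;
    return state.current_threads;
}
```

Starting from 4 threads with a floor of 3 and `max_events` set to 2, the second slow read reduces the count to 3, and further slow reads leave it at 3.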

Update

Random stealing incurs a reader reinitialization cost, so we now use a different scheme. First we check whether any backed-off threads can be resurrected. If not, we steal from the next thread. Thanks to the pre-balanced workloads, this should give a fairly uniform distribution in general.

With this steal strategy, the runtime ranges from 105 to 135 sec.
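The two-step choice described in the update can be sketched as follows. This is one possible reading of the strategy, not the merged code: `ThreadQueue` and `chooseVictim` are hypothetical names, and "next" is assumed to mean the next thread in cyclic order that still has tasks left.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical per-thread queue state for this sketch.
struct ThreadQueue
{
    size_t remaining_tasks = 0;
    bool backed_off = false; // the thread was stopped by read backoff
};

// Pick the queue that an idle thread should take work from, if any.
std::optional<size_t> chooseVictim(const std::vector<ThreadQueue> & queues, size_t idle_thread)
{
    // 1. Prefer the leftover work of a backed-off thread ("resurrect" it).
    for (size_t i = 0; i < queues.size(); ++i)
        if (queues[i].backed_off && queues[i].remaining_tasks > 0)
            return i;

    // 2. Otherwise steal from the next thread, in cyclic order, that still has work.
    for (size_t step = 1; step < queues.size(); ++step)
    {
        size_t candidate = (idle_thread + step) % queues.size();
        if (queues[candidate].remaining_tasks > 0)
            return candidate;
    }

    return std::nullopt; // nothing left to take
}
```

Because the workloads are pre-balanced per disk, always taking from the next thread rather than a random one keeps the choice deterministic while still spreading the stolen work.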

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 27, 2020
@KochetovNicolai KochetovNicolai self-assigned this Oct 27, 2020
@amosbird amosbird force-pushed the jbodread branch 4 times, most recently from 52259fe to 55673c1 Compare October 28, 2020 20:03
}

/// Before processing next thread, change volume if possible.
/// Different threads will likely start reading from different volumes,
Collaborator Author

It's actually different disks


{
/// Group parts by volume name.
/// We try minimize the number of threads concurrently read from the same volume.
Collaborator Author

It's disk instead of volume.

@KochetovNicolai
Member

The Yandex third-party check is broken in master.
The integration tests (asan) failure is a CI issue.
