Skip to content

Auto-balance big parts in JBOD array#16481

Merged
alesapin merged 5 commits intoClickHouse:masterfrom
amosbird:jbodbalance
Mar 12, 2021
Merged

Auto-balance big parts in JBOD array#16481
alesapin merged 5 commits intoClickHouse:masterfrom
amosbird:jbodbalance

Conversation

@amosbird
Copy link
Copy Markdown
Collaborator

@amosbird amosbird commented Oct 28, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Introduce a new merge tree setting min_bytes_to_rebalance_partition_over_jbod which allows assigning new parts to different disks of a JBOD volume in a balanced way.

Detailed description / Documentation draft:

JBOD volume is notorious in that only a small subset of disks can be fully utilized for a given query. This PR focuses on a particular common scenario: partition-wise queries. Now big new parts are assigned to different disks of a JBOD volume in a balanced way.
It might be useful for #16300.

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Oct 28, 2020
@amosbird amosbird marked this pull request as draft October 28, 2020 15:25
@amosbird amosbird force-pushed the jbodbalance branch 2 times, most recently from c28def2 to e10faae Compare October 29, 2020 07:32
@alesapin alesapin self-assigned this Nov 20, 2020
@amosbird amosbird marked this pull request as ready for review February 18, 2021 08:52
@amosbird amosbird requested a review from alesapin February 18, 2021 08:56
@amosbird amosbird force-pushed the jbodbalance branch 2 times, most recently from 9bc3ec3 to a36c60c Compare February 18, 2021 10:40
@amosbird
Copy link
Copy Markdown
Collaborator Author

@azat
Copy link
Copy Markdown
Member

azat commented Feb 19, 2021

Copy link
Copy Markdown
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm a little confused, but we just have three types of parts:

  1. Existing parts.
  2. Currently merging source parts that will become outdated (included in 1).
  3. Currently merging parts that will be added as a result of the merge.

For each new merged/fetched part we are trying to find a disk that has a minimum size of data in the same partition. So we just summing sizes by a disk of all existing parts from this same partition excluding parts 2) and adding parts 3). After that, we choose the smallest occupied disk.

The code looks slightly more complicated :)

Btw, how it will work with background moves? Also without tests merge is impossible.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unclear message. Which part is missing? What does it mean?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it possible?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because currently_emerging_parts is also maintained separately. We do commit before removing parts from it. Thus the previous getDataPartsStateRange(MergeTreeData::DataPartState::Committed) might contain (rarely) the same part.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, it's a LOGICAL_ERROR, isn't it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible that parts are stored on different volumes. We ignore them here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big_not_merging_parts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why they are already not here? Ho we can have covered parts which are not in currently_submerging_parts? Also why we don't check?

part->isStoredOnDisk() && part->getBytesOnDisk() >= min_bytes_to_rebalance_partition_over_jbod
                && part_info.partition_id == part->info.partition_id

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently_submerging_parts only records parts in previous balanced reservation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, no checks -- maybe this part from a different partition or small?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we should check partition here. It will be alway big though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment: "Used only for logging"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big_submerging_parst_from_partition?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set is used to remove some big valid parts out of calculation. It doesn't have currently submerging parts because those are not yet inserted into currently_submerging_parts, which only contains parts that participate in some other balanced activity.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why map and not unordered_map?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disk names are usually sorted.

for _ in range(10):
try:
print("Syncing replica")
node2.query("SYSTEM SYNC REPLICA tbl")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query ignores timeout, which exception do we expect?

Copy link
Copy Markdown
Collaborator Author

@amosbird amosbird Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added timeout = 10s. We can safely ignore any exceptions. The test checks the final state of disk balance.

"insert into tmp2 select randConstant() % 2, randomPrintableASCII(16) from numbers(50)"
)

time.sleep(2)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we need it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't.

WriteBufferFromOwnString log_str;
writeCString("\nbalancer: \n", log_str);
for (const auto & [disk_name, per_disk_parts] : disk_parts_for_logging)
writeString(fmt::format(" {}: [{}]\n", disk_name, boost::algorithm::join(per_disk_parts, ", ")), log_str);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, fmt::join looks prettier.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Copy Markdown
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, quite isolated and useful code. Also, we have a setting that allows disabling this behavior.

@alesapin alesapin merged commit cca05da into ClickHouse:master Mar 12, 2021
@sevirov
Copy link
Copy Markdown
Contributor

sevirov commented Jul 29, 2021

Internal documentation ticket: DOCSUP-12425

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants