
S3 zero copy replication #16240

Merged
alesapin merged 40 commits into ClickHouse:master from ianton-ru:s3_zero_copy_replication
Mar 14, 2021

Conversation

@ianton-ru
Contributor

ianton-ru commented Oct 21, 2020

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Zero-copy replication for ReplicatedMergeTree over S3 storage

Detailed description / Documentation draft:

Zero-copy replication over S3 storage

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Oct 21, 2020
@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch 2 times, most recently from 123025c to d5bc7ad Compare October 22, 2020 11:26
@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from d5bc7ad to 652c56e Compare October 22, 2020 13:07
@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from 698e68c to 4f7065e Compare October 23, 2020 12:02
@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from 4f7065e to e3879af Compare October 23, 2020 13:28
In response, the sending side returns the send_s3_metadata=1 cookie if metadata is going to be sent. In all other cases the data itself is sent, as before.

Before making the request, the receiver checks whether it will store the data in S3. The check is crude for now: if the storage policy contains an S3 disk, we assume it will be S3.
If it is S3, it sends send_s3_metadata=1 in the request.
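The receiver-side check (if any disk in the storage policy is S3, add send_s3_metadata=1 to the fetch request) can be sketched roughly as below. All names here (DiskKind, DiskInfo, the function names) are illustrative, not the actual ClickHouse API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the disks of a storage policy.
enum class DiskKind { Local, S3 };

struct DiskInfo
{
    std::string name;
    DiskKind kind;
};

// Crude check, as described above: any S3 disk in the policy
// makes the receiver assume the part will land on S3.
bool shouldRequestS3Metadata(const std::vector<DiskInfo> & storage_disks)
{
    for (const auto & disk : storage_disks)
        if (disk.kind == DiskKind::S3)
            return true;
    return false;
}

// The receiver adds send_s3_metadata=1 to the fetch request when the check passes.
std::string buildFetchQueryParam(const std::vector<DiskInfo> & disks)
{
    return shouldRequestS3Metadata(disks) ? std::string("send_s3_metadata=1")
                                          : std::string();
}
```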
Contributor
Strictly speaking, we cannot rule out the situation where both the receiver and the sender are on S3 but the buckets are different. We should probably send the bucket address, or some hash of it.

Contributor Author

The address can differ, for example, when the nodes are in different DCs and use different proxies to access S3. As a safeguard, after receiving the metadata the receiver checks availability (in fact, the presence in S3 of the first object of the first file, without downloading it, just via a list request), and if the data is not available it falls back to the old mechanism with a full copy.
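This safeguard (probe the first S3 object of the first file via a list request, and fall back to a full copy if it is missing) can be sketched as follows; listObjectExists stands in for a real S3 client call and the names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <string>

enum class FetchMode { S3Metadata, FullCopy };

// Decide how to fetch a part: if the first object of the first file is
// visible to this replica (checked via a list request, no download),
// the received metadata is usable; otherwise the buckets/endpoints
// differ and we fall back to the old full-copy path.
FetchMode chooseFetchMode(const std::string & first_object_key,
                          const std::function<bool(const std::string &)> & listObjectExists)
{
    if (listObjectExists(first_object_key))
        return FetchMode::S3Metadata;
    return FetchMode::FullCopy;
}
```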

Contributor Author

As a bonus, support for several different S3 disks came almost for free, for the case where different nodes have them in a different order (it iterates over all S3 disks in the storage looking for the right one). But it's unclear who might need this. :)

@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from b8f9f57 to eba98b0 Compare January 18, 2021 16:17
@ianton-ru ianton-ru changed the title from "[WIP] Prototype/MVP/Proof-of-concept etc. S3 zero copy replication" to "S3 zero copy replication" Jan 18, 2021
@Sallery-X
Contributor

How to solve the cache consistency problem of different replicas?

In ClickHouse data is "append only". Sometimes an entire part can be deleted. So could you please describe where you see a problem?

If multiple replicas share the same storage, how do we resolve shared-file conflicts when different replicas execute inserts and merges?

@Sallery-X
Contributor

I think we need to consider the DDL worker; there is no need to execute the DDL on all replicas.

@ianton-ru
Contributor Author

ianton-ru commented Mar 9, 2021

How to solve the cache consistency problem of different replicas?

In ClickHouse data is "append only". Sometimes an entire part can be deleted. So could you please describe where you see a problem?

If multiple replicas share the same storage, how do we resolve shared-file conflicts when different replicas execute inserts and merges?

On insert, one replica creates all the files of the part; after that, the files never change.
On merge, the merging replica takes a lock in ZooKeeper; the other replicas wait for it to finish and then fetch the merged part.
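The merge coordination (one replica wins a ZooKeeper lock and performs the merge; the rest fetch the result) can be sketched with an in-memory stand-in for the lock node. The path and API names below are illustrative, not ClickHouse's actual ZooKeeper layout:

```cpp
#include <cassert>
#include <set>
#include <string>

// Minimal in-memory stand-in for ZooKeeper's "create ephemeral node" call:
// creation succeeds only for the first caller, like a real ZK lock node.
struct FakeZooKeeper
{
    std::set<std::string> nodes;

    bool tryCreateEphemeral(const std::string & path)
    {
        return nodes.insert(path).second;  // false if the lock already exists
    }
};

enum class MergeAction { DoMerge, FetchMergedPart };

// Only the replica that wins the lock performs the merge; the others
// wait for it to finish and fetch the already-merged part.
MergeAction onMergeScheduled(FakeZooKeeper & zk, const std::string & part_name)
{
    const std::string lock_path = "/zero_copy_s3/merge_lock/" + part_name;  // illustrative path
    if (zk.tryCreateEphemeral(lock_path))
        return MergeAction::DoMerge;
    return MergeAction::FetchMergedPart;
}
```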

@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from 885ae9e to 18ff577 Compare March 9, 2021 14:56
@ianton-ru ianton-ru force-pushed the s3_zero_copy_replication branch from 18ff577 to aff13c0 Compare March 9, 2021 17:50
Member

@alesapin alesapin left a comment

OK, but we need to remove the suspicious if.

@alesapin
Member

Yarrr! Conflict on ErrorCode.

@boqu

boqu commented Apr 13, 2021

Can I know whether this change will be merged to 21.3-lts? If yes, do we have any timeline? Thanks!

namespace DB
{

struct DiskType
Member

This looks like an antipattern.
And there are zero comments in this file 😭

What does "disk type" mean, and why do we need to discriminate between them?
Maybe remove this file and all its usages?


Labels

pr-feature Pull request with new product feature


7 participants