
Filesystem abstraction layer #7946

Merged
alesapin merged 12 commits into ClickHouse:master from Alex-Burmak:filesystem_abstraction
Dec 12, 2019

Conversation

@Alex-Burmak
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):

Filesystem abstraction layer - supporting changes for running ClickHouse over S3 / HDFS.

Detailed description (optional):

The idea is to abstract interaction with the filesystem so that ClickHouse can run on top of backends other than a POSIX-compatible filesystem. First of all, this is going to add support for S3 and HDFS backends.

This PR contains the abstraction itself and its implementation for the local filesystem backend. Integration of the abstraction with the existing code base, as well as implementation of other backends, will be covered by follow-up PRs.

The abstraction is actually not a new one. Instead, the existing concept of Disk was refactored and its API was extended with operations for dealing with files and directories. The class itself is now an interface, IDisk, and its concrete implementations represent different backends: DiskLocal, DiskS3 and DiskHDFS (the last two are under development). As a result, the new functionality can be used together with storage policies. For example, we could create a storage policy where "hot" data is stored on local disks and S3 is used for "cold" data.

@Alex-Burmak Alex-Burmak requested a review from a team November 27, 2019 12:37
@ghost ghost requested review from vitlibar and removed request for a team November 27, 2019 12:37
@vitlibar vitlibar self-assigned this Nov 27, 2019
Member

@alesapin alesapin left a comment


Now we have two new entities: DiskFile and DiskFileIterator for it, and also a generalized interface for Disk. These entities look OK, but the implementation seems very tightly coupled and complicated to me. Regardless of disk type (FS, HDFS, S3), a path can be represented as a string or a very thin wrapper, like Poco::Path or Poco::URI. I think you should organize the interaction between your classes with paths, rather than passing shared pointers each time and inheriting from enable_shared_from_this.

Also, this code is very general, so you must provide comments for each class and method.

/// Creates a directory and all parent directories if necessary.
virtual void createDirectories() = 0;

virtual void moveTo(const String & new_path) = 0;
Member


How can we implement an atomic move of a file (at least without a full copy), for example over S3 storage? The MergeTree codebase relies on this behavior.

Contributor Author


Unfortunately, the S3 API doesn't support move or rename. The best we can do is implement moveTo using copy and delete.

In MergeTree, I think we can check the capabilities of the underlying storage (disk implementation) and act accordingly:

if (disk->support_atomic_move())
  // algorithm 1
else
  // algorithm 2

Member

@alexey-milovidov alexey-milovidov Dec 11, 2019


Atomic rename is crucial for MergeTree to work correctly. It cannot be emulated with copy alone; probably it can with copy plus an atomic switch, but that makes things complicated.

@alesapin
Member

alesapin commented Dec 3, 2019

And of course, this is a very small step toward the described task. The MergeTree codebase heavily relies on the local filesystem, and the proposed idea seems very challenging.

@Alex-Burmak Alex-Burmak requested a review from alesapin December 8, 2019 21:03
@alexey-milovidov
Member

alexey-milovidov commented Dec 11, 2019

We can store directories and files on the local filesystem, but instead of file contents, store an S3 URL along with a refcount inside them, and store the actual content in S3 (we can use an arbitrary path in S3, even a generated UUID as the file name).

Advantages:

  • we will be able to do atomic renames of files and directories;
  • we will be able to use hardlinks by incrementing the refcount;
  • we will spend less time on directory listings and other metadata operations.

Disadvantages:

  • we need to pay attention to the delete operation to avoid leaking data on S3.

To use ReplicatedMergeTree, we need to apply some tricks so that replication only sends references over the network instead of actual data. It is difficult to make refcounts work.

This can be made under the same FS abstraction layer. But it will actually abstract only file contents.

Metadata operations can be further abstracted by using another data structure instead of the FS. We can even store metadata in ZK, but this will require careful synchronization and notifications, which differs from both MergeTree and ReplicatedMergeTree.

@alesapin
Member

This code is quite isolated, so we can merge it. But I think the IDisk interface will change significantly in the future.

@alesapin alesapin merged commit 8fb9541 into ClickHouse:master Dec 12, 2019
@Alex-Burmak Alex-Burmak deleted the filesystem_abstraction branch December 12, 2019 15:17
@alexey-milovidov
Member

alexey-milovidov commented Dec 13, 2019

but instead of file contents, store S3 URL along with refcount

+ file size.
