
Filesystem abstraction layer #7946

Merged
alesapin merged 12 commits into ClickHouse:master from Alex-Burmak:filesystem_abstraction
Dec 12, 2019

Conversation

@Alex-Burmak
Contributor

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):

Filesystem abstraction layer - supporting changes for running ClickHouse over S3 / HDFS.

Detailed description (optional):

The idea is to abstract interaction with the filesystem so that ClickHouse can run on top of backends other than a POSIX-compatible filesystem. First of all, this is going to add support for S3 and HDFS backends.

This PR contains the abstraction itself and its implementation for the local filesystem backend. Integration of the abstraction with the existing code base, as well as implementation of other backends, will be covered by follow-up PRs.

The abstraction is actually not a new one. Instead, the existing concept of Disk was refactored and its API was extended with operations for dealing with files and directories. The class itself is now an interface, IDisk, and its concrete implementations represent different backends: DiskLocal, DiskS3 and DiskHDFS (the last two are under development). As a result, the new functionality can be used together with storage policies. For example, we could create a storage policy where "hot" data is stored on local disks and S3 is used for "cold" data.

@Alex-Burmak Alex-Burmak requested a review from a team November 27, 2019 12:37
@ghost ghost requested review from vitlibar and removed request for a team November 27, 2019 12:37
@vitlibar vitlibar self-assigned this Nov 27, 2019
Member

@alesapin alesapin left a comment


Now we have two new entities: DiskFile and DiskFileIterator for it, and also a generalized interface for Disk. These entities look OK, but the implementation seems very tightly coupled and complicated to me. Regardless of disk type (FS, HDFS, S3), a path can be represented as a string or a very thin wrapper, like Poco::Path or Poco::URI. I think you should organize the interaction between your classes with paths, rather than passing shared pointers each time and inheriting from enable_shared_from_this.

Also, this code is very general, so you must provide comments for each class and method.

/// Creates a directory and all parent directories if necessary.
virtual void createDirectories() = 0;

virtual void moveTo(const String & new_path) = 0;
Member


How can we implement an atomic move of a file (at least without a full copy), for example over S3 storage? The MergeTree codebase relies on this behavior.

Contributor Author


Unfortunately, the S3 API doesn't support move or rename. The best we can do is implement moveTo using copy and delete.

In MergeTree, I think we can check the capabilities of the underlying storage (disk implementation) and act accordingly:

if (disk->support_atomic_move())
  // algorithm 1
else
  // algorithm 2

Member

@alexey-milovidov alexey-milovidov Dec 11, 2019


Atomic rename is crucial for MergeTree to work correctly. It cannot be emulated with copy alone; probably it can with copy plus an atomic switch, but that makes things complicated.

@alesapin
Member

alesapin commented Dec 3, 2019

And of course, this is a very small step toward the described task. The MergeTree codebase heavily relies on the local filesystem, and the proposed idea seems very challenging.

@Alex-Burmak Alex-Burmak requested a review from alesapin December 8, 2019 21:03
@alexey-milovidov
Member

alexey-milovidov commented Dec 11, 2019

We can store directories and files on the local filesystem, but instead of file contents, store an S3 URL along with a refcount inside them, and store the actual content in S3 (we can use an arbitrary path in S3, even a generated UUID as the file name).

Advantages:

  • we will be able to do atomic renames of files and directories;
  • we will be able to use hardlinks by incrementing the refcount;
  • we will spend less time on directory listings and other metadata operations.

Disadvantages:

  • we need to pay attention to the delete operation to avoid leaking data on S3.

To use ReplicatedMergeTree, we need to apply some tricks so that replication only sends references over the network instead of actual data. It is difficult to make refcounts work.

This can be made under the same FS abstraction layer. But it will actually abstract only file contents.

Metadata operations can be further abstracted by using another data structure instead of the FS. We can even store metadata in ZK, but this will require careful synchronization and notifications, which differs from both MergeTree and ReplicatedMergeTree.

@alesapin
Member

This code is quite isolated, so we can merge it. But I think the IDisk interface will change significantly in the future.

@alesapin alesapin merged commit 8fb9541 into ClickHouse:master Dec 12, 2019
@Alex-Burmak Alex-Burmak deleted the filesystem_abstraction branch December 12, 2019 15:17
@alexey-milovidov
Member

alexey-milovidov commented Dec 13, 2019

but instead of file contents, store S3 URL along with refcount

+ file size.
