Filesystem abstraction layer#7946
Conversation
alesapin
left a comment
There was a problem hiding this comment.
Now we have two new entities: DiskFile and DiskFileIterator for it, and also generalized interface for Disk. These entities look Ok. But implementation seems very tight connected and complicated to me. Regardless of disk type (FS, HDFS, S3) path can be represented as string or very thin wrapper, like Poco::Path or Poco::URI. I think you should organize interaction between your classes with pathes, and don't pass shared pointers each time and inherit shared_for_this.
Also, this code is very general, you must provide comments to each class and method.
dbms/src/Disks/IDisk.h
Outdated
| /// Creates a directory and all parent directories if necessary. | ||
| virtual void createDirectories() = 0; | ||
|
|
||
| virtual void moveTo(const String & new_path) = 0; |
There was a problem hiding this comment.
How we can implement an atomic move (at least without full copy) of file for example over S3 storage? MergeTree codebase relies on this behavior.
There was a problem hiding this comment.
Unfortunately S3 API doesn't support move or rename. The best we can do is to implement moveTo using copy and delete.
In MergeTree I think we can check capabilities of underlying storage (disk implementation) and do things accordingly:
if (disk->support_atomic_move())
// algorithm 1
else
// algorithm 2
There was a problem hiding this comment.
Atomic rename is crucial for MergeTree to work correctly. It cannot be emulated with copy alone, probably - with copy and atomic switch. But this make things complicated.
|
And of course, this is very small step to the described task. MergeTree codebase havily relies on local filesystem and the proposed idea seems very challenging. |
|
We can store directories and files on local filesystem, but instead of file contents, store S3 URL along with refcount inside them, and store actual content in S3 (we can use arbitrary path in S3, even generated UUID as file name). Advantages:
Disadvantages:
To use ReplicatedMergeTree, we need to apply some tricks for replication to only send references over the network, instead of actual data. Difficult to make refcounts work. This can be made under the same FS abstraction layer. But it will actually abstract only file contents. Metadata operations can be further abstracted by using another data structure instead of FS. We can even store metadata in ZK, but this will require careful synchronization + notifications, that differs from both MergeTree and ReplicatedMergeTree. |
|
This code is quite isolated, so we can merge it. But I think |
+ file size. |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (up to few sentences, required except for Non-significant/Documentation categories):
Filesystem abstraction layer - supporting changes for running ClickHouse over S3 / HDFS.
Detailed description (optional):
The idea is to abstract interaction with filesystem to be able to run ClickHouse on top of backends other than POSIX-compatible filesystem. First of all it’s going to add support for S3 and HDFS backends.
The given PR contains the abstraction itself and its implementation for local filesystem backend. Integration of the abstraction with existing code base as well as implementation of other backends are going to be covered by follow up PRs.
The abstraction actually is not a new one. Instead the existing concept of Disk was refactored and its API was extended with stuff for dealing with files and directories. The class itself is now an interface - IDisk, and its concrete implementations represent different backends: DiskLocal, DiskS3 and DiskHDFS (the last two is under development). As a result the new functionality could be used together with storage policies. For example, we could create a storage policy where “hot” data is stored on local disks and S3 is used for “cold” data.