Description
A directory on the local filesystem can be set up to hold a cache for a remote filesystem, up to a specified size.
Proposed implementation
It contains immutable, write-once files with paths like:
cache_path/012/0123456789abcdef0123456789abcdef/111111_last
Where:
cache_path - configured location;
0123456789abcdef0123456789abcdef - SipHash128 of the file path on the remote filesystem (as seen by the user of the IDisk interface);
012 - first three hex digits of the hash;
111111 - start offset of the cached segment in the file, in decimal;
_last - present if this was the last segment in the file at the time it was read, so we know that the file ends there.
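The path layout above can be sketched as follows. This is an illustrative Python sketch, not the actual implementation; `hashlib.md5` stands in for SipHash128 only because both produce a 128-bit digest (32 hex characters).

```python
import hashlib

def cache_entry_path(cache_path: str, remote_path: str,
                     offset: int, is_last: bool) -> str:
    """Build the on-disk location of one cached segment.

    md5 is a stand-in for SipHash128: both give a 128-bit digest,
    i.e. 32 hex characters.
    """
    file_hash = hashlib.md5(remote_path.encode()).hexdigest()  # 32 hex chars
    prefix = file_hash[:3]               # first three hex digits of the hash
    name = str(offset)                   # start offset in decimal
    if is_last:
        name += "_last"                  # marks the final segment of the file
    return f"{cache_path}/{prefix}/{file_hash}/{name}"
```

For example, `cache_entry_path("/var/cache", "store/data.bin", 111111, True)` produces a path of the shape `cache_path/012/0123456789abcdef0123456789abcdef/111111_last`.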
The cache acts as an adaptor over the remote filesystem and adapts the following methods:
- read - when a file is read, the contiguous read segments are also saved in the cache, and overlapping segments that are already in the cache are reused;
- write (optionally) - when a file is written, it is also saved in the cache;
- append - mark the last file segment as not last; then proceed as for write;
- remove - when a file is deleted, it is removed from the cache;
- rename and hardlink - to keep cached data usable when a path changes, we apply the corresponding renames and hardlinks in the cache.
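The adaptor pattern can be sketched as below. This is a toy Python model, not the real IDisk interface: the method names and the in-memory dict keyed by (path, offset) are assumptions for illustration; the real cache stores segments as files.

```python
class CachedDisk:
    """Toy adaptor: wraps a backing 'disk' object exposing
    read/write/remove/rename (names assumed) and mirrors the
    operations into a segment cache."""

    def __init__(self, disk, cache):
        self.disk = disk
        self.cache = cache            # maps (path, offset) -> bytes

    def read(self, path, offset, size):
        data = self.cache.get((path, offset))
        if data is None:
            data = self.disk.read(path, offset, size)
            self.cache[(path, offset)] = data   # save the segment we just read
        return data

    def remove(self, path):
        self.disk.remove(path)
        # drop all cached segments of the deleted file
        for key in [k for k in self.cache if k[0] == path]:
            del self.cache[key]

    def rename(self, src, dst):
        self.disk.rename(src, dst)
        # re-key cached segments so the cached data stays usable
        for (p, off) in list(self.cache):
            if p == src:
                self.cache[(dst, off)] = self.cache.pop((p, off))
```

A second read of the same segment is then served from the cache without touching the backing disk, and reads issued after a rename still hit the entries cached under the old path.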
If a file is being read, multiple cache entries can be created:
- if a seek was made and we have read multiple contiguous segments;
- if part of the file was already in the cache but we are reading an overlapping segment: we read the remaining parts and save them as separate cache entries.
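Reusing overlapping cached segments amounts to computing the uncovered sub-ranges of a requested read; each uncovered sub-range is fetched remotely and becomes a separate cache entry. A sketch of that computation (function name and representation are illustrative):

```python
def missing_subranges(cached, begin, end):
    """Given cached segments [(offset, size), ...] and a requested
    half-open range [begin, end), return the sub-ranges not covered
    by the cache. Each returned range would be read remotely and
    stored as a separate cache entry."""
    result = []
    pos = begin
    for off, size in sorted(cached):
        seg_end = off + size
        if seg_end <= pos or off >= end:
            continue                      # no overlap with the request
        if off > pos:
            result.append((pos, off))     # gap before this cached segment
        pos = max(pos, seg_end)
    if pos < end:
        result.append((pos, end))         # tail after the last cached segment
    return result
```

For example, with a cached segment at offset 10 of size 10, a request for [0, 30) yields the two missing ranges [0, 10) and [20, 30).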
The cache can use simple LRU eviction with whole segments as entries, so large segments are cached and evicted as a whole. But we can split large segments prematurely (when writing new cache entries), e.g. at a 16 MiB threshold.
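The premature splitting is straightforward; a sketch (the 16 MiB threshold is the one suggested above):

```python
SPLIT_THRESHOLD = 16 * 1024 * 1024  # 16 MiB

def split_segment(offset: int, size: int):
    """Split a large segment into pieces of at most SPLIT_THRESHOLD
    bytes before writing cache entries, so that eviction granularity
    stays bounded."""
    pieces = []
    while size > 0:
        piece = min(size, SPLIT_THRESHOLD)
        pieces.append((offset, piece))
        offset += piece
        size -= piece
    return pieces
```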
The metadata about the cache is kept in memory: the set of all entries and their last usage times. On server restart, we load the list of files in the cache, and the cache remains usable, but the information about the last usage time per entry is lost. In this case, we should initialize the last usage time to zero, and on cache eviction evict random entries among those with no last-usage information.
The max size of the cache can be set both in bytes (the main use case) and in the number of entries (a safety threshold to bound the memory spent on cache metadata and the number of inodes).
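Putting the eviction rules together: LRU under two limits (bytes and entry count), with random eviction among entries whose last-usage time is unknown after a restart. A minimal in-memory sketch, with names of my own choosing:

```python
import random

class SegmentCache:
    """Sketch of LRU eviction with both a byte limit and an
    entry-count limit. Entries loaded after a restart have
    last_used == 0; among those, eviction picks a random victim."""

    def __init__(self, max_bytes, max_entries):
        self.max_bytes = max_bytes
        self.max_entries = max_entries
        self.entries = {}          # key -> (size, last_used)
        self.total_bytes = 0
        self.clock = 0             # logical time for LRU ordering

    def touch(self, key):
        size, _ = self.entries[key]
        self.clock += 1
        self.entries[key] = (size, self.clock)

    def add(self, key, size):
        self.clock += 1
        self.entries[key] = (size, self.clock)
        self.total_bytes += size
        while (self.total_bytes > self.max_bytes
               or len(self.entries) > self.max_entries):
            self.evict_one()

    def evict_one(self):
        unknown = [k for k, (_, t) in self.entries.items() if t == 0]
        if unknown:
            victim = random.choice(unknown)   # no usage info: pick randomly
        else:
            victim = min(self.entries, key=lambda k: self.entries[k][1])
        size, _ = self.entries.pop(victim)
        self.total_bytes -= size
```

Entries restored on startup are registered with `last_used == 0` and are therefore evicted (in random order) before any entry whose usage has been observed since the restart.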
ReadSettings can be implemented to control cache usage:
- should a query use the cache on reading, if it is available (this setting is mostly for testing and comparison);
- should a query save read data in the cache (e.g. it can be disabled for unimportant long queries);
- should a query save written data in the cache (for INSERT queries);
- thresholds to disable saving data in the cache based on the estimated read/write size;
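The settings above can be modeled as a small struct plus a predicate. All field names here are illustrative, not the actual ReadSettings members:

```python
from dataclasses import dataclass

@dataclass
class CacheReadSettings:
    """Hypothetical per-query settings controlling cache usage."""
    read_from_cache: bool = True        # use cached data if available
    save_reads_to_cache: bool = True    # disable for unimportant long queries
    save_writes_to_cache: bool = False  # for INSERT queries
    max_read_size_to_cache: int = 0     # 0 = no threshold
    max_write_size_to_cache: int = 0

def should_save_read(settings: CacheReadSettings, estimated_size: int) -> bool:
    """Decide whether the data a query reads should be cached."""
    if not settings.save_reads_to_cache:
        return False
    if (settings.max_read_size_to_cache
            and estimated_size > settings.max_read_size_to_cache):
        return False                    # estimated read is too large to cache
    return True
```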
The cache is set up in the config at the VFS (disk) level, maybe as a wrapper over another disk (similarly to the encrypted disk). For example, this allows choosing whether to cache encrypted or non-encrypted data for encrypted filesystems.
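A hypothetical config fragment following the wrapper pattern of the encrypted disk; the `cache` disk type and its element names are placeholders proposed here, not defined by this issue:

```xml
<storage_configuration>
    <disks>
        <s3_disk>
            <type>s3</type>
            <!-- endpoint, credentials, ... -->
        </s3_disk>
        <s3_cached>
            <type>cache</type>                  <!-- hypothetical disk type -->
            <disk>s3_disk</disk>                <!-- wrapped disk, like 'encrypted' -->
            <path>/var/lib/clickhouse/cache/</path>
            <max_size>10737418240</max_size>    <!-- 10 GiB in bytes -->
        </s3_cached>
    </disks>
</storage_configuration>
```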
Notes
- We use the local filesystem for the cache. We expect that the cache will also take advantage of the OS page cache.
- We assume that the files are immutable or (a less important case) mutable but append-only. This appears to work fine.
- The last access time is saved only in memory. Alternatively, we could use atime in the filesystem, but that is overkill.
- It looks unnecessary to write any checksums in the cache, as we already have plenty of checksums at other levels;
- We use the external (logical) filesystem path as the cache key. We could also consider using internal object names as the cache key for object-storage VFS, but the former is more versatile (composable).
- We allow hardlinks. To keep things simple, we can avoid bothering with correctly calculating the size of cache entries - so the actual size will be lower than the calculated one.
- The cache can become fragmented, but we don't care at all (the fragmentation will be moderate and it should work fine).
- The cache may lock concurrent readers of overlapping ranges, so that one reader reads and caches the data and subsequent readers reuse it.
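The read-through locking described in the last note can be sketched with a condition variable. This is a simplified model (one key per range, names assumed); a real implementation would deal with partially overlapping ranges:

```python
import threading

class RangeLocks:
    """Sketch: the first reader of a range fetches and caches it;
    concurrent readers of the same range wait and then reuse the
    cached data instead of fetching it again."""

    def __init__(self):
        self.cond = threading.Condition()
        self.in_progress = set()     # range keys currently being fetched

    def read_through(self, key, cache, fetch):
        with self.cond:
            while key in self.in_progress:
                self.cond.wait()                 # another reader is fetching
            if key in cache:
                return cache[key]                # reuse data cached meanwhile
            self.in_progress.add(key)
        data = fetch()                           # remote read, outside the lock
        with self.cond:
            cache[key] = data
            self.in_progress.discard(key)
            self.cond.notify_all()               # wake readers of this range
        return data
```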
Further directions
- Allow keeping small files in memory, with a separate, smaller cache size limit for in-memory data;
- A setting to limit the max amount of cached data per query;
- Other algorithms instead of LRU (e.g. segmented LRU, exponentially-smoothed LFU), and allow tuning cache importance per query.
- Sometimes it may be slow to write data to the local filesystem while reading from the remote filesystem. We can maintain a separate buffer of write tasks and a background thread to process it, and it can deliberately drop some writes.
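The dropping behavior in the last point can be sketched as a bounded queue (the class and method names are mine, for illustration):

```python
from collections import deque

class CacheWriteQueue:
    """Sketch of a bounded queue of pending cache writes. A background
    thread would drain it; when the queue is full, new writes are
    deliberately dropped so that slow local-disk writes never stall
    reads from the remote filesystem."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = deque()
        self.dropped = 0

    def enqueue(self, key, data):
        if len(self.pending) >= self.max_pending:
            self.dropped += 1        # dropping is fine: it is only a cache
            return False
        self.pending.append((key, data))
        return True

    def drain_one(self, write):
        """Called by the background thread; performs one pending write."""
        if self.pending:
            write(*self.pending.popleft())
```

Dropping is safe precisely because the cache is a pure optimization: a dropped write only means the segment is fetched from the remote filesystem again next time.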