A cache for data lake's metadata #71579

@alexey-milovidov

Description

Use case

Iceberg, DeltaLake, and Hudi tables require reading and processing metadata files (such as manifests) before reading the actual data files. The proposal is to allow caching of this metadata for a configurable amount of time.

Describe the solution you'd like

Before implementing this task, provide a motivating example showing how much time is spent reading and parsing metadata, for example, by loading Wikistat into Iceberg.

https://clickhouse.com/docs/en/getting-started/example-datasets/wikistat
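One way to obtain such a measurement (a sketch; the bucket path and credentials below are placeholders, not part of the issue) is to run the same query over the Iceberg table twice and compare the cold and warm runs — with the proposed cache, the second run should skip re-reading and re-parsing the manifest files:

```sql
-- Hypothetical example: the S3 path and credentials are placeholders.
-- Run twice and compare timings; today, both runs pay the full cost of
-- fetching and parsing Iceberg manifests before touching data files.
SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/wikistat/', 'KEY', 'SECRET');
```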

An in-memory cache (it does not persist across restarts, and it is not shared between servers) of a limited, configurable size. The cache can contain raw file contents or parsed representations (an implementation detail).

The cache is keyed by table location. We do not perform HEAD requests to check for changes before using the cache, which means it is invalidated only by time and can become incorrect (for example, if old manifest files are deleted, a query still using the cache will return an error).

Most details of the implementation are made consistent with the query cache.

Add server configuration:

    <datalake_metadata_cache>
        <max_size_in_bytes>1073741824</max_size_in_bytes>
        <max_entries>1024</max_entries>
        <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
    </datalake_metadata_cache>

defining the global limits.

Add query-level settings:

use_datalake_metadata_cache - a switch to turn all usage on/off;
enable_writes_to_datalake_metadata_cache - if the query has read metadata, whether it should save that metadata in the cache;
enable_reads_from_datalake_metadata_cache - whether the query should look up and use an entry in the cache if it exists;
datalake_metadata_cache_ttl - do not use entries found in the metadata cache if they are older than the specified threshold in seconds.

Entries are not actively deleted after the TTL expires. Instead, each entry is checked for staleness when it is used.
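Assuming the settings are implemented as proposed, a query could opt in like this (a sketch; the table function argument is a placeholder, and none of these settings exist yet):

```sql
-- Proposed usage, not an existing feature; the S3 path is a placeholder.
SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/wikistat/')
SETTINGS
    use_datalake_metadata_cache = 1,
    enable_writes_to_datalake_metadata_cache = 1,
    enable_reads_from_datalake_metadata_cache = 1,
    datalake_metadata_cache_ttl = 300;  -- reject entries older than 5 minutes
```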

Add a system command:

SYSTEM DROP DATALAKE METADATA CACHE

Add a system table, system.datalake_metadata_cache, for introspection.
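For example (a sketch of the proposed interface; the system table's columns are an implementation detail, not fixed by this issue):

```sql
-- Inspect cached entries; exact columns are up to the implementation.
SELECT *
FROM system.datalake_metadata_cache;

-- Evict everything, e.g. after old manifest files have been deleted.
SYSTEM DROP DATALAKE METADATA CACHE;
```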

Add a couple of ProfileEvents and CurrentMetrics similar to the query cache.

Additional context

Currently, we skip features such as separation between users and cache tags.
