A cache for data lake's metadata #71579

@alexey-milovidov

Description

Use case

Iceberg, DeltaLake, and Hudi tables require reading and processing metadata files (such as manifests) before reading the actual data files. The proposal is to allow caching of this metadata for a configurable amount of time.

Describe the solution you'd like

Before implementing this task, provide a motivating example showing how much time is spent reading and parsing metadata, for example, by loading Wikistat into Iceberg.

https://clickhouse.com/docs/en/getting-started/example-datasets/wikistat
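One way to obtain such a measurement (a sketch; the bucket path and credentials below are placeholders, not part of the issue) is to run the same query over the Iceberg table twice and compare the cold and warm runs — with the proposed cache, the second run should skip re-reading and re-parsing the manifest files:

```sql
-- Hypothetical example: the S3 path and credentials are placeholders.
-- Run twice and compare timings; today, both runs pay the full cost of
-- fetching and parsing Iceberg manifests before touching data files.
SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/wikistat/', 'KEY', 'SECRET');
```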

An in-memory cache (it does not persist across restarts, and it is not shared between servers) of a limited, configurable size. The cache can contain raw file contents or parsed representations (an implementation detail).

The cache is keyed by table location. We do not perform HEAD requests to check for changes before using the cache, which means it is invalidated only by time and can become incorrect (for example, if old manifest files are deleted, a query still using the cache will return an error).

Most details of the implementation are made consistent with the query cache.

Add server configuration:

    <datalake_metadata_cache>
        <max_size_in_bytes>1073741824</max_size_in_bytes>
        <max_entries>1024</max_entries>
        <max_entry_size_in_bytes>1048576</max_entry_size_in_bytes>
    </datalake_metadata_cache>

defining the global limits.

Add query-level settings:

use_datalake_metadata_cache - a switch to turn all usage on/off;
enable_writes_to_datalake_metadata_cache - if the query has read metadata, whether it should save that metadata in the cache;
enable_reads_from_datalake_metadata_cache - whether the query should look up and use an entry in the cache if it exists;
datalake_metadata_cache_ttl - do not use entries found in the metadata cache if they are older than the specified threshold in seconds.

Entries are not actively deleted after the TTL expires. Instead, each entry is checked for staleness when it is used.
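Assuming the settings are implemented as proposed, a query could opt in like this (a sketch; the table function argument is a placeholder, and none of these settings exist yet):

```sql
-- Proposed usage, not an existing feature; the S3 path is a placeholder.
SELECT count()
FROM iceberg('https://my-bucket.s3.amazonaws.com/wikistat/')
SETTINGS
    use_datalake_metadata_cache = 1,
    enable_writes_to_datalake_metadata_cache = 1,
    enable_reads_from_datalake_metadata_cache = 1,
    datalake_metadata_cache_ttl = 300;  -- reject entries older than 5 minutes
```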

Add a system command:

SYSTEM DROP DATALAKE METADATA CACHE

Add a system table, system.datalake_metadata_cache, for introspection.
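For example (a sketch of the proposed interface; the system table's columns are an implementation detail, not fixed by this issue):

```sql
-- Inspect cached entries; exact columns are up to the implementation.
SELECT *
FROM system.datalake_metadata_cache;

-- Evict everything, e.g. after old manifest files have been deleted.
SYSTEM DROP DATALAKE METADATA CACHE;
```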

Add a couple of ProfileEvents and CurrentMetrics similar to the query cache.

Additional context

Currently, we skip features such as separation between users and cache tags.
