Parquet metadata cache max-size setting is semantically broken (entries counted as weight 1, not bytes) #114
Summary
The input_format_parquet_metadata_cache_max_size server setting is documented and intended as a byte limit for the Parquet file metadata cache, but the implementation uses the default per-entry weight of 1 (entry count), not bytes. As a result, the cache can grow far beyond the configured "maximum size," leading to unbounded memory use, OOM risk, and availability loss when querying many unique Parquet objects.
Source: Audit of PR #1385 – Antalya 26.1 – Forward port of parquet metadata caching.
Impact
- Correctness/availability: High — resource exhaustion and potential process crash.
- Likelihood: Realistic for object-storage workloads with high file cardinality.
- Blast radius: Process-wide (global singleton cache).
- Exploitability: Operationally easy (querying many unique Parquet objects is normal usage).
Root cause
- Product contract: The setting is described as "Maximum size of parquet file metadata cache" and defaults to `500000000` (bytes).
- Implementation: `ParquetFileMetaDataCache` extends `CacheBase` with the default policy, which uses `EqualWeightFunction` — every cache entry has weight 1, so the limit is enforced as a number of entries, not bytes.
- Result: With many unique `path:etag` keys and non-trivial metadata per file, the cache can retain far more than the intended byte budget, because eviction is driven by entry count, not memory size.
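To make the count-vs-bytes mismatch concrete, here is a minimal, self-contained sketch of a weight-function-driven LRU cache (it is not ClickHouse's actual `CacheBase`, just an illustration of the same delegation pattern). With the equal-weight policy, a "1000-byte" budget admits 100 entries of 100 bytes each; with a byte-aware policy, the same budget holds only 10 such entries:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative LRU cache: the eviction budget is interpreted through a
// pluggable weight function, mirroring how CacheBase delegates weighting
// to its policy. Not the ClickHouse implementation.
template <typename Key, typename Value, typename WeightFn>
class LruCache
{
public:
    explicit LruCache(size_t max_weight) : max_weight_(max_weight) {}

    void set(const Key & key, Value value)
    {
        auto it = map_.find(key);
        if (it != map_.end())
        {
            current_weight_ -= WeightFn{}(it->second->second);
            order_.erase(it->second);
            map_.erase(it);
        }
        order_.emplace_front(key, std::move(value));
        map_[key] = order_.begin();
        current_weight_ += WeightFn{}(order_.front().second);

        // Evict least-recently-used entries until the total weight fits.
        while (current_weight_ > max_weight_ && !order_.empty())
        {
            auto & back = order_.back();
            current_weight_ -= WeightFn{}(back.second);
            map_.erase(back.first);
            order_.pop_back();
        }
    }

    size_t size() const { return map_.size(); }
    size_t weight() const { return current_weight_; }

private:
    size_t max_weight_;
    size_t current_weight_ = 0;
    std::list<std::pair<Key, Value>> order_;
    std::unordered_map<Key, typename std::list<std::pair<Key, Value>>::iterator> map_;
};

// Default-style policy: every entry weighs 1, so max_weight caps entry count.
struct EqualWeight { size_t operator()(const std::string &) const { return 1; } };

// Byte-aware policy: an entry weighs its payload size, so max_weight caps bytes.
struct ByteWeight { size_t operator()(const std::string & v) const { return v.size(); } };
```

Under the equal-weight policy, a budget meant as bytes only bounds the number of entries, so resident memory scales with entry count times per-entry metadata size — exactly the failure mode described above.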
Affected code / anchors
| Location | Relevance |
|---|---|
| `src/Processors/Formats/Impl/ParquetFileMetaDataCache.h` | Cache class inherits from `CacheBase` with the default (entry-count) weight |
| `src/Common/ICachePolicy.h` | `EqualWeightFunction` returns 1 for any entry |
| `src/Core/ServerSettings.cpp` | `input_format_parquet_metadata_cache_max_size` declared in bytes (e.g. `500000000`) |
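The natural direction for a fix is to supply a weight function that reports each entry's byte footprint instead of the constant 1. The sketch below is hypothetical: `FileMetaDataStub` and its fields stand in for the real cached object (Arrow's `parquet::FileMetaData`), whose exact accounting API would need to be checked:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the cached Parquet metadata object; the real
// cache stores Arrow parquet::FileMetaData. Field names are illustrative.
struct FileMetaDataStub
{
    std::string thrift_blob;               // serialized footer bytes
    std::vector<std::string> column_paths; // per-column schema strings

    // Approximate resident size: the struct itself plus owned heap buffers.
    size_t approximateByteSize() const
    {
        size_t total = sizeof(*this) + thrift_blob.size();
        for (const auto & p : column_paths)
            total += p.size();
        return total;
    }
};

// Sketch of the fix: a weight function returning bytes rather than the
// default constant 1, so the cache's max-size limit becomes a byte budget.
struct FileMetaDataWeightFunction
{
    size_t operator()(const FileMetaDataStub & metadata) const
    {
        return metadata.approximateByteSize();
    }
};
```

Plugging such a weight function into the cache policy would make eviction track actual memory use, restoring the documented meaning of `input_format_parquet_metadata_cache_max_size`.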
Minimal reproduction
- Set `input_format_parquet_use_metadata_cache = 1` and leave the server default for `input_format_parquet_metadata_cache_max_size` (`500000000`).
- Query Parquet files in object storage with many unique keys (distinct path/etag pairs).
- Observe: cache admission continues well past the intended byte budget, because each entry contributes weight 1 — the "max size" is effectively a cap on entry count, not memory.
Affected transitions / subsystems
- Transitions: T2 (cache insert), T5 (configured-size enforcement at server start).
- Subsystem: Parquet object-storage read path; both `Parquet` and `ParquetMetadata` input formats; all server threads using the global cache.
Labels / metadata suggestions
- Severity: High
- Component: Parquet / Formats / Caching
- Type: Bug (correctness / resource contract)