Parquet metadata cache max-size setting is semantically broken (entries counted as weight 1, not bytes) #114

@Selfeer

Summary

The input_format_parquet_metadata_cache_max_size server setting is documented and intended as a byte limit for the Parquet file metadata cache, but the implementation assigns every entry the default weight of 1, so the limit is enforced as an entry count rather than a byte count. As a result, the cache can grow far beyond the configured "maximum size," leading to effectively unbounded memory use, OOM risk, and availability loss when querying many unique Parquet objects.

Source: Audit of PR #1385 – Antalya 26.1 – Forward port of parquet metadata caching.


Impact

  • Correctness/availability: High — resource exhaustion and potential process crash.
  • Likelihood: Realistic for object-storage workloads with high file cardinality.
  • Blast radius: Process-wide (global singleton cache).
  • Exploitability: Operationally easy (querying many unique Parquet objects is normal usage).

Root cause

  • Product contract: Setting is described as "Maximum size of parquet file metadata cache" and defaults to 500000000 (bytes).
  • Implementation: ParquetFileMetaDataCache extends CacheBase with the default policy, which uses EqualWeightFunction — every cache entry has weight 1, so the limit is enforced as number of entries, not bytes.
  • Result: With many unique path:etag keys and non-trivial metadata per file, the cache can retain far more than the intended byte budget because eviction is driven by entry count, not memory size.
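
The difference between the two weighting schemes can be illustrated with a toy model. The sketch below is not the ClickHouse code: ToyLruCache, FakeMetadata, EqualWeight, ByteWeight and retainedAfter are hypothetical names invented for this example, and the eviction policy is a plain LRU chosen only for simplicity. It shows how the same numeric limit yields very different retained memory depending on whether an entry weighs 1 or weighs its size in bytes.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for parsed Parquet file metadata; only its size matters here.
struct FakeMetadata
{
    std::size_t bytes;
};

// Models the behaviour described in this issue: every entry weighs 1.
struct EqualWeight
{
    std::size_t operator()(const FakeMetadata &) const { return 1; }
};

// Models what a byte-denominated limit would need: weight equals memory footprint.
struct ByteWeight
{
    std::size_t operator()(const FakeMetadata & m) const { return m.bytes; }
};

// Toy LRU cache: evicts least recently inserted entries while total weight exceeds max_weight.
template <typename Weight>
class ToyLruCache
{
public:
    explicit ToyLruCache(std::size_t max_weight_) : max_weight(max_weight_) {}

    void set(const std::string & key, FakeMetadata value)
    {
        // Replace an existing entry with the same key, if any.
        auto it = index.find(key);
        if (it != index.end())
        {
            total_weight -= Weight{}(it->second->second);
            total_bytes -= it->second->second.bytes;
            entries.erase(it->second);
            index.erase(it);
        }

        entries.emplace_front(key, value);
        index[key] = entries.begin();
        total_weight += Weight{}(value);
        total_bytes += value.bytes;

        // Eviction is driven by total weight only, never by total_bytes.
        while (total_weight > max_weight && !entries.empty())
        {
            const auto & victim = entries.back();
            total_weight -= Weight{}(victim.second);
            total_bytes -= victim.second.bytes;
            index.erase(victim.first);
            entries.pop_back();
        }
    }

    std::size_t retainedBytes() const { return total_bytes; }

private:
    using Entry = std::pair<std::string, FakeMetadata>;

    std::size_t max_weight;
    std::size_t total_weight = 0;
    std::size_t total_bytes = 0;
    std::list<Entry> entries;
    std::unordered_map<std::string, std::list<Entry>::iterator> index;
};

// Inserts `files` unique path:etag-style keys, each carrying `bytes_per_file` of metadata,
// and returns how many bytes the cache still holds afterwards.
template <typename Weight>
std::size_t retainedAfter(std::size_t limit, std::size_t files, std::size_t bytes_per_file)
{
    ToyLruCache<Weight> cache(limit);
    for (std::size_t i = 0; i < files; ++i)
        cache.set("path" + std::to_string(i) + ":etag", FakeMetadata{bytes_per_file});
    return cache.retainedBytes();
}
```

With a limit of 1000, 100 unique files and 100 bytes of metadata each, retainedAfter<EqualWeight>(1000, 100, 100) returns 10000 (nothing is evicted, ten times the nominal limit), while retainedAfter<ByteWeight>(1000, 100, 100) returns 1000.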

Affected code / anchors

| Location | Relevance |
| --- | --- |
| src/Processors/Formats/Impl/ParquetFileMetaDataCache.h | Cache class inherits from CacheBase with default (entry-count) weight |
| src/Common/ICachePolicy.h | EqualWeightFunction returns 1 for any entry |
| src/Core/ServerSettings.cpp | input_format_parquet_metadata_cache_max_size declared as bytes (e.g. 500000000) |

Minimal reproduction

  1. Set input_format_parquet_use_metadata_cache=1 and leave the server default for input_format_parquet_metadata_cache_max_size (500000000).
  2. Query Parquet files in object storage with many unique keys (distinct path/etag).
  3. Observe: Cache admission continues well past the intended byte budget because each entry contributes weight 1, so the "max size" is effectively a cap on entry count, not memory.
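
To get a feel for how many unique objects step 2 needs, the arithmetic can be written out as below. This is a rough sketch: the 100000-byte average metadata footprint is an assumed figure for illustration only (real footprints depend on schema width and row-group count), and the helper names are hypothetical.

```cpp
#include <cstdint>

// Default setting value quoted in this issue; enforced as an entry count, intended as bytes.
constexpr std::uint64_t configured_limit = 500000000;

// Number of unique entries after which retained metadata exceeds the intended byte budget,
// given an average metadata footprint per file.
constexpr std::uint64_t entriesToExceedBudget(std::uint64_t limit_bytes, std::uint64_t avg_entry_bytes)
{
    return limit_bytes / avg_entry_bytes + 1;
}

// Upper bound on retained bytes when the same limit is enforced as an entry count.
constexpr std::uint64_t worstCaseRetainedBytes(std::uint64_t limit_entries, std::uint64_t avg_entry_bytes)
{
    return limit_entries * avg_entry_bytes;
}

// Assumed average of 100000 bytes of metadata per file (illustrative, not measured).
static_assert(entriesToExceedBudget(configured_limit, 100000) == 5001,
              "5001 unique files already exceed the 500000000-byte budget");
static_assert(worstCaseRetainedBytes(configured_limit, 100000) == 50000000000000ULL,
              "the entry-count cap would only engage at roughly 50 TB of metadata");
```

Under that assumption the intended budget is exceeded after about five thousand unique files, while the entry-count cap would not trigger eviction until 500000000 entries, long after a typical host has run out of memory.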

Affected transitions / subsystems

  • Transitions: T2 (cache insert), T5 (configured-size enforcement at server start).
  • Subsystem: Parquet object-storage read path; both Parquet and ParquetMetadata input formats; all server threads using the global cache.

Labels / metadata suggestions

  • Severity: High
  • Component: Parquet / Formats / Caching
  • Type: Bug (correctness / resource contract)
