Parquet metadata cache v2 #98140
Conversation
…dge case where we have the metadata cache turned on, but we are not using the native v3 reader
src/Formats/FormatFactory.cpp
Outdated
size_t max_parsing_threads = parser_shared_resources->getParsingThreadsPerReader();
bool parallel_parsing = max_parsing_threads > 1 && settings[Setting::input_format_parallel_parsing]
-    && creators.file_segmentation_engine_creator && !creators.random_access_input_creator && !need_only_count;
+    && creators.file_segmentation_engine_creator && !(creators.random_access_input_creator && creators.random_access_input_creator_with_metadata)
Let's replace creators.random_access_input_creator_with_metadata with (creators.random_access_input_creator_with_metadata && metadata.has_value())
I see, we could make this check stricter, like this:
bool parallel_parsing = max_parsing_threads > 1
&& settings[Setting::input_format_parallel_parsing]
&& creators.file_segmentation_engine_creator
&& !(creators.random_access_input_creator
&& creators.random_access_input_creator_with_metadata
&& metadata.has_value())
&& !need_only_count;
We are testing to see if we should use parallel parsing. I'm not sure that the presence or absence of object metadata makes that decision any clearer or more correct
LOG_TRACE(lambda_logger, "using arrow reader in ParquetBlockInputFormat without metadata cache");
return std::make_shared<ParquetBlockInputFormat>(
    buf,
    std::make_shared<const Block>(sample),
    settings,
    std::move(parser_shared_resources),
    std::move(format_filter_info),
-    min_bytes_for_seek);
+    min_bytes_for_seek
The problem is not resolved by removing the logical error. We still call the function that doesn't use object metadata in registerRandomAccessInputFormatWithMetadata. It is redundant; we have the same code below. We should still throw a logical error if we end up in this branch.
The problem is presumably that here you use settings.parquet.use_native_reader_v3, while in the code above you use context_->getSettingsRef()[Setting::input_format_parquet_use_native_reader_v3].
Revert #89750 (parquet footer cache), which caused a LOGICAL_ERROR exception when reading Parquet from S3 with the default settings (use_native_reader_v3 = false). The metadata-aware format creator only handled the use_native_reader_v3 path and threw a LOGICAL_ERROR on the default Arrow reader path, triggered whenever S3 storage provided blob metadata.
I thought Alexey reverted this because we need a usable format creator in the case that the parquet metadata cache is turned on, but we don't have object store metadata. That makes sense to me. If, for some reason, we don't get object store metadata in the response, we still need to read the parquet file, and thus, we need some kind of format creator for reading.
I suppose Alexey reverted it just because we encountered the logical error. Why can we have the metadata cache without object_metadata? It seems redundant, because the cache is unusable without metadata.
Yes, it was only due to a logical error.
There is a really strange diff in SettingsChangesHistory; my bad that I overlooked it, I will try to fix it.
…check more restrictive, and setting previous value for use_parquet_metadata_cache to false
size_t ParquetMetadataCacheKeyHash::operator()(const ParquetMetadataCacheKey & key) const
{
    return std::hash<String>{}(key.file_path) ^ std::hash<String>{}(key.etag);
}
While it's not too wrong for now, the code smells of bad practice.
See slide 11 here: https://presentations.clickhouse.com/2017-hash_tables/?full#11
Also, we shouldn't be using std::hash. Use methods from Hash.h
…and improving hash calculation of cache keys
    ProfileEvents::increment(ProfileEvents::ParquetMetadataCacheHits);
}
return result.first->metadata;
}
Cache returns large metadata struct by value
Medium Severity
getOrSetMetadata returns parquet::format::FileMetaData by value, causing a full deep copy of the thrift metadata struct on every cache hit. For files with many row groups and columns, this metadata can be substantial. The whole point of the cache is to avoid re-downloading, but the copy overhead on each access partially defeats the purpose.
I guess this is true, but it seems like a pretty large refactor. Maybe we could attempt this in another PR? The part that gives me pause is here https://github.com/ClickHouse/ClickHouse/pull/98140/changes/BASE..b8ba83a5537d6c235d33919156cee5868f93fba6#diff-0b5a953b45a537794f28b2c4c4822e1f884b496c68fc5d05697d512958dd2791R101 in the native reader's initializeIfNeeded:
{
std::lock_guard lock(reader_mutex);
reader.emplace();
reader->reader.prefetcher.init(in, read_options, parser_shared_resources);
reader->reader.file_metadata = getFileMetadata(reader->reader.prefetcher);
reader->reader.init(read_options, getPort().getHeader(), format_filter_info);
reader->init(parser_shared_resources, buckets_to_read ? std::optional(buckets_to_read->row_group_ids) : std::nullopt);
}
We are assigning the file_metadata from our cache lookup. I think we would need to refactor the reader to deal with a pointer as well.
    min_bytes_for_seek);
throw Exception(
    ErrorCodes::LOGICAL_ERROR,
    "Implementation of ParquetBlockInputFormat using arrow reader didn't require blob metadata for initialization");
Metadata-aware creator throws for non-v3 Parquet reader
Medium Severity
The random_access_input_creator_with_metadata lambda for Parquet throws LOGICAL_ERROR when use_native_reader_v3 is false. In getInputImpl, this creator is unconditionally preferred whenever object_with_metadata has a value, with no check on the reader version. The current single call site in StorageObjectStorageSource guards against this, but the public getInputWithMetadata API has no such protection, making any future caller that passes metadata for non-v3 Parquet reads crash.
if (settings.parquet.use_native_reader_v3)
{
    LOG_TRACE(lambda_logger, "using native reader v3 in ParquetBlockInputFormat with metadata cache");
    ParquetMetadataCachePtr metadata_cache = CurrentThread::getQueryContext()->getParquetMetadataCache();
It is not guaranteed that there will be a query context: https://pastila.nl/?001a755c/6964bb8d5180dc5249a29874fd0605f2#KuZwHaHJq+YWh3d8naQn0Q==GCM
And it crashes.
…t_metadata_cache_v2 Parquet metadata cache v2
…data_cache_261 Antalya 26.1 Backport of ClickHouse#98140, ClickHouse#99230, ClickHouse#99231 and ClickHouse#96545 - Parquet metadata cache (upstream impl) and arrow library version bump


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added a new SLRU cache for Parquet metadata to improve read performance by removing the need to re-download files just to read metadata.
Resolves #89102.
Documentation entry for user-facing changes
This pull request adds a new SLRU cache for parquet file metadata. When a file is read for the first time, we capture the file metadata, which includes the schema information. On subsequent reads, we can look the file up in the cache and access its metadata directly. We use the filename and etag from the cloud storage response as cache keys, so we can cache updated schemas for the same file.
The cache is enabled with use_parquet_metadata_cache=1 and can be dropped with SYSTEM CLEAR PARQUET METADATA CACHE. Dropping a table backed by a parquet file will not drop the corresponding cache entry. All cache invalidation is handled by SLRU in the parent class CacheBase.h. Even if the cache is enabled, it only works with the native v3 parquet reader.
Note
Medium Risk
Introduces a new global cache and threads it through Parquet read paths (especially object storage), which can affect memory usage and correctness of metadata invalidation (etag-keyed) as well as format factory selection logic.
Overview
Adds a new global Parquet metadata SLRU cache (defaults + server settings) and wires it into Parquet native v3 reads so metadata can be reused instead of re-reading file footers, keyed by object path + etag.
Extends FormatFactory with metadata-aware random-access creators (getInputWithMetadata/registerRandomAccessInputFormatWithMetadata) and updates object-storage reads to pass object metadata when use_parquet_metadata_cache=1 (and the v3 reader is enabled).
Adds observability/ops hooks: new ProfileEvents/CurrentMetrics, new privilege + parser support for SYSTEM DROP PARQUET METADATA CACHE, and documentation/tests covering cache hit/miss and disabling in existing S3 Parquet tests.
Written by Cursor Bugbot for commit 9ae55b9.