Skip to content

Parquet footer cache#89750

Merged
divanik merged 138 commits intoClickHouse:masterfrom
grantholly-clickhouse:parquet_footer_cache
Feb 10, 2026
Merged

Parquet footer cache#89750
divanik merged 138 commits intoClickHouse:masterfrom
grantholly-clickhouse:parquet_footer_cache

Conversation

@grantholly-clickhouse
Copy link
Copy Markdown
Contributor

@grantholly-clickhouse grantholly-clickhouse commented Nov 7, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a new SLRU cache for Parquet metadata to improve reading performance by removing the need to re-download files just to read metadata. The cache can be dropped with SYSTEM DROP PARQUET METADATA CACHE. Resolves #89102.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Details

This adds a new SLRU cache for parquet metadata (#89102). This aims to improve parquet reading performance by removing the need to re-download a file just to read the metadata. I tried to closely follow the pattern of the Iceberg file metadata cache as was suggested in the related issue.

All the configuration for both caches falls under parquet_metadata_cache_*.

…and factoring out extractFilePathAndETagFromReadBuffer
if (metadata_cache && metadata.has_value())
{
String file_name = metadata->getFileName();
String file_name = metadata->getPath();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the important difference here?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full path on a blob_storage opposed to a last file part (after the last slash) in a file path. We can have potential problems with files which have one file name but lay in different folders

std::lock_guard lock(reader_mutex);
reader.emplace();
reader->reader.prefetcher.init(in, read_options, parser_shared_resources);
reader->reader.file_metadata = getFileMetadata(reader->reader.prefetcher);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this line of code was added (I refactored the block of code which was done before, so I don't know).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That line is correct. This is where we need to look up the file metadata.

@divanik
Copy link
Copy Markdown
Member

divanik commented Feb 9, 2026

The PR seems ok to merge, but there is some strange stuff with CI, creating new commit and restarting CI not to spend time on dealing with CI infrastructure

@divanik divanik force-pushed the parquet_footer_cache branch from b4886cc to d4179cb Compare February 9, 2026 12:28
@divanik
Copy link
Copy Markdown
Member

divanik commented Feb 9, 2026

Sorry, I merged dirty master to the PR, already fixed the problem

Copy link
Copy Markdown
Member

@divanik divanik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small question if the change is related

@divanik divanik added this pull request to the merge queue Feb 10, 2026
Merged via the queue into ClickHouse:master with commit 5243ef7 Feb 10, 2026
127 of 134 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 10, 2026
alexey-milovidov added a commit that referenced this pull request Feb 13, 2026
…footer_cache"

This reverts commit 5243ef7, reversing
changes made to 6e3a5ad.
alexey-milovidov added a commit that referenced this pull request Feb 13, 2026
Revert "Merge pull request #89750 - parquet_footer_cache"
grantholly-clickhouse added a commit to grantholly-clickhouse/ClickHouse that referenced this pull request Feb 13, 2026
…se/parquet_footer_cache"

This reverts commit 223e53c.
@divanik
Copy link
Copy Markdown
Member

divanik commented Feb 13, 2026

The problem is clear: firstly we took info about use_parquet_v3_reader from the context, after that we took it from the format_settings. For some reason we have different values in two different sources (that's obviously mess, but out of the scope of this PR). @grantholly-clickhouse, can you try to reproduce the problem, adding a test, fixing and opening the new PR (I suggest not dealing with rerevert, it can be dangerous)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Cache Parquet footer in LRU Cache

5 participants