Description
Is your feature request related to a problem or challenge?
- Part of [Epic] Enable parquet metadata cache by default #17000
@nuno-faria implemented the core Parquet metadata caching logic in the following PR: feat: Cache Parquet metadata in built in parquet reader #16971
However, it doesn't seem to help certain queries that use statistics. Specifically, I expect that the second time the query is run it should do no network I/O at all, because the ParquetMetadata is already cached:
> set datafusion.execution.parquet.cache_metadata = true;
0 row(s) fetched.
Elapsed 0.000 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 4.632 seconds.
> select count(*) from 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/';
+----------+
| count(*) |
+----------+
| 99997497 |
+----------+
1 row(s) fetched.
Elapsed 2.717 seconds.
Describe the solution you'd like
I would like the queries above to go faster by using the ParquetMetaData cache.
Describe alternatives you've considered
I think this is related to the fact that there is a separate path to retrieve statistics for ListingTable, specifically https://github.com/apache/datafusion/blob/1452333cf0933d4d8da032af68bc5a3a05c62483/datafusion/datasource-parquet/src/file_format.rs#L975-L974
So to fix this issue, I think we need to check the FileMetadataCache first, before actually fetching any ParquetMetadata.
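To illustrate the idea, here is a minimal, self-contained sketch of the check-the-cache-first pattern. It is not DataFusion code: `FileMetadata`, `MetadataCache`, and `metadata_for_statistics` are hypothetical stand-ins made up for this example, and the `fetch` closure stands in for whatever network read the statistics path does today. It also ignores cache invalidation (for example, when a file changes), which a real fix would need to consider.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
struct FileMetadata {
    num_rows: u64,
}

/// Hypothetical stand-in for a metadata cache keyed by object path.
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Returns the cached metadata for `path`, if any.
    fn get(&self, path: &str) -> Option<Arc<FileMetadata>> {
        self.entries.lock().unwrap().get(path).cloned()
    }

    /// Stores metadata for `path`, replacing any previous entry.
    fn put(&self, path: &str, meta: Arc<FileMetadata>) {
        self.entries.lock().unwrap().insert(path.to_string(), meta);
    }
}

/// Returns metadata for `path`, consulting the cache before calling `fetch`.
/// `fetch` represents the expensive remote read and only runs on a cache miss.
fn metadata_for_statistics<F>(cache: &MetadataCache, path: &str, fetch: F) -> Arc<FileMetadata>
where
    F: FnOnce() -> FileMetadata,
{
    if let Some(hit) = cache.get(path) {
        // Cache hit: no fetch, so no network access.
        return hit;
    }
    // Cache miss: fetch once, then remember the result for later callers.
    let meta = Arc::new(fetch());
    cache.put(path, Arc::clone(&meta));
    meta
}
```

With something like this in the statistics path, the second `count(*)` over the same files would take the cache-hit branch for every file and skip the remote read.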
Additional context
No response