-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Queries to S3 having glob patterns takes a long time to complete #53643
Description
Describe the situation
There is a huge difference wrt query response time when querying S3 with or without glob patterns in the S3 URL
How to reproduce
Below is an example query without glob pattern
SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/202308210735/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2
Below is an example query with glob pattern
SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/{202308210735,DUMMY}/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2
Effectively both the queries should be reading or parsing the same number of files ("DUMMY" used the glob pattern is non-existent).
The resulting response time is approximately 1.5s for query without glob pattern.
But the same query with blob pattern is taking around 140s
- Which ClickHouse server version to use
ClickHouse client version 23.7.3.14 (official build).
Expected performance
The response time in both cases should be more or less the same.
Additional context
Data stored in S3 is of parquet format. There are multiple files within 202308210735. As is obvious from the pattern, the data is time partitioned and there will be multiple folders like 202308210740, 202308210745 etc..
Additionally, there are multiple top-level folders as well (eg) Partition1, Partition2 etc..
I did run query analysis of both the queries and from that its quite clear that the glob pattern based query is leading a substantially higher S3ListObject & S3Reads. Below is a snapshot of the comparision
