Skip to content

Queries to S3 having glob patterns takes a long time to complete #53643

@MaheshGPai

Description

@MaheshGPai

Describe the situation
There is a huge difference wrt query response time when querying S3 with or without glob patterns in the S3 URL

How to reproduce
Below is an example query without glob pattern

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/202308210735/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Below is an example query with glob pattern

SELECT Column1, Column2, _path
FROM s3('https://test-s3-bucket.s3.us-west-2.amazonaws.com/Partition1/{202308210735,DUMMY}/*.parquet', <aws_access_key_id>, <aws_secret_access_key>) LIMIT 2

Effectively both the queries should be reading or parsing the same number of files ("DUMMY" used the glob pattern is non-existent).
The resulting response time is approximately 1.5s for query without glob pattern.
But the same query with blob pattern is taking around 140s

  • Which ClickHouse server version to use
    ClickHouse client version 23.7.3.14 (official build).

Expected performance
The response time in both cases should be more or less the same.

Additional context
Data stored in S3 is of parquet format. There are multiple files within 202308210735. As is obvious from the pattern, the data is time partitioned and there will be multiple folders like 202308210740, 202308210745 etc..
Additionally, there are multiple top-level folders as well (eg) Partition1, Partition2 etc..

I did run query analysis of both the queries and from that its quite clear that the glob pattern based query is leading a substantially higher S3ListObject & S3Reads. Below is a snapshot of the comparision

image

Metadata

Metadata

Assignees

No one assigned

    Labels

    comp-object-storageObject storage connectivity (S3/GCS/Azure) including credentials, retries, multipart, etc.performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions