Setting: max_readers for S3/url/hdfs cluster table engines #52437
Description
Use case
Increase the maximum number of readers used for reading files from the s3 (and s3Cluster) table engine and similar. When running an s3Cluster SELECT over 96 files on a cluster with 96 cores, I observe only 50-something S3 readers being used (see https://clickhousedb.slack.com/archives/CU478UEQZ/p1689859840124579 for more context, including the full query output):
Describe the solution you'd like
A max_readers setting that would be propagated to the nodes (or applied on the single node for s3()) to cap the number of readers used to pull S3 files per node. For example:
SELECT count() FROM s3Cluster('{cluster}', 'https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata/data-*.csv.gz', 'CSVWithNames',
'pickup_date Date, id UInt64, vendor_id String, tpep_pickup_datetime DateTime, tpep_dropoff_datetime DateTime, passenger_count UInt8, trip_distance Float32, pickup_longitude Float32, pickup_latitude Float32, rate_code_id String, store_and_fwd_flag String, dropoff_longitude Float32, dropoff_latitude Float32, payment_type LowCardinality(String), fare_amount Float32, extra String, mta_tax Float32, tip_amount Float32, tolls_amount Float32, improvement_surcharge Float32, total_amount Float32, pickup_location_id UInt16, dropoff_location_id UInt16, junk1 String, junk2 String',
'gzip')
SETTINGS max_readers = 16
This would ensure I get up to 16 readers per node. Alternatively, the setting could be an aggregate across the cluster, in which case I would set it to 96 for a 6x16 vCPU cluster.
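For the single-node case mentioned above, a query using the proposed setting with the plain s3() table function might look like the sketch below. Note that max_readers does not exist yet; this is only the proposed syntax, and the bucket path is reused from the example above for illustration:

```sql
-- Hypothetical syntax: max_readers is the proposed (not yet implemented) setting
-- capping the number of concurrent S3 readers on this node.
SELECT count()
FROM s3('https://s3.us-east-1.amazonaws.com/altinity-clickhouse-data/nyc_taxi_rides/data/tripdata/data-*.csv.gz',
        'CSVWithNames')
SETTINGS max_readers = 16
```

With the glob matching many files, the node would schedule at most 16 readers at a time instead of whatever internal limit applies today.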
Describe alternatives you've considered
None; it appears there is a hard-coded limit that caps read parallelism and disrupts performance :(
Additional context