Limiting the number of parsing threads for the S3 source#53668
Conversation
# Conflicts:
#	src/Storages/StorageS3.cpp
#	src/Storages/StorageS3.h
#	src/Storages/StorageURL.cpp
#	src/Storages/StorageURL.h
This is an automated comment for commit e42da94 with a description of existing statuses. It is updated for the latest CI run.
```cpp
const size_t max_download_threads = local_context->getSettingsRef().max_download_threads;
const size_t max_threads = local_context->getSettingsRef().max_threads;
const size_t max_parsing_threads = num_streams >= max_threads ? 1 : (max_threads / num_streams);
```
Wait, isn't num_streams usually equal to max_threads? This seems to disable all parallelism when reading from one file (or a small number of them). I tried it, and indeed it's super slow on hits.parquet now.
This works in StorageURL because it clamps num_streams to min(num_streams, num_files) first, but here we don't know the number of files yet.
I guess we should do the first ListObjectsV2 request right here, and defer the continuation (if needed) to DisclosedGlobIterator. (I guess it would be a method in iterator_wrapper that would return the complete list if it's cheap enough to obtain, and nothing otherwise).
Hm, maybe we should also use max_download_threads instead of max_threads for parquet reading+parsing for remote files? Not sure why I didn't do it this way in the first place.
(I guess the way to do it is to calculate both max_threads / num_streams and max_download_threads / num_streams here, then make registerInputFormatParquet use is_remote_fs ? max_download_threads : max_parsing_threads as the number of threads. That is somewhat silly because is_remote_fs is always true in this case, so the max_threads / num_streams value will go unused.

The reason I made RandomAccessInputCreator take max_parsing_threads and max_download_threads separately is that I was going to make ParquetBlockInputFormat do reading in a separate thread pool from parsing. But then everything worked fine without it, so it doesn't seem worth the effort anymore. Maybe the separation of max_parsing_threads and max_download_threads should stay anyway, just in case.)
Oh, yes, my bad. I was confused by StorageURL and completely forgot that we don't have this num_streams limitation in StorageS3. I'll try to fix it ASAP.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
More careful thread management will improve the speed of the S3 table function over a large number of files by roughly 25% or more.