Optimise hot reading from remote tables #49287
alexey-milovidov merged 17 commits into ClickHouse:master from
Conversation
This is an automated comment for commit b2921df with a description of existing statuses. It's updated for the latest CI run.

Force-pushed from 0cfb561 to 30d216f
```cpp
, backoff_settings{context_->getSettingsRef()}
, backoff_state{threads_}
{
    if (std::ranges::count(is_part_on_remote_disk, true))
```
Hm, `is_part_on_remote_disk` is initialized in `fillPerPartInfo()` below this check, so isn't this check always false?
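The suspected bug can be reduced to a minimal initialization-order sketch (names and structure hypothetical, not the actual ClickHouse classes): a member inspected in the constructor before the method that fills it has run will always look empty.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical reduction of the suspected bug: the vector is checked
// before fillPerPartInfo() has populated it, so count(..., true) is
// always 0 at that point and the branch is never taken.
struct PoolSketch
{
    std::vector<bool> is_part_on_remote_disk;
    bool checked_before_fill = false;

    PoolSketch()
    {
        // Here the vector is still empty, regardless of the parts' disks.
        checked_before_fill = std::count(is_part_on_remote_disk.begin(),
                                         is_part_on_remote_disk.end(), true) > 0;
        fillPerPartInfo();
    }

    void fillPerPartInfo()
    {
        // Only now do we learn which parts live on a remote disk.
        is_part_on_remote_disk = {true, false, true};
    }
};
```

Moving the check after the `fillPerPartInfo()` call (or into it) would make it observe the real values.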
```diff
-static constexpr int FILECACHE_DEFAULT_MAX_FILE_SEGMENT_SIZE = 100 * 1024 * 1024;
+static constexpr int FILECACHE_DEFAULT_MAX_FILE_SEGMENT_SIZE = 8 * 1024 * 1024;
```
First of all, why change it: since we started sharing segments between tasks (because of alignment), some tasks that actually need to read only from the end of a segment could end up reading the whole segment during pre-download, so it seems logical to limit the segment size.

From what perspective do you think it might be too small? The per-segment overhead during processing seems very small; the only downside I can think of is that the number of GetObject requests will probably increase.
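A back-of-the-envelope sketch of the trade-off (numbers assumed for illustration, not measured): if every file segment ends up as its own range request, shrinking the default segment size from 100 MiB to 8 MiB multiplies the request count for a large file.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// Number of GetObject range requests needed to read `file_size` bytes
// when each file segment is fetched with a separate request (ceil division).
constexpr std::size_t requestsPerFile(std::size_t file_size, std::size_t segment_size)
{
    return (file_size + segment_size - 1) / segment_size;
}
```

For a hypothetical 1 GiB file this gives 11 requests with 100 MiB segments versus 128 with 8 MiB segments, which is exactly the concern raised below.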
> the only downside I could think of is prob number of GetObject-s will increase.
Yeah, this is what I was thinking about. A solution could be to change `CachedOnDiskReadBufferFromFile` around the `setReadUntilPosition` method: currently it sets the end to the end of the file segment, which defines the boundaries of the range request, so reading the next file segment means making a new GetObject request. To improve this, we could `setReadUntilPosition` to the right boundary of the last non-downloaded file segment in a row. The only complication is controlling the boundary of writing into the current file segment, as our buffer might contain more.

This increase in the number of GetObject requests can actually be very serious: I saw several times that the profile event counting the time spent just making a GetObject request (in the `initialize()` method of `ReadBufferFromS3`) randomly spikes to >1 sec, more and more often when the number of such requests is high.
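The proposed coalescing — extend the read-until position to the right boundary of the last non-downloaded file segment in a row — can be sketched in isolation (the segment representation below is assumed, not ClickHouse's actual `FileSegment` API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical segment descriptor: half-open range [begin, end) plus a flag.
struct Segment
{
    std::size_t begin;
    std::size_t end;
    bool downloaded;
};

// Starting from `offset`, walk forward over the consecutive run of
// non-downloaded segments and return the right boundary of the last one.
// A single GetObject range request [offset, boundary) then covers the
// whole run instead of issuing one request per segment.
std::size_t readUntilBoundary(const std::vector<Segment> & segments, std::size_t offset)
{
    std::size_t boundary = offset;
    for (const auto & s : segments)
    {
        if (s.end <= offset)
            continue;   // already behind the read position
        if (s.downloaded)
            break;      // the run of non-downloaded segments ends here
        boundary = s.end;
    }
    return boundary;
}
```

The complication mentioned above still applies: the buffer may then hold bytes belonging to a later segment, so writes into the current segment must be bounded separately.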
I also saw this effect; in my case it was because of connection timeouts that start to happen more frequently when we increase the number of parallel requests. For this I added a connection pool for S3 reading in the second PR.

And I like your idea of coalescing reads at the GetObject level.
```cpp
/// We allocate buffers not less than 1M so that s3 requests will not be too small. But the same buffers (members of AsynchronousReadIndirectBufferFromRemoteFS)
/// are used for reading from files. Some of these readings are fairly small and their performance degrade when we use big buffers (up to ~20% for queries like Q23 from ClickBench).
if (use_external_buffer && read_type == ReadType::CACHED && settings.local_fs_buffer_size < internal_buffer.size())
    internal_buffer.resize(settings.local_fs_buffer_size);
```
Below, when calling `predownload()`, we might decide to bypass the cache and change the read method, so we need to resize the buffer back somewhere there.
If I'm not missing something, we only resize when `read_type == ReadType::CACHED`, and in the only place where we assign a positive value to `bytes_to_predownload` we also set `read_type = REMOTE_FS_READ_AND_PUT_IN_CACHE`.
```cpp
static size_t chooseBufferSize(const ReadSettings & settings, size_t file_size)
{
    /// Buffers used for prefetch or pre-download better to have enough size, but not bigger than the whole file.
    return std::min<size_t>(std::max<size_t>(settings.prefetch_buffer_size, DBMS_DEFAULT_BUFFER_SIZE), file_size);
```
Is the max with `DBMS_DEFAULT_BUFFER_SIZE` necessary? It seems redundant: if the user makes the buffer too small, that's their fault. Since `prefetch_buffer_size` is equal to `DBMS_DEFAULT_BUFFER_SIZE` by default, it should be OK.
We adjust the buffer size (including the prefetch buffer) based on mark range sizes, and those can be fairly small (a few KBs). Since we use these buffers to download segments that (almost) cannot be smaller than 1 MB, I think it makes sense to enforce this lower bound.
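The clamping behaviour being discussed — raise the buffer to at least the default size, but never beyond the file size — can be reproduced in isolation (the constant below is assumed to mirror `DBMS_DEFAULT_BUFFER_SIZE`, 1 MiB, for illustration):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Assumed stand-in for ClickHouse's DBMS_DEFAULT_BUFFER_SIZE (1 MiB).
constexpr std::size_t DEFAULT_BUFFER_SIZE = 1048576;

// Sketch of the logic under discussion: enforce the lower bound so segment
// downloads aren't tiny, then cap by the file size so we never allocate
// more than the whole file.
constexpr std::size_t chooseBufferSize(std::size_t prefetch_buffer_size, std::size_t file_size)
{
    return std::min(std::max(prefetch_buffer_size, DEFAULT_BUFFER_SIZE), file_size);
}
```

So a prefetch buffer tuned down to a few KBs would still be promoted to the 1 MiB floor, which is the reviewer's point about segment downloads.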
```cpp
/// we still have enough span logs for the execution of external queries.
std::shared_ptr<OpenTelemetry::SpanHolder> query_span = internal ? nullptr : std::make_shared<OpenTelemetry::SpanHolder>("query");
if (query_span)
    LOG_DEBUG(&Poco::Logger::get("executeQuery"), "Query span trace_id for opentelemetry log: {}", query_span->trace_id);
```
Maybe it's worth checking `query_span->isTraceEnabled()`, to print the trace_id only if tracing is enabled?
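The suggestion amounts to guarding the log line on whether the span actually carries a sampled trace. A minimal sketch, with a hypothetical stand-in for `OpenTelemetry::SpanHolder` (the real class differs):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in: tracing counts as enabled when trace_id is non-zero.
struct SpanSketch
{
    std::uint64_t trace_id = 0;
    bool isTraceEnabled() const { return trace_id != 0; }
};

// Produce the log line only when tracing is enabled, so queries with
// tracing disabled don't emit a useless all-zero trace_id message.
std::string traceLogLine(const SpanSketch & span)
{
    if (!span.isTraceEnabled())
        return {};
    return "Query span trace_id for opentelemetry log: " + std::to_string(span.trace_id);
}
```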
```cpp
    parts_queue.push(std::move(info.second));
}

LOG_DEBUG(log, "min_marks_for_concurrent_read={}", min_marks_for_concurrent_read);
```
This looks like it deserves to be under `LOG_TEST`.