Improve S3 glob performance #62120
Conversation
This is an automated comment for commit 1bae2d9 with a description of existing statuses. It's updated for the latest CI run. ❌
This reverts commit 9c9421b6897bf4a95346cef52171839ef67bd522.
Right now, these globs work as described in #49929. There are two solutions I see:
I like the first option more. I am not sure if people really use …
@kssenii, please take a look: this time tests are OK, and I see no bugs left :)
@kssenii ping
On the latest changes (c9a3159 and 686ea6a): the S3 iterator is lazy, i.e. it looks for new files only when necessary. A couple of examples:
A reproducible example of good vs bad performance can be found in this gist. TL;DR: IMO it is better to make reading from S3 "more parallel" in situations where we don't have any idea how many objects we will read.
Test fails
@kssenii could you please take a look?
src/Storages/StorageS3.h
KeyWithInfoPtr next(size_t idx = 0) override; /// NOLINT
size_t estimatedKeysCount() override;
bool hasMore();
Maybe get rid of hasMore(), and modify IIterator::estimatedKeysCount() to return int (instead of size_t), with the following semantics:
- < 0: estimation wasn't possible, there are potentially many keys
- = 0: no keys
- > 0: there are at least this number of keys in the bucket matching the filtering criteria

This would clean up the implementation a little and prevent knowledge about StorageS3Source::DisclosedGlobIterator (and its special protocols) from bleeding into other parts of the code.
It looks like int is going to be enough: that provides an estimation of up to 2^31 keys. Anything more overflows into the < 0 case, which means "unable to estimate, probably a lot of keys". And it looks like the current implementation gives an estimation of 1000 keys max.
Or you can have constexpr size_t UNABLE_TO_ESTIMATE = std::numeric_limits<size_t>::max(); and treat that as a special value, without changing the signature.
This also sounds fine. The returned value doesn't exceed 1000 anyway.
And even if it could, the behavior should be the same for std::numeric_limits<size_t>::max() matching objects and for an unknown number of objects.
@kssenii I will make some changes (simplify things) as stated here: #62120 (comment)
UPD: Stress test (ubsan) is unrelated. It is a coincidence; I see a similar one in master: https://s3.amazonaws.com/clickhouse-test-reports/0/21cbbdd983b092945a4fb6015179971d18c4dab9/stress_test__debug_.html
The Stress test (ubsan) failure looks strange. Its report says that a query … Maybe it's just a coincidence; none of the previous commits had it.
Fails (not related):
Improve performance of processing selection ({}) globs in StorageS3. Fixes #53643, fixes #49929. Also, fix performance degradation in some cases (see #62120 (comment)). Improvement for #54815 and #54936.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Improved performance of selection ({}) globs in StorageS3.
Modify your CI run:
NOTE: If you merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step
Include tests (required builds will be added automatically):
Exclude tests:
Extra options:
Only specified batches in multi-batch jobs: