Improve S3 glob performance by zvonand · Pull Request #62120 · ClickHouse/ClickHouse

zvonand · 2024-03-31T20:17:31Z

Improve performance of processing selection ({}) glob in StorageS3. Fixes #53643, fixes #49929.
Also, fix performance degradation in some cases (see #62120 (comment)). Improvement for #54815 and #54936.

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improved performance of selection ({}) globs in StorageS3.

robot-ch-test-poll2 · 2024-03-31T20:39:01Z

This is an automated comment for commit 1bae2d9 with description of existing statuses. It's updated for the latest CI running

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	⏳ pending
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	❌ failure
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	❌ failure

Successful checks

Check name	Description	Status
A Sync	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
ClickBench	Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table	✅ success
ClickHouse build check	Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker keeper image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docker server image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	✅ success
Mergeable Check	Checks if all other necessary checks are successful	✅ success
PR Check	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	✅ success

zvonand · 2024-04-04T10:28:52Z

Right now, these globs work as described in #49929:
Imagine we have some <bucket>, and a lot of files that can be described as path{1,2,3.......100000}/file*.csv. From these files, we only want to select path{1,2,3}/file*.csv. When we do e.g. SELECT * FROM s3('<bucket>/path{1,2,3}/file*.csv'), CH retrieves all the files that have a prefix <bucket>/path. This can be thousands of files we don't need. This makes some simple queries impossible to run.
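To make the over-listing concrete, here is a minimal sketch of how a listing prefix could be derived from a globbed key, assuming the prefix is simply everything before the first glob metacharacter. The function name `listingPrefix` is hypothetical and only illustrates the behavior described above.

```cpp
#include <string>

// Hypothetical helper: the longest literal prefix of a key pattern,
// i.e. everything before the first glob metacharacter (*, ? or {).
std::string listingPrefix(const std::string & key_pattern)
{
    // find_first_of returns npos when the key has no glob at all;
    // substr(0, npos) then yields the whole key.
    return key_pattern.substr(0, key_pattern.find_first_of("*?{"));
}
```

For path{1,2,3}/file*.csv this yields path, so a key such as path100000/file1.csv is listed as well, even though the glob can never match it.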

There are two solutions I see:

1. Do on-demand (lazy) calls to ListObjectsV2 for each option inside the {} glob. In our case that means listing keys with prefix path1/file, then with prefix path2/file, then with prefix path3/file. This significantly reduces the number of calls to ListObjectsV2. But when initializing StorageS3, it performs only one call; all other calls are performed only when CH needs to read the next file. This leads to a problem:
This breaks a test, and I am not sure whether it breaks real workflows. The problem is with the test test_schema_inference_cache with globs. To test it, we currently run a describe query (e.g. DESC s3(bucket/file{1,2,3,4}.csvvvv)), then check for cache misses/hits on the next queries. The lazy approach will not work for the test: the describe query will not read all possible files from S3; it will take only those that have already been read on initialization and put them into the schema cache.
This works now because the test is trivial: there are only 4 files, which are all read on Storage initialization (current behavior: read all keys with prefix file), so the cache is populated properly.
With the new approach, only file1.csvvvv will be put into the cache in this test. BUT the select query will work perfectly, and all files will be added to the schema cache.
Actually, this is very close to the current behavior. If each selection option has many underlying files, the difference will be unnoticeable.
2. The second approach is similar to the first, but we do all these calls on initialization and store a complete list of files to be read. This can be rather expensive, as there can be thousands of files.
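Both options rely on splitting the selection glob into one pattern per option. A rough sketch of that expansion follows, assuming a single, non-nested {a,b,c} selection; the helper name `expandFirstSelection` is made up for illustration, and numeric ranges or nested braces are deliberately not handled.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: expand the first non-nested {a,b,c} selection in a pattern.
// Returns the pattern unchanged (as a single element) if it has no complete selection.
std::vector<std::string> expandFirstSelection(const std::string & pattern)
{
    const auto open = pattern.find('{');
    const auto close = (open == std::string::npos) ? std::string::npos : pattern.find('}', open);
    if (close == std::string::npos)
        return {pattern};

    const std::string head = pattern.substr(0, open);
    const std::string tail = pattern.substr(close + 1);
    const std::string options = pattern.substr(open + 1, close - open - 1);

    std::vector<std::string> result;
    std::string::size_type begin = 0;
    while (true)
    {
        const auto comma = options.find(',', begin);
        // substr clamps the length, so a missing comma means "up to the end".
        result.push_back(head + options.substr(begin, comma - begin) + tail);
        if (comma == std::string::npos)
            break;
        begin = comma + 1;
    }
    return result;
}
```

For path{1,2,3}/file*.csv this produces path1/file*.csv, path2/file*.csv and path3/file*.csv. Each of those can then be listed under its own literal prefix (path1/file, path2/file, path3/file), either lazily as in the first option or all at once as in the second.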

I like the first option more. I am not sure if people really use describe queries to populate the cache.

zvonand · 2024-04-11T11:11:28Z

@kssenii , please take a look, this time tests are OK, and I see no bugs left :)

zvonand · 2024-04-18T11:29:21Z

@kssenii ping

zvonand · 2024-04-25T09:44:26Z

On the latest changes (c9a3159 and 686ea6a):

The S3 iterator is lazy: it looks for new files only when necessary, and each ListObjectsV2 call lists up to 1000 objects. The first ListObjectsV2 call is done on initialization of the iterator itself.
And here is the thing: the estimate of the number of keys was based solely on the number of suitable files found in this first portion of listed objects. If we have a very "general" glob or a strong filter, it is common that none of the objects in that portion satisfy the criteria, and reading from S3 falls back to "failsafe" mode, reading objects one by one.

A couple of examples:

generic_prefix**.someformat with WHERE filtering by _path and there are no keys matching the filter among the first batch of files listed;
prefix/{a,b,c}*.someformat -- similar to the above, but after changes in this PR it would become a more common thing. Imagine there are no keys matching prefix/a*.someformat (which is listed first, on iterator initialization). Then, CH would read all these files sequentially.

A reproducible example of good vs bad performance can be found in this gist

TL;DR: IMO it is better to make reading from S3 "more parallel" in situations when we don't have any idea about how many objects we will read.
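The difference can be sketched as follows. This is only an illustration under assumptions: `ListingState`, `streamsFromFirstBatch` and `streamsWhenUnsure` are hypothetical names, and the real code in StorageS3.cpp is structured differently.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical snapshot of a lazy listing iterator after its first ListObjectsV2 call.
struct ListingState
{
    std::size_t matched_so_far;  // listed keys that passed the glob and the filter
    bool has_more;               // true if the listing was truncated and more batches remain
};

// Problematic variant: trusts the first batch alone, so zero matches means one stream.
std::size_t streamsFromFirstBatch(std::size_t max_streams, const ListingState & state)
{
    return std::max<std::size_t>(1, std::min(max_streams, state.matched_so_far));
}

// "More parallel" variant: while batches remain, the count is unknown, so keep all streams.
std::size_t streamsWhenUnsure(std::size_t max_streams, const ListingState & state)
{
    if (state.has_more)
        return max_streams;
    return std::max<std::size_t>(1, std::min(max_streams, state.matched_so_far));
}
```

With zero matches in a truncated first batch, the first function drops to a single stream, while the second keeps the requested parallelism until the listing is actually exhausted.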

zvonand · 2024-04-25T18:56:52Z

Test fails

Integration tests (tsan) [1/6]:
- test_mutation_with_broken_projection -- Flaky tests test_broken_projections #63002
Stateless tests (tsan, s3 storage) [4/5]
- 03000_traverse_shadow_system_data_paths -- Flaky test 03000_traverse_shadow_system_data_paths #63003
Stateless tests (tsan) [1/5]
- 03094_grouparraysorted_memory -- Flaky test 03094_grouparraysorted_memory #63086
Stress test (asan) -- it goes somewhere to DB::SerializationArray (according to the stacktrace), looks unrelated to this PR.

zvonand · 2024-04-28T12:08:48Z

@kssenii could you please take a look?

Enmk · 2024-05-06T11:27:42Z

src/Storages/StorageS3.h


        KeyWithInfoPtr next(size_t idx = 0) override; /// NOLINT
        size_t estimatedKeysCount() override;
+        bool hasMore();


Maybe get rid of hasMore(), and modify IIterator::estimatedKeysCount() to return int (instead of size_t), with the following semantics:

< 0 - means that estimation wasn't possible, there are potentially many keys

= 0 - no keys

> 0 - there are at least this number of keys in the bucket matching filtering criteria.

This would clean up the implementation a little bit and prevent knowledge about StorageS3Source::DisclosedGlobIterator (and its special protocols) from bleeding into other parts of the code.

It looks like int is going to be enough: it allows estimates of up to 2^31 - 1 keys. Anything more overflows into the < 0 case, which means "unable to estimate, probably a lot of keys". And it looks like the current implementation gives an estimate of 1000 keys at most.

or you can have constexpr size_t UNABLE_TO_ESTIMATE = std::numeric_limits<size_t>::max(); and treat that as a special value, without changing the signature.

This also sounds fine. The returned value doesn't exceed 1000 anyway.
And even if it could, the behavior should be the same for std::numeric_limits<size_t>::max() matching objects and for an unknown number of objects.
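A minimal sketch of the sentinel variant discussed above, assuming the caller only ever clamps the stream count by the estimate. `KeyIteratorSketch` and `chooseStreams` are hypothetical names used for illustration; only UNABLE_TO_ESTIMATE comes from the suggestion itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Sentinel from the suggestion above: "could not estimate, potentially many keys".
constexpr std::size_t UNABLE_TO_ESTIMATE = std::numeric_limits<std::size_t>::max();

// Hypothetical stand-in for a lazy glob iterator; the real class has more state.
struct KeyIteratorSketch
{
    std::size_t matched;     // matching keys found so far
    bool listing_finished;   // true once every batch has been listed

    std::size_t estimatedKeysCount() const
    {
        // Until the listing is complete, any count would be an underestimate.
        return listing_finished ? matched : UNABLE_TO_ESTIMATE;
    }
};

// The caller needs no hasMore(): min() leaves max_streams intact for the sentinel.
std::size_t chooseStreams(std::size_t max_streams, std::size_t estimated_keys)
{
    return std::max<std::size_t>(1, std::min(max_streams, estimated_keys));
}
```

Because the sentinel is the largest size_t, the caller treats "unknown" exactly like "very many keys" without a separate branch, which is the property the last comment points out.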

zvonand · 2024-05-06T11:35:57Z

@kssenii I will make some changes (simplify things) like stated here: #62120 (comment)
Could you also take a look after it?

zvonand · 2024-05-07T14:06:05Z

UPD: Stress test (ubsan) is unrelated. It is a coincidence; I see a similar one in master: https://s3.amazonaws.com/clickhouse-test-reports/0/21cbbdd983b092945a4fb6015179971d18c4dab9/stress_test__debug_.html

The Stress test (ubsan) fail looks strange.

Its report says that the query SELECT count() FROM s3('http://localhost:11111/test/test_1/test_INT_MAX.tsv'); hung. But this query is not related to the changes in this PR. This URI has no globs, DisclosedGlobIterator is not used, so none of the modified code could possibly be run.

Maybe it's just a coincidence, none of the previous commits had it.

zvonand · 2024-05-07T19:19:00Z

Fails (not related):

Stress test (ubsan) -- Improve S3 glob performance #62120 (comment)
Stateless tests (debug, s3 storage) [2/6]:
- 02362_part_log_merge_algorithm -- Flaky test 02362_part_log_merge_algorithm #63491
Stateless tests (debug, s3 storage) [6/6]:
- 02680_mysql_ast_logical_err -- Flaky test 02680_mysql_ast_logical_err #63492

zvonand changed the title ~~Try to improve StorageS3 selection glob performance~~ Improve StorageS3 selection glob performance Mar 31, 2024

alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 31, 2024

robot-ch-test-poll2 added the pr-performance Pull request with some performance improvements label Mar 31, 2024

zvonand added 4 commits April 2, 2024 11:07

try to improve Storage S3 selection glob performance

98c2048

Revert "try to improve Storage S3 selection glob performance"

73b9ef9

This reverts commit 9c9421b6897bf4a95346cef52171839ef67bd522.

simpler way

70da13b

ignore error when one of selection options not exist

a177fbf

zvonand force-pushed the zvonand-s3-globs branch from 070f331 to a177fbf Compare April 2, 2024 09:07

no reuse request

7232bf4

kssenii self-assigned this Apr 2, 2024

zvonand force-pushed the zvonand-s3-globs branch 3 times, most recently from 1347468 to a0a7357 Compare April 3, 2024 17:48

fix schema inference cache (1)

25cab6f

zvonand force-pushed the zvonand-s3-globs branch 3 times, most recently from cd2f704 to 7745f56 Compare April 3, 2024 20:57

zvonand force-pushed the zvonand-s3-globs branch from a770679 to 25cab6f Compare April 4, 2024 14:20

zvonand added 3 commits April 4, 2024 19:47

adapt test to new behavior

ce3969e

Merge branch 'master' of github.com:ClickHouse/ClickHouse into zvonan…

f4487d0

…d-s3-globs

fix black

e385810

zvonand force-pushed the zvonand-s3-globs branch from 7b30f2d to e385810 Compare April 4, 2024 20:18

zvonand marked this pull request as ready for review April 4, 2024 23:30

zvonand marked this pull request as draft April 8, 2024 11:59

zvonand added 2 commits April 8, 2024 14:04

Merge branch 'master' of github.com:ClickHouse/ClickHouse into zvonan…

98ca507

…d-s3-globs

fix reading of {} with more than 1000 objects under each

3c58e58

zvonand force-pushed the zvonand-s3-globs branch from 6a56008 to c022215 Compare April 9, 2024 20:50

added test for selection globs with many files under

093b71b

zvonand added 3 commits April 24, 2024 22:14

fix single-threading failsafe when number of files cannot be estimated

c9a3159

Merge branch 'master' of github.com:ClickHouse/ClickHouse into zvonan…

b83a592

…d-s3-globs

fix style and logic of estimation

686ea6a

zvonand changed the title ~~Improve StorageS3 selection glob performance~~ Improve S3 glob performance Apr 25, 2024

Enmk reviewed Apr 25, 2024

src/Storages/StorageS3.cpp (outdated, resolved)

fix tidy

b13c7d0

Enmk reviewed May 6, 2024

kssenii approved these changes May 6, 2024

simplify estimation of number of objects in bucket

731d054

kssenii reviewed May 6, 2024

src/Storages/StorageS3.cpp (resolved)

Enmk reviewed May 7, 2024

src/Storages/StorageS3.cpp (outdated, resolved)

zvonand force-pushed the zvonand-s3-globs branch from c73fbcc to 731d054 Compare May 7, 2024 10:54

update comment

1bae2d9

kssenii added this pull request to the merge queue May 7, 2024

Merged via the queue into ClickHouse:master with commit 60cf7cd May 7, 2024

robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label May 7, 2024

zvonand deleted the zvonand-s3-globs branch May 8, 2024 09:37

kssenii added a commit that referenced this pull request May 21, 2024

Apply changes from PR #62120

c192013

zvonand mentioned this pull request Dec 16, 2024

Optimize s3[Cluster]('s3://bucket/prefix/{a.json,b.json}') S3 list request count #73333

Open
