Skip to content

Improve S3 glob performance#62120

Merged
kssenii merged 22 commits intoClickHouse:masterfrom
zvonand:zvonand-s3-globs
May 7, 2024
Merged

Improve S3 glob performance#62120
kssenii merged 22 commits intoClickHouse:masterfrom
zvonand:zvonand-s3-globs

Conversation

@zvonand
Copy link
Copy Markdown
Contributor

@zvonand zvonand commented Mar 31, 2024

Improve performance of processing selection ({}) glob in StorageS3. Fixes #53643, fixes #49929.
Also, fix performance degradation in some cases (see #62120 (comment)). Improvement for #54815 and #54936.

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Improved performance of selection ({}) globs in StorageS3.


Modify your CI run:

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@zvonand zvonand changed the title Try to improve StorageS3 selection glob performance Improve StorageS3 selection glob performance Mar 31, 2024
@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 31, 2024
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-performance Pull request with some performance improvements label Mar 31, 2024
@robot-ch-test-poll2
Copy link
Copy Markdown
Contributor

robot-ch-test-poll2 commented Mar 31, 2024

This is an automated comment for commit 1bae2d9 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors❌ failure
Successful checks
Check nameDescriptionStatus
A SyncThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
PR CheckThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@zvonand zvonand force-pushed the zvonand-s3-globs branch from 070f331 to a177fbf Compare April 2, 2024 09:07
@kssenii kssenii self-assigned this Apr 2, 2024
@zvonand zvonand force-pushed the zvonand-s3-globs branch 3 times, most recently from 1347468 to a0a7357 Compare April 3, 2024 17:48
@zvonand zvonand force-pushed the zvonand-s3-globs branch 3 times, most recently from cd2f704 to 7745f56 Compare April 3, 2024 20:57
@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 4, 2024

Right now, these globs work as described in #49929:
Imagine we have some <bucket>, and a lot of files that can be described as path{1,2,3.......100000}/file*.csv. From these files, we only want to select path{1,2,3}/file*.csv. When we do e.g. SELECT * FROM s3('<bucket>/path{1,2,3}/file*.csv'), CH retrieves all the files that have a prefix <bucket>/path. This can be thousands of files we don't need. This makes some simple queries impossible to run.

There are two solutions I see:

  1. Doing on-demand (lazy) calls to ListObjectsV2 for each option inside {} glob, in our case it will be listing keys with prefix path1/file, then with prefix path2/file, then with prefix path3/file. This significantly reduces number of calls to ListObjectsV2. But when initializing StorageS3, it only performs one call, all other calls are performed only when CH needs to read the next file. This leads to a problem:
    This breaks a test, and I am not sure if it really breaks real workflows. The problem is with test test_schema_inference_cache with globs. To test it, currently we run a describe query (e.g.DESC s3(bucket/file{1,2,3,4}.csvvvv), then check for cache misses/hits on next queries. This lazy approach will not work for the test: describe query will not read all possible files from S3, it will take those that have already been read on initialization, and also put them into schema cache.
    This works now because the test is trivial: there are only 4 files, that are also read on Storage initialization (current behavior: read all keys with prefix file), and the cache is populated properly.
    With this new approach, in this test only file1.csvvvv will be put into cache. BUT select query will work perfect, and all files will be added to schema cache.
    Actually, it is very close to the current behavior. If each selection has many underlying files, this difference will be unnoticeable.

  2. The second approach is similar to the first, but we do all these calls on initialization and store a complete list of files to be read. This can be rather expensive, as there can be thousands of files.

I like the first option more. I am not sure if people really use describe queries to populate the cache.

@zvonand zvonand force-pushed the zvonand-s3-globs branch from a770679 to 25cab6f Compare April 4, 2024 14:20
@zvonand zvonand force-pushed the zvonand-s3-globs branch from 7b30f2d to e385810 Compare April 4, 2024 20:18
@zvonand zvonand marked this pull request as ready for review April 4, 2024 23:30
@zvonand zvonand marked this pull request as draft April 8, 2024 11:59
@zvonand zvonand force-pushed the zvonand-s3-globs branch from 6a56008 to c022215 Compare April 9, 2024 20:50
@zvonand

This comment was marked as outdated.

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 11, 2024

@kssenii , please take a look, this time tests are OK, and I see no bugs left :)

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 18, 2024

@kssenii ping

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 25, 2024

On the latest changes (c9a3159 and 686ea6a):

S3 iterator is lazy: it looks for new files only when necessary, and each ListObjectsV2 lists 1000 objects. First ListObjectsV2 is done on initialization of the iterator itself.
And here is the thing: the estimation of number of keys was solely based on number of suitable files found in this first portion of listed objects. If we have a very "general" glob or if we have a strong filter, it is a common case that none of the objects satisfy the criteria, and reading from S3 throws to "failsafe" mode, reading objects one by one.

A couple of examples:

  • generic_prefix**.someformat with WHERE filtering by _path and there are no keys matching the filter among the first batch of files listed;
  • prefix/{a,b,c}*.someformat -- similar to the above, but after changes in this PR it would become a more common thing. Imagine there are no keys matching prefix/a*.someformat (which is listed first, on iterator initialization). Then, CH would read all these files sequentially.

A reproducible example of good vs bad performance con be found in this gist

TL;DR: IMO it is better to make reading from S3 "more parallel" in situations when we don't have any idea about how many objects we will read.

@zvonand zvonand changed the title Improve StorageS3 selection glob performance Improve S3 glob performance Apr 25, 2024
@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 25, 2024

Test fails

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented Apr 28, 2024

@kssenii could you please take a look?


KeyWithInfoPtr next(size_t idx = 0) override; /// NOLINT
size_t estimatedKeysCount() override;
bool hasMore();
Copy link
Copy Markdown
Contributor

@Enmk Enmk May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe get rid of hasMore(), and modify IITerator::estimatedKeysCount() to return int (instead of size_t), with following semantics:

  • < 0 - means that estimation wasn't possible, there are potentially many keys
  • = 0 - no keys
  • > 0 - there are at least this number of keys in the bucket matching filtering criteria.

This would clear implementation a little bit and prevent bleeding of knowledge about StorageS3Source::DisclosedGlobIterator (and its special protocols) into other parts of the code.

It looks like int is going to be enough, that provides estimation of up to 2*31 keys. Anything more overflows into the < 0 case, which means (not unable to estimate, probably a lot of keys). And it looks like current implementation gives estimation of 1000 keys max.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you can have constexpr size_t UNABLE_TO_ESTIMATE = std::numeric_limits<size_t>::max(); and treat that as a special value, without changing the signature.

Copy link
Copy Markdown
Contributor Author

@zvonand zvonand May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also sounds fine. The returned value doesn't exceed 1000 anyway.
And even if it could, the behavior shall be the same for std::numeric_limits<size_t>::max() matching objects and for unknown number of objects

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented May 6, 2024

@kssenii I will make some changes (simplify things) like stated here: #62120 (comment)
Could you also take a look after it?

@zvonand

This comment was marked as outdated.

@zvonand zvonand force-pushed the zvonand-s3-globs branch from c73fbcc to 731d054 Compare May 7, 2024 10:54
@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented May 7, 2024

UPD: Stress test (ubsan) is unrelated. It is a coincidence, I see similar one in master: https://s3.amazonaws.com/clickhouse-test-reports/0/21cbbdd983b092945a4fb6015179971d18c4dab9/stress_test__debug_.html

The Stress test (ubsan) fail looks strange.

Its report says that a query SELECT count() FROM s3(\'http://localhost:11111/test/test_1/test_INT_MAX.tsv\'); was hung. But this query is not related to the changes in this PR. This URI has no globs, DisclosedGlobIterator is not used, so none of the modified code could possibly be run.

Maybe it's just a coincidence, none of the previous commits had it.

@zvonand
Copy link
Copy Markdown
Contributor Author

zvonand commented May 7, 2024

Fails (not related):

@kssenii kssenii added this pull request to the merge queue May 7, 2024
Merged via the queue into ClickHouse:master with commit 60cf7cd May 7, 2024
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label May 7, 2024
@zvonand zvonand deleted the zvonand-s3-globs branch May 8, 2024 09:37
kssenii added a commit that referenced this pull request May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Queries to S3 having glob patterns takes a long time to complete S3 Wildcard Issue When Using OR (Not usable)

6 participants