
Do not delay final part writing by default (fixes possible Memory limit exceeded during INSERT)#34780

Merged
KochetovNicolai merged 3 commits into ClickHouse:master from azat:mt-delayed-part-flush
Mar 17, 2022

Conversation

@azat
Member

@azat azat commented Feb 20, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Do not delay final part writing by default (fixes possible "Memory limit exceeded" during INSERT, by adding max_insert_delayed_streams_for_parallel_write, which defaults to 1000 for writes to S3 and is disabled otherwise, as before)

For async S3 writes, final part flushing was deferred until the whole INSERT block was processed; however, with too many partitions/columns you may exceed the max_memory_usage limit (since each stream has overhead).

Introduce max_insert_delayed_streams_for_parallel_write (defaulting to 1000 for writes to S3 and disabled otherwise, as before) to avoid this (and avoid introducing a regression).
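The memory blow-up described above can be sketched with a rough model (a hypothetical helper, not ClickHouse code; the ~1 MiB per-stream buffer size is taken from the discussion later in this thread): every delayed part keeps one buffered stream per column, so memory grows with partitions × columns unless the number of delayed streams is capped.

```cpp
#include <cstdint>

// ~1 MiB write buffer per open stream (assumed, per the discussion in this PR).
constexpr std::uint64_t BUFFER_BYTES = 1ull << 20;

// Worst case when all final part writes are deferred for the whole INSERT
// block: one open stream per column of every partition.
std::uint64_t worst_case_delayed_bytes(std::uint64_t partitions, std::uint64_t columns)
{
    return partitions * columns * BUFFER_BYTES;
}

// With the new setting: once the number of delayed streams would exceed the
// cap, earlier parts are flushed eagerly, bounding the buffered memory.
std::uint64_t capped_delayed_bytes(std::uint64_t partitions, std::uint64_t columns,
                                   std::uint64_t max_delayed_streams)
{
    // 0 disables delaying entirely: only one part's streams are open at a time.
    if (max_delayed_streams == 0)
        return columns * BUFFER_BYTES;
    std::uint64_t streams = partitions * columns;
    if (streams > max_delayed_streams)
        streams = max_delayed_streams;
    return streams * BUFFER_BYTES;
}
```

For the perf-test query with ~10000 partitions of a single column this model gives ~9.8 GiB uncapped, in line with the ~9.31 GiB limit hit in the logs below, versus ~1 GiB with the cap at 1000.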

This should fix "Memory limit exceeded" errors in performance tests.

Async s3 writes: #33291, #34219, #34215 (cc @KochetovNicolai )

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Feb 20, 2022
@azat azat force-pushed the mt-delayed-part-flush branch from 29ddb2b to 1e01c53 Compare February 21, 2022 08:30
@azat
Member Author

azat commented Feb 21, 2022

Note that performance tests will still show errors, since the problem exists in upstream/master:

# left-server-log.log
2022.02.21 04:11:17.028151 [ 336 ] {723d25b5-df4f-4cad-8a12-0a9a0333874a} <Error> executeQuery: Code: 241. DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 4225558 bytes), maximum: 9.31 GiB. (MEMORY_LIMIT_EXCEEDED) (version 22.3.1.1) (from [::1]:33884) (in query: INSERT INTO bad_partitions SELECT * FROM numbers(10000)), Stack trace (when copying this message, always include the lines below):

# right-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
...
2022.02.21 04:11:19.165839 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Information> executeQuery: Read 10000 rows, 78.13 KiB in 3.010562545 sec., 3321 rows/sec., 25.95 KiB/sec.

@KochetovNicolai
Member

Maybe, hm, we should enable these delayed streams only for the s3 disk? I don't like this new setting: it exposes an implementation detail to the user.

@KochetovNicolai KochetovNicolai self-assigned this Feb 21, 2022
@azat
Member Author

azat commented Feb 21, 2022

Maybe, hm, we should enable these delayed streams only for the s3 disk?

I thought about this, but even for s3 you should not defer too many parts.

I don't like this new setting: it exposes an implementation detail to the user.

Agree, me neither.

@KochetovNicolai
Member

Well, my estimate is that deferred insertions could cost about 2x memory in the worst case, which is not so bad. And I really don't like adding a new setting.
Probably, we can increase the memory limit for the failing test?

@azat
Member Author

azat commented Feb 22, 2022

Well, my estimate is that deferred insertions could cost about 2x memory in the worst case, which is not so bad. And I really don't like adding a new setting.

It is not just 2x memory; it depends on the number of columns (streams) and the number of partitions in the insert block.

Probably, we can increase the memory limit for the failing test?

The difference can be too large; I don't think it is a good idea to introduce such a regression.
That particular query from the perf tests (INSERT INTO bad_partitions) needs only 118 MB, but now even 10 GB is not enough:

# right-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
2022.02.21 04:11:19.165839 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Information> executeQuery: Read 10000 rows, 78.13 KiB in 3.010562545 sec., 3321 rows/sec., 25.95 KiB/sec.
2022.02.21 04:11:19.165899 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> MemoryTracker: Peak memory usage (for query): 118.36 MiB.
2022.02.21 04:11:19.165943 [ 333 ] {} <Debug> MemoryTracker: Peak memory usage (for query): 118.36 MiB.
2022.02.21 04:11:19.165951 [ 333 ] {} <Debug> TCPHandler: Processed in 3.010847653 sec.

# left-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
2022.02.21 04:11:17.028239 [ 336 ] {723d25b5-df4f-4cad-8a12-0a9a0333874a} <Error> TCPHandler: Code: 241. DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 4225558 bytes), maximum: 9.31 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

And I'm pretty sure that there are users that INSERT lots of partitions in one insert... (and they will come with their issues)

I don't like this new setting - it shows an implementation details to user.

I'm not a fan of introducing a new setting for everything.
Although it may be useful (if perhaps too internal) to let the user configure this themselves (since I doubt that even for S3 it is a good idea to defer all streams).

I can propagate information about the writer to the MergeTreeSink and do the delaying only for it, with a static batch of 100 parts (the same as the thread pool size for s3, even though these are slightly different things); in the future this can be extended to give benefits for:

  • multi-disk configuration
  • fsync_after_insert

So what are your final thoughts on this?
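The batching idea discussed above (delay streams for parallel write, but flush everything once a threshold of delayed streams is exceeded) can be sketched as follows. All names and types here are hypothetical stand-ins, not the actual MergeTreeSink code:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// One finished part whose final write has been postponed.
struct DelayedPart
{
    std::size_t streams;            // one open stream per column of the part
    std::function<void()> finalize; // performs the final part write
};

// Accumulates delayed parts; flushes the whole batch when the number of
// delayed streams exceeds the configured threshold.
class DelayedFlushQueue
{
public:
    explicit DelayedFlushQueue(std::size_t max_delayed_streams)
        : max_delayed_streams_(max_delayed_streams) {}

    void add(DelayedPart part)
    {
        delayed_streams_ += part.streams;
        parts_.push_back(std::move(part));
        // 0 means "never delay" (the pre-patch behaviour for ordinary disks):
        // flush immediately. Otherwise flush once the threshold is exceeded.
        if (max_delayed_streams_ == 0 || delayed_streams_ > max_delayed_streams_)
            flush();
    }

    void flush()
    {
        for (auto & part : parts_)
            part.finalize();
        parts_.clear();
        delayed_streams_ = 0;
    }

    std::size_t delayedStreams() const { return delayed_streams_; }

private:
    std::size_t max_delayed_streams_;
    std::size_t delayed_streams_ = 0;
    std::vector<DelayedPart> parts_;
};
```

With the threshold at 1000 (the eventual default for parallel-write disks), memory stays bounded while still letting batches of parts be written in parallel.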

@KochetovNicolai
Member

The table from the perftest is a little bit extreme. Indeed, even 10GB is not enough.
I suppose it is because the default buffer size is ~1M, even if we write only a single value there.

Still, this use case is not impossible. I think I'm almost ok with a new setting. It would be better to solve it some other way, but I can't find anything better.

However, setting max_insert_delayed_streams=1 eliminates all of the performance gain for S3 parallel inserts. I think we can use something like 1000 by default. And, maybe, use it only in the case of DiskS3.

@azat azat force-pushed the mt-delayed-part-flush branch from 1e01c53 to e220d29 Compare February 27, 2022 09:44
@azat
Member Author

azat commented Feb 27, 2022

I think we can use something like 1000 by default. And, maybe, use it only in the case of DiskS3.

Done.

However, setting max_insert_delayed_streams=1 eliminates all of the performance gain for S3 parallel inserts

By the way, do you have any numbers on how much parallel writes help?

@azat azat force-pushed the mt-delayed-part-flush branch from e220d29 to 308a782 Compare February 27, 2022 09:54
@KochetovNicolai
Member

Sorry for not answering for a long time.
Still, I think it's better to set max_insert_delayed_streams to 1000 by default, but enable it only for disks which support parallel write (so for an ordinary disk it is always disabled).

Maybe we should rename it to max_insert_delayed_streams_for_parallel_writes or something similar.

@azat
Member Author

azat commented Mar 7, 2022

Still, I think it's better to set max_insert_delayed_streams to 1000 by default, but enable it only for disks which support parallel write (so for an ordinary disk it is always disabled).

It is 1000 by default for S3 in the current version of the patch:
https://github.com/ClickHouse/ClickHouse/pull/34780/files#diff-f1ce98e0ed65fef7539d3eb0f1dc809b184353e05e32ac9ac63f17030a4e6c79R63

Maybe we should rename it to max_insert_delayed_streams_for_parallel_writes or something similar

Ok.

@azat azat marked this pull request as draft March 7, 2022 11:28
@azat azat marked this pull request as ready for review March 8, 2022 05:12
@azat azat force-pushed the mt-delayed-part-flush branch from 9208e96 to e17a8c0 Compare March 8, 2022 05:13
@azat
Member Author

azat commented Mar 8, 2022

Done (also, I had forgotten about ReplicatedMergeTree before; the new version of this patch set addresses it).

azat added 2 commits March 8, 2022 22:17
For async s3 writes final part flushing was deferred until the whole INSERT
block was processed, however in case of too many partitions/columns you
may exceed the max_memory_usage limit (since each stream has overhead).

Introduce max_insert_delayed_streams_for_parallel_writes (with a default
of 1000 for S3, 0 otherwise) to avoid this.

This should fix "Memory limit exceeded" errors in performance tests.

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the mt-delayed-part-flush branch from e17a8c0 to 3a5a39a Compare March 8, 2022 19:18
@azat
Member Author

azat commented Mar 8, 2022

Stateless tests flaky check (address, actions) — Timeout, fail: 0, passed: 0

Added the no-parallel tag for the test.

@azat
Member Author

azat commented Mar 9, 2022

@azat
Member Author

azat commented Mar 16, 2022

@KochetovNicolai this looks ready, can you take a look please?

@KochetovNicolai KochetovNicolai merged commit ee9c2ec into ClickHouse:master Mar 17, 2022
@azat azat deleted the mt-delayed-part-flush branch March 17, 2022 14:45

/// Whether this disk supports parallel write.
/// Overridden in remote fs disks.
virtual bool supportParallelWrite() const { return false; }
Member

btw, looks like this never worked, because remote disks are wrapped into DiskDecorator (for DiskRestartProxy and DiskCacheWrapper, the old cache version which is still turned on by default), and for DiskDecorator this method was not overridden: #38792
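A minimal illustration of the pitfall described above (class names here are hypothetical stand-ins for IDisk/DiskDecorator, not the actual ClickHouse classes): a decorator that does not override the virtual predicate silently reports the base-class default instead of the wrapped disk's answer.

```cpp
#include <memory>
#include <utility>

struct IDisk
{
    virtual ~IDisk() = default;
    /// Whether this disk supports parallel write; overridden in remote fs disks.
    virtual bool supportParallelWrite() const { return false; }
};

struct RemoteDisk : IDisk
{
    bool supportParallelWrite() const override { return true; }
};

// Buggy decorator: forgets to forward the virtual call, so a wrapped
// RemoteDisk reports false and delayed streams never get enabled.
struct BuggyDecorator : IDisk
{
    std::shared_ptr<IDisk> delegate;
    explicit BuggyDecorator(std::shared_ptr<IDisk> d) : delegate(std::move(d)) {}
};

// Fixed decorator (the shape of the #38792 fix): forward to the wrapped disk.
struct FixedDecorator : IDisk
{
    std::shared_ptr<IDisk> delegate;
    explicit FixedDecorator(std::shared_ptr<IDisk> d) : delegate(std::move(d)) {}
    bool supportParallelWrite() const override { return delegate->supportParallelWrite(); }
};
```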

Member

@azat the test added in this PR does not work after the issue I described above is fixed (see the #38792 checks). If this test does not actually check anything now, I will remove it for now; do you agree?

Member Author

@azat azat Jul 5, 2022

The test uses an explicit max_insert_delayed_streams_for_parallel_write, so it works correctly.

As for the failure in #38792: you can set max_insert_delayed_streams_for_parallel_write=0 explicitly for the INSERT without this option, and then it will not matter whether the disk is remote or not.

Also, I would recommend adding another test to ensure that S3 does use parallel writes (via MEMORY_LIMIT_EXCEEDED during INSERT, or thread_ids from query_log).

@LGDHuaOPER

At https://clickhouse.com/docs/en/integrations/s3/s3-merge-tree there is the following description:

Writes are performed in parallel, with a maximum of 100 concurrent file writing threads. max_insert_delayed_streams_for_parallel_write, which has a default value of 1000, controls the number of S3 blobs written in parallel. Since a buffer is required for each file being written (~1MB), this effectively limits the memory consumption of an INSERT. It may be appropriate to lower this value in low server memory scenarios.

I have a question: does this indicate that, under the default configuration, using an S3 table or MergeTree with an s3-disk policy, the server requires at least 100 GiB of memory (100 * 1000 * 1 MB)?

Where can I modify the configuration of 100 concurrent file writing threads?

@azat
Member Author

azat commented Aug 8, 2022

I have a question, whether this indicates that under the default configuration, use S3 Table or MergeTree with s3-disk policy, the server requires at least 100 GIB memory(100 * 1000 * 1MB)?

No, it is ~1GiB (1000*1MiB)
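The arithmetic behind this answer, as a small compile-time sketch (the ~1 MiB buffer size comes from the documentation quoted above): the 100-thread pool only drains already-buffered streams, so it does not multiply the memory bound.

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1ull << 20;
constexpr std::uint64_t GiB = 1ull << 30;

// Memory is bounded by the number of delayed streams alone (~1 MiB buffer
// per stream); the size of the writing thread pool is irrelevant here.
constexpr std::uint64_t insert_buffer_bound(std::uint64_t max_delayed_streams)
{
    return max_delayed_streams * MiB;
}

// With the default of 1000 delayed streams the bound is ~1 GiB, not 100 GiB.
static_assert(insert_buffer_bound(1000) == 1000 * MiB, "bound is streams * buffer");
static_assert(insert_buffer_bound(1000) < GiB, "~1 GiB, under one binary GiB");
```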

Where can I modify the configuration of 100 concurrent file writing threads?

You can adjust the max_insert_delayed_streams_for_parallel_write setting, but that is not how many threads S3 writes will use.
You cannot modify the size of the thread pool itself; it is a static constant in the code:

constexpr size_t pool_size = 100;

But if you really need this, and have good numbers to prove it, you can submit a patch/PR.

@LGDHuaOPER

Thank you for your answer! My real intention is to find the real cause of memory jitter when I use S3. Please refer to #38839.

I really can't find the reason. My situation is like this: there is no memory problem under normal circumstances. Once it reaches the time to move data to the S3 disk (ssd_s3 policy), or at the same time there are about tens of thousands of rows per second of data insertion (about 1 KB per row), the memory shakes like this. I'm very frustrated.

@LGDHuaOPER

Then I have a little question: where is the setting max_insert_delayed_streams_for_parallel_write configured? Is it in the configuration file users.xml -> profiles -> default? Thanks.

@azat
Member Author

azat commented Aug 9, 2022

I really can't find the reason.

Have you tried to play with this setting?
But it should not eat 40 GiB more memory than without S3, unless you have tons of parallel INSERTs though...

There is no memory problem under normal circumstances. Once it reaches the time to move data to the S3 disk (ssd_s3 policy), or at the same time there are about tens of thousands of rows per second of data insertion (about 1 KB per row)

This setting (and the initial optimization) is related only to INSERT; it is not related to merges/TTL moves, so I don't see how it is related.

Then I have a little question: where is the setting max_insert_delayed_streams_for_parallel_write configured? Is it in the configuration file users.xml -> profiles -> default? Thanks.

Yes, or on a per-query basis.
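A sketch of both options (the file name and the value 500 are illustrative only, following the standard ClickHouse configuration layout):

```xml
<!-- e.g. a file under users.d/, merged into users.xml -->
<clickhouse>
    <profiles>
        <default>
            <!-- lower than the built-in default of 1000 to reduce INSERT memory -->
            <max_insert_delayed_streams_for_parallel_write>500</max_insert_delayed_streams_for_parallel_write>
        </default>
    </profiles>
</clickhouse>
```

Per query, the same setting can be passed inline, e.g. `INSERT INTO t SETTINGS max_insert_delayed_streams_for_parallel_write = 500 VALUES ...`.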

Please refer to #38839

You should do what @kssenii suggested.


Labels

pr-bugfix Pull request with bugfix, not backported by default
