
Do not delay final part writing by default (fixes possible Memory limit exceeded during INSERT)#34780

Merged
KochetovNicolai merged 3 commits into ClickHouse:master from azat:mt-delayed-part-flush
Mar 17, 2022

Conversation

@azat
Member

@azat azat commented Feb 20, 2022

Changelog category (leave one):

  • Bug Fix (user-visible misbehaviour in official stable or prestable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Do not delay final part writing by default (fixes possible "Memory limit exceeded" during INSERT, by adding max_insert_delayed_streams_for_parallel_write, which defaults to 1000 for writes to S3 and is disabled otherwise, as before)

For async S3 writes, final part flushing was deferred until the whole INSERT block was processed; however, with too many partitions/columns you may exceed the max_memory_usage limit (since each stream has overhead).

Introduce max_insert_delayed_streams_for_parallel_write (defaulting to 1000 for writes to S3 and disabled otherwise, as before) to avoid this (and avoid introducing a regression).
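The memory blow-up described above can be sketched with a rough model (a hypothetical helper, not ClickHouse code; the ~1 MiB per-stream buffer size is taken from the discussion later in this thread): every delayed part keeps one buffered stream per column, so memory grows with partitions × columns unless the number of delayed streams is capped.

```cpp
#include <cstdint>

// ~1 MiB write buffer per open stream (assumed, per the discussion in this PR).
constexpr std::uint64_t BUFFER_BYTES = 1ull << 20;

// Worst case when all final part writes are deferred for the whole INSERT
// block: one open stream per column of every partition.
std::uint64_t worst_case_delayed_bytes(std::uint64_t partitions, std::uint64_t columns)
{
    return partitions * columns * BUFFER_BYTES;
}

// With the new setting: once the number of delayed streams would exceed the
// cap, earlier parts are flushed eagerly, bounding the buffered memory.
std::uint64_t capped_delayed_bytes(std::uint64_t partitions, std::uint64_t columns,
                                   std::uint64_t max_delayed_streams)
{
    // 0 disables delaying entirely: only one part's streams are open at a time.
    if (max_delayed_streams == 0)
        return columns * BUFFER_BYTES;
    std::uint64_t streams = partitions * columns;
    if (streams > max_delayed_streams)
        streams = max_delayed_streams;
    return streams * BUFFER_BYTES;
}
```

For the perf-test query with ~10000 partitions of a single column this model gives ~9.8 GiB uncapped, in line with the ~9.31 GiB limit hit in the logs below, versus ~1 GiB with the cap at 1000.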

This should fix "Memory limit exceeded" errors in performance tests.

Async s3 writes: #33291, #34219, #34215 (cc @KochetovNicolai )

@robot-clickhouse robot-clickhouse added the pr-bugfix Pull request with bugfix, not backported by default label Feb 20, 2022
@azat azat force-pushed the mt-delayed-part-flush branch from 29ddb2b to 1e01c53 Compare February 21, 2022 08:30
@azat
Member Author

azat commented Feb 21, 2022

Note that performance tests will still show errors, since the problem exists in upstream/master:

# left-server-log.log
2022.02.21 04:11:17.028151 [ 336 ] {723d25b5-df4f-4cad-8a12-0a9a0333874a} <Error> executeQuery: Code: 241. DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 4225558 bytes), maximum: 9.31 GiB. (MEMORY_LIMIT_EXCEEDED) (version 22.3.1.1) (from [::1]:33884) (in query: INSERT INTO bad_partitions SELECT * FROM numbers(10000)), Stack trace (when copying this message, always include the lines below):

# right-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
...
2022.02.21 04:11:19.165839 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Information> executeQuery: Read 10000 rows, 78.13 KiB in 3.010562545 sec., 3321 rows/sec., 25.95 KiB/sec.

@KochetovNicolai
Member

Maybe, hm, we should enable these delayed streams only for the s3 disk? I don't like this new setting: it exposes an implementation detail to the user.

@KochetovNicolai KochetovNicolai self-assigned this Feb 21, 2022
@azat
Member Author

azat commented Feb 21, 2022

Maybe, hm, we should enable these delayed streams only for the s3 disk?

I thought about this, but even for s3 you should not defer too many parts.

I don't like this new setting: it exposes an implementation detail to the user.

Agree, me neither.

@KochetovNicolai
Member

Well, my estimate is that deferred insertions could cost about 2x memory in the worst case, which is not so bad. And I really don't like adding a new setting.
Probably, we can increase the memory limit for the failing test?

@azat
Member Author

azat commented Feb 22, 2022

Well, my estimate is that deferred insertions could cost about 2x memory in the worst case, which is not so bad. And I really don't like adding a new setting.

It is not just 2x memory; it depends on the number of columns (streams) and the number of partitions in the insert block.

Probably, we can increase the memory limit for the failing test?

The difference can be too large; I don't think it is a good idea to introduce such a regression.
That particular query from the perf tests (INSERT INTO bad_partitions) needs only 118 MB, but now even 10 GB is not enough:

# right-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
2022.02.21 04:11:19.165839 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Information> executeQuery: Read 10000 rows, 78.13 KiB in 3.010562545 sec., 3321 rows/sec., 25.95 KiB/sec.
2022.02.21 04:11:19.165899 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> MemoryTracker: Peak memory usage (for query): 118.36 MiB.
2022.02.21 04:11:19.165943 [ 333 ] {} <Debug> MemoryTracker: Peak memory usage (for query): 118.36 MiB.
2022.02.21 04:11:19.165951 [ 333 ] {} <Debug> TCPHandler: Processed in 3.010847653 sec.

# left-server-log.log
2022.02.21 04:11:16.155247 [ 333 ] {f7dbcbe5-f910-4a6b-91fe-f00a5ee94eb5} <Debug> executeQuery: (from [::1]:48010) INSERT INTO bad_partitions SELECT * FROM numbers(10000)
2022.02.21 04:11:17.028239 [ 336 ] {723d25b5-df4f-4cad-8a12-0a9a0333874a} <Error> TCPHandler: Code: 241. DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 4225558 bytes), maximum: 9.31 GiB. (MEMORY_LIMIT_EXCEEDED), Stack trace (when copying this message, always include the lines below):

And I'm pretty sure that there are users that INSERT lots of partitions in one insert... (and they will come with their issues)

I don't like this new setting - it shows an implementation details to user.

I'm not a fan of introducing a new setting for everything.
Although it may be useful (if perhaps too internal) to let the user configure this themselves (since I doubt that even for S3 it is a good idea to defer all streams).

I can propagate information about the writer to the MergeTreeSink and do the delaying only for it, with a static batch of 100 parts (the same as the thread pool size for s3, even though these are slightly different things); in the future this can be extended to give benefits for:

  • multi-disk configuration
  • fsync_after_insert

So what are your final thoughts on this?
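The batching idea discussed above (delay streams for parallel write, but flush everything once a threshold of delayed streams is exceeded) can be sketched as follows. All names and types here are hypothetical stand-ins, not the actual MergeTreeSink code:

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// One finished part whose final write has been postponed.
struct DelayedPart
{
    std::size_t streams;            // one open stream per column of the part
    std::function<void()> finalize; // performs the final part write
};

// Accumulates delayed parts; flushes the whole batch when the number of
// delayed streams exceeds the configured threshold.
class DelayedFlushQueue
{
public:
    explicit DelayedFlushQueue(std::size_t max_delayed_streams)
        : max_delayed_streams_(max_delayed_streams) {}

    void add(DelayedPart part)
    {
        delayed_streams_ += part.streams;
        parts_.push_back(std::move(part));
        // 0 means "never delay" (the pre-patch behaviour for ordinary disks):
        // flush immediately. Otherwise flush once the threshold is exceeded.
        if (max_delayed_streams_ == 0 || delayed_streams_ > max_delayed_streams_)
            flush();
    }

    void flush()
    {
        for (auto & part : parts_)
            part.finalize();
        parts_.clear();
        delayed_streams_ = 0;
    }

    std::size_t delayedStreams() const { return delayed_streams_; }

private:
    std::size_t max_delayed_streams_;
    std::size_t delayed_streams_ = 0;
    std::vector<DelayedPart> parts_;
};
```

With the threshold at 1000 (the eventual default for parallel-write disks), memory stays bounded while still letting batches of parts be written in parallel.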

@KochetovNicolai
Member

The table from the perftest is a little bit extreme. Indeed, even 10GB is not enough.
I suppose it is because the default buffer size is ~1M, even if we write only a single value there.

Still, this use case is not impossible. I think I'm almost ok with a new setting. It would be better to solve it some other way, but I can't find anything better.

However, setting max_insert_delayed_streams=1 eliminates all of the performance gain for S3 parallel inserts. I think we can use something like 1000 by default. And, maybe, use it only in the case of DiskS3.

@azat azat force-pushed the mt-delayed-part-flush branch from 1e01c53 to e220d29 Compare February 27, 2022 09:44
@azat
Member Author

azat commented Feb 27, 2022

I think we can use something like 1000 by default. And, maybe, use it only in the case of DiskS3.

Done.

However, setting max_insert_delayed_streams=1 eliminates all of the performance gain for S3 parallel inserts

By the way, do you have any numbers on how much parallel writes help?

@azat azat force-pushed the mt-delayed-part-flush branch from e220d29 to 308a782 Compare February 27, 2022 09:54
@KochetovNicolai
Member

Sorry for not answering for a long time.
Still, I think it's better to set max_insert_delayed_streams to 1000 by default, but enable it only for disks which support parallel write (so for an ordinary disk it is always disabled).

Maybe we should rename it to max_insert_delayed_streams_for_parallel_writes or something similar.

@azat
Member Author

azat commented Mar 7, 2022

Still, I think it's better to set max_insert_delayed_streams to 1000 by default, but enable it only for disks which support parallel write (so for an ordinary disk it is always disabled).

It is 1000 by default for S3 in the current version of the patch:
https://github.com/ClickHouse/ClickHouse/pull/34780/files#diff-f1ce98e0ed65fef7539d3eb0f1dc809b184353e05e32ac9ac63f17030a4e6c79R63

Maybe we should rename it to max_insert_delayed_streams_for_parallel_writes or something similar

Ok.

@azat azat marked this pull request as draft March 7, 2022 11:28
@azat azat marked this pull request as ready for review March 8, 2022 05:12
@azat azat force-pushed the mt-delayed-part-flush branch from 9208e96 to e17a8c0 Compare March 8, 2022 05:13
@azat
Member Author

azat commented Mar 8, 2022

Done (also, I had forgotten about ReplicatedMergeTree before; the new version of this patch set addresses it).

azat added 2 commits March 8, 2022 22:17
For async s3 writes final part flushing was deferred until the whole INSERT
block was processed, however in case of too many partitions/columns you
may exceed the max_memory_usage limit (since each stream has overhead).

Introduce max_insert_delayed_streams_for_parallel_writes (with a default
of 1000 for S3, 0 otherwise) to avoid this.

This should fix "Memory limit exceeded" errors in performance tests.

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the mt-delayed-part-flush branch from e17a8c0 to 3a5a39a Compare March 8, 2022 19:18
@azat
Member Author

azat commented Mar 8, 2022

Stateless tests flaky check (address, actions) — Timeout, fail: 0, passed: 0

Added the no-parallel tag for the test.

@azat
Member Author

azat commented Mar 9, 2022

@azat
Member Author

azat commented Mar 16, 2022

@KochetovNicolai this looks ready, can you take a look please?

@KochetovNicolai KochetovNicolai merged commit ee9c2ec into ClickHouse:master Mar 17, 2022
@azat azat deleted the mt-delayed-part-flush branch March 17, 2022 14:45

/// Whether this disk supports parallel write.
/// Overridden in remote fs disks.
virtual bool supportParallelWrite() const { return false; }
Member

btw, looks like this never worked, because remote disks are wrapped into DiskDecorator (for DiskRestartProxy and DiskCacheWrapper, the old cache version which is still turned on by default), and for DiskDecorator this method was not overridden: #38792
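A minimal illustration of the pitfall described above (class names here are hypothetical stand-ins for IDisk/DiskDecorator, not the actual ClickHouse classes): a decorator that does not override the virtual predicate silently reports the base-class default instead of the wrapped disk's answer.

```cpp
#include <memory>
#include <utility>

struct IDisk
{
    virtual ~IDisk() = default;
    /// Whether this disk supports parallel write; overridden in remote fs disks.
    virtual bool supportParallelWrite() const { return false; }
};

struct RemoteDisk : IDisk
{
    bool supportParallelWrite() const override { return true; }
};

// Buggy decorator: forgets to forward the virtual call, so a wrapped
// RemoteDisk reports false and delayed streams never get enabled.
struct BuggyDecorator : IDisk
{
    std::shared_ptr<IDisk> delegate;
    explicit BuggyDecorator(std::shared_ptr<IDisk> d) : delegate(std::move(d)) {}
};

// Fixed decorator (the shape of the #38792 fix): forward to the wrapped disk.
struct FixedDecorator : IDisk
{
    std::shared_ptr<IDisk> delegate;
    explicit FixedDecorator(std::shared_ptr<IDisk> d) : delegate(std::move(d)) {}
    bool supportParallelWrite() const override { return delegate->supportParallelWrite(); }
};
```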

Member

@azat the test added in this PR does not work after the issue I described above is fixed (see the #38792 checks). If this test does not actually check anything now, I will remove it for now; do you agree?

Member Author

@azat azat Jul 5, 2022

The test uses an explicit max_insert_delayed_streams_for_parallel_write, so it works correctly.

As for the failure in #38792: you can set max_insert_delayed_streams_for_parallel_write=0 explicitly for the INSERT without this option, and then it will not matter whether the disk is remote or not.

Also, I would recommend adding another test to ensure that S3 does use parallel writes (via MEMORY_LIMIT_EXCEEDED during INSERT, or thread_ids from query_log).

@LGDHuaOPER

At https://clickhouse.com/docs/en/integrations/s3/s3-merge-tree there is the following description:

Writes are performed in parallel, with a maximum of 100 concurrent file writing threads. max_insert_delayed_streams_for_parallel_write, which has a default value of 1000, controls the number of S3 blobs written in parallel. Since a buffer is required for each file being written (~1MB), this effectively limits the memory consumption of an INSERT. It may be appropriate to lower this value in low server memory scenarios.

I have a question: does this indicate that, under the default configuration, using an S3 table or MergeTree with an s3-disk policy, the server requires at least 100 GiB of memory (100 * 1000 * 1 MB)?

Where can I modify the configuration of 100 concurrent file writing threads?

@azat
Member Author

azat commented Aug 8, 2022

I have a question, whether this indicates that under the default configuration, use S3 Table or MergeTree with s3-disk policy, the server requires at least 100 GIB memory(100 * 1000 * 1MB)?

No, it is ~1GiB (1000*1MiB)
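The arithmetic behind this answer, as a small compile-time sketch (the ~1 MiB buffer size comes from the documentation quoted above): the 100-thread pool only drains already-buffered streams, so it does not multiply the memory bound.

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1ull << 20;
constexpr std::uint64_t GiB = 1ull << 30;

// Memory is bounded by the number of delayed streams alone (~1 MiB buffer
// per stream); the size of the writing thread pool is irrelevant here.
constexpr std::uint64_t insert_buffer_bound(std::uint64_t max_delayed_streams)
{
    return max_delayed_streams * MiB;
}

// With the default of 1000 delayed streams the bound is ~1 GiB, not 100 GiB.
static_assert(insert_buffer_bound(1000) == 1000 * MiB, "bound is streams * buffer");
static_assert(insert_buffer_bound(1000) < GiB, "~1 GiB, under one binary GiB");
```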

Where can I modify the configuration of 100 concurrent file writing threads?

You can adjust the max_insert_delayed_streams_for_parallel_write setting, but that is not how many threads S3 writes will use.
You cannot modify the size of the thread pool itself; it is a static constant in the code:

constexpr size_t pool_size = 100;

But if you really need this, and have good numbers to prove it, you can submit a patch/PR.

@LGDHuaOPER

Thank you for your answer! My real intention is to find the real cause of memory jitter when I use S3. Please refer to #38839.

I really can't find the reason. My situation is like this: there is no memory problem under normal circumstances. Once it reaches the time to move data to the S3 disk (ssd_s3 policy), or at the same time there are about tens of thousands of rows per second of data insertion (about 1 KB per row), the memory shakes like this. I'm very frustrated.

@LGDHuaOPER

Then I have a little question: where is the setting max_insert_delayed_streams_for_parallel_write configured? Is it in the configuration file users.xml -> profiles -> default? Thanks.

@azat
Member Author

azat commented Aug 9, 2022

I really can't find the reason.

Have you tried to play with this setting?
But it should not eat 40 GiB more memory than without S3, unless you have tons of parallel INSERTs though...

There is no memory problem under normal circumstances. Once it reaches the time to move data to the S3 disk (ssd_s3 policy), or at the same time there are about tens of thousands of rows per second of data insertion (about 1 KB per row)

This setting (and the initial optimization) is related only to INSERT; it is not related to merges/TTL moves, so I don't see how it is related.

Then I have a little question: where is the setting max_insert_delayed_streams_for_parallel_write configured? Is it in the configuration file users.xml -> profiles -> default? Thanks.

Yes, or on a per-query basis.
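A sketch of both options (the file name and the value 500 are illustrative only, following the standard ClickHouse configuration layout):

```xml
<!-- e.g. a file under users.d/, merged into users.xml -->
<clickhouse>
    <profiles>
        <default>
            <!-- lower than the built-in default of 1000 to reduce INSERT memory -->
            <max_insert_delayed_streams_for_parallel_write>500</max_insert_delayed_streams_for_parallel_write>
        </default>
    </profiles>
</clickhouse>
```

Per query, the same setting can be passed inline, e.g. `INSERT INTO t SETTINGS max_insert_delayed_streams_for_parallel_write = 500 VALUES ...`.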

Please refer to #38839

You should do what @kssenii suggested.


Labels

pr-bugfix Pull request with bugfix, not backported by default
