Skip to content

RFC: Add MemoryTrackerFaultInjectorInThread#43337

Closed
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:MemoryTrackerFaultInjectorInThread
Closed

RFC: Add MemoryTrackerFaultInjectorInThread#43337
azat wants to merge 1 commit intoClickHouse:masterfrom
azat:MemoryTrackerFaultInjectorInThread

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Nov 17, 2022

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

This is an injector that can catch some problems (maybe it worth add it to some major places and enable under some setting?).

@azat azat force-pushed the MemoryTrackerFaultInjectorInThread branch from ff915fe to faa382b Compare November 17, 2022 18:23
@azat
Copy link
Copy Markdown
Member Author

azat commented Nov 18, 2022

Failures (well, of course they don't, since this code is not even used) are not related:

Stress test (asan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

2022.11.18 11:33:36.121644 [ 277701 ] {} <Error> test_7.table_for_bad_alters::all_0_0_0_1(MutateFromLogEntryTask): virtual bool DB::ReplicatedMergeMutateTaskBase::executeStep(): Code: 6. DB::Exception: Cannot parse string 'Hello' as UInt32: syntax error at begin of string. Note: there are toUInt32OrZero and toUInt32OrNull functions, which returns zero/NULL instead of throwing exception.: while executing 'FUNCTION _CAST(value2 :: 2, 'UInt32' :: 3) -> _CAST(value2, 'UInt32') UInt32 : 4': (while reading from part /var/lib/clickhouse/disks/s3/data/test_7/table_for_bad_alters/all_0_0_0/): While executing MergeTreeInOrder. (CANNOT_PARSE_TEXT), Stack trace (when copying this message, always include the lines below):

Stress test (debug) — Backward compatibility check: OOM killer (or signal 9) in clickhouse-server.log

Real OOM

Stress test (msan) — OOM killer (or signal 9) in clickhouse-server.log

Looks like slow parts removal

Stress test (tsan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

2022.11.18 08:51:58.345562 [ 286783 ] {} <Error> test_11.alter_table1 (b8c9b9c3-b14d-4c5a-8326-92a1e933416b): while loading part -1_10_10_0_134 on path store/b8c/b8c9b9c3-b14d-4c5a-8326-92a1e933416b/-1_10_10_0_134: Code: 499. DB::Exception: The specified key does not exist.: while reading key: test/xmd/buoviyeedbglwwobgbsuinrueujcw, from bucket: test. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
2022.11.18 08:51:58.345706 [ 286943 ] {} <Error> test_11.alter_table0 (d26a04af-2f03-474e-b61d-f9a0a472f134): while loading part -1_0_0_0_134 on path store/d26/d26a04af-2f03-474e-b61d-f9a0a472f134/-1_0_0_0_134: Code: 499. DB::Exception: The specified key does not exist.: while reading key: test/slk/ebvugmudmwzfpqtoznyhzrratkmil, from bucket: test. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
2022.11.18 08:51:58.346277 [ 286731 ] {} <Error> test_11.alter_table0 (d26a04af-2f03-474e-b61d-f9a0a472f134): while loading part -1_10_10_0_134 on path store/d26/d26a04af-2f03-474e-b61d-f9a0a472f134/-1_10_10_0_134: Code: 499. DB::Exception: The specified key does not exist.: while reading key: test/xmd/buoviyeedbglwwobgbsuinrueujcw, from bucket: test. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
2022.11.18 08:51:58.349479 [ 286752 ] {} <Error> test_11.alter_table1 (b8c9b9c3-b14d-4c5a-8326-92a1e933416b): while loading part -1_0_0_0_134 on path store/b8c/b8c9b9c3-b14d-4c5a-8326-92a1e933416b/-1_0_0_0_134: Code: 499. DB::Exception: The specified key does not exist.: while reading key: test/slk/ebvugmudmwzfpqtoznyhzrratkmil, from bucket: test. (S3_ERROR), Stack trace (when copying this message, always include the lines below):
2022.11.18 08:51:58.420682 [ 286783 ] {} <Error> test_11.alter_table1 (b8c9b9c3-b14d-4c5a-8326-92a1e933416b): Detaching broken part /var/lib/clickhouse/disks/s3/store/b8c/b8c9b9c3-b14d-4c5a-8326-92a1e933416b/-1_10_10_0_134 (size: 93.83 KiB). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually
2022.11.18 08:51:58.449179 [ 286752 ] {} <Error> test_11.alter_table1 (b8c9b9c3-b14d-4c5a-8326-92a1e933416b): Detaching broken part /var/lib/clickhouse/disks/s3/store/b8c/b8c9b9c3-b14d-4c5a-8326-92a1e933416b/-1_0_0_0_134 (size: 90.85 KiB). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually
2022.11.18 08:51:58.455301 [ 286731 ] {} <Error> test_11.alter_table0 (d26a04af-2f03-474e-b61d-f9a0a472f134): Detaching broken part /var/lib/clickhouse/disks/s3/store/d26/d26a04af-2f03-474e-b61d-f9a0a472f134/-1_10_10_0_134 (size: 93.83 KiB). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually
2022.11.18 08:51:58.531636 [ 286943 ] {} <Error> test_11.alter_table0 (d26a04af-2f03-474e-b61d-f9a0a472f134): Detaching broken part /var/lib/clickhouse/disks/s3/store/d26/d26a04af-2f03-474e-b61d-f9a0a472f134/-1_0_0_0_134 (size: 90.85 KiB). If it happened after update, it is likely because of backward incompatibility. You need to resolve this manually

Stress test (ubsan) — Backward compatibility check: Error message in clickhouse-server.log (see bc_check_error_messages.txt)

2022.11.18 11:33:17.133756 [ 1360451 ] {} <Error> DiskObjectStorageTransaction: An error occurred while executing transaction's operation #0 (RemoveManyObjectStorageOperation (paths size: 11, keep all batch false, files to keep )): Code: 107. DB::ErrnoException: Cannot open file /var/lib/clickhouse/disks/s3/data/system/session_log/delete_tmp_202211_119_119_0/partition.dat, errno: 2, strerror: No such file or directory. (FILE_DOESNT_EXIST), Stack trace (when copying this message, always include the lines below):
2022.11.18 11:33:17.133859 [ 1356112 ] {} <Error> DiskObjectStorageTransaction: An error occurred while executing transaction's operation #0 (RemoveManyObjectStorageOperation (paths size: 11, keep all batch false, files to keep )): Code: 107. DB::ErrnoException: Cannot open file /var/lib/clickhouse/disks/s3/data/system/session_log/delete_tmp_202211_121_121_0/partition.dat, errno: 2, strerror: No such file or directory. (FILE_DOESNT_EXIST), Stack trace (when copying this message, always include the lines below):
...


std::bernoulli_distribution fault(fault_probability);
if (unlikely(fault_probability > 0.0 && fault(thread_local_rng)))
if (unlikely(fault_probability > 0.0 && fault(thread_local_rng) || MemoryTrackerFaultInjectorInThread::faulty()))
Copy link
Copy Markdown
Member

@serxa serxa Nov 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small optimization. I guess it should also be inside unlikely

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is already inside unlikely.

But TBH, I would consider this an optimization for reader, in other words to make it more obvious what is the regular code path for reader, not for processor, since it is not that hot place.

@serxa serxa self-assigned this Nov 21, 2022
Copy link
Copy Markdown
Member

@serxa serxa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's usefulness is not clear w/o that setting you are talking about. Maybe add at least one good place and a setting itself to this PR?

@azat
Copy link
Copy Markdown
Member Author

azat commented Nov 21, 2022

Just a few words about this patch, it had been submitted only to "document" what I was referenced in another PR, posted with an RFC tag, maybe needed something similar to test other peaces of code.

And thinking about it a little bit more, and I don't think that it is interesting in it's current form, so I'm going to close it.

It's usefulness is not clear w/o that setting you are talking about. Maybe add at least one good place and a setting itself to this PR?

Something like memory_tracker_major_places_fault_probability, though I don't like this naming.

But anyway I don't think that this is interesting in this way, since first you need to add it into some places, while I don't think that it is a good idea in general

What can work, is patching the code on fly.

@azat azat closed this Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants