Reduce memory usage for TSan build by flushing shadow memory #30009
azat wants to merge 11 commits into ClickHouse:master
Periodically, builds with TSan (stress/stateless/...) eventually fail,
with lots of MEMORY_LIMIT_EXCEEDED errors beforehand.
Usually it fails when it is executed on machines with 128GiB of RAM:
The one that fails [1]:
2021.10.11 15:43:19.344695 [ 385 ] {} <Information> Application: Setting max_server_memory_usage was set to 113.29 GiB (125.88 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)
[1]: https://clickhouse-test-reports.s3.yandex.net/29979/063f9cffabf0365a21787aea7c70b8f8da3397c8/functional_stateless_tests_(thread)/clickhouse-server.log.gz
The one that does not fail [2]:
2021.10.07 15:32:04.665542 [ 385 ] {} <Information> Application: Setting max_server_memory_usage was set to 169.99 GiB (188.88 GiB available * 0.90 max_server_memory_usage_to_ram_ratio)
[2]: https://clickhouse-test-reports.s3.yandex.net/0/78e1db209f5527479a1947a2c3c441b56e00617e/functional_stateless_tests_(thread)/clickhouse-server.log.gz
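The limit in both log lines is simply the available RAM multiplied by max_server_memory_usage_to_ram_ratio. A minimal sketch of that arithmetic, using the values from the two log lines above (the helper name is made up for illustration and is not ClickHouse code):

```python
def max_server_memory_usage_gib(available_gib: float, ratio: float = 0.90) -> float:
    """Memory limit derived from available RAM (hypothetical helper)."""
    return available_gib * ratio


# Values taken from the two log lines above.
failing = max_server_memory_usage_gib(125.88)  # host where the run fails
passing = max_server_memory_usage_gib(188.88)  # host where the run does not fail
print(f"failing host limit: {failing:.2f} GiB, passing host limit: {passing:.2f} GiB")
```

The failing host simply has about 57 GiB less headroom before the limit is reached.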
But if you look at the logs, you will see that memory usage never goes down:
<details>
$ pigz -cd clickhouse-server.fail.log.gz | fgrep -a 'MemoryTracker: Current memory usage (total)' | head
2021.10.11 15:43:20.152412 [ 385 ] {} <Debug> MemoryTracker: Current memory usage (total): 2.35 GiB.
2021.10.11 15:43:26.000747 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 3.11 GiB.
2021.10.11 15:43:28.000581 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 4.08 GiB.
2021.10.11 15:43:31.000471 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 5.01 GiB.
2021.10.11 15:43:37.482854 [ 708 ] {78cfe719-57e4-461d-b9fc-a4f891154708} <Debug> MemoryTracker: Current memory usage (total): 6.00 GiB.
2021.10.11 15:43:46.179875 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 7.41 GiB.
2021.10.11 15:44:17.005561 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 8.03 GiB.
2021.10.11 15:44:47.000829 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 9.11 GiB.
2021.10.11 15:45:07.490627 [ 1137 ] {8c429bf1-a816-46f4-9386-2cf9442c3f2e} <Debug> MemoryTracker: Current memory usage (total): 10.22 GiB.
2021.10.11 15:45:10.000923 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 11.13 GiB.
$ pigz -cd clickhouse-server.fail.log.gz | fgrep -a 'MemoryTracker: Current memory usage (total)' | tail
2021.10.11 15:50:06.000778 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 104.05 GiB.
2021.10.11 15:50:08.000937 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 105.14 GiB.
2021.10.11 15:50:23.277294 [ 2228 ] {e8b81a58-c6ec-4d37-8172-80b1eafc5b49} <Debug> MemoryTracker: Current memory usage (total): 106.80 GiB.
2021.10.11 15:50:26.000726 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 107.37 GiB.
2021.10.11 15:50:28.000773 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 108.47 GiB.
2021.10.11 15:50:29.001086 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 109.28 GiB.
2021.10.11 15:50:31.000829 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 110.43 GiB.
2021.10.11 15:53:23.000696 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 111.15 GiB.
2021.10.11 15:53:25.000653 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 112.10 GiB.
2021.10.11 15:53:28.000820 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 113.10 GiB.
</details>
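The same check can be scripted instead of eyeballing head/tail output. The following is a rough sketch, assuming the log line layout shown above and handling only values reported in GiB; the function names are illustrative, not part of any ClickHouse tooling:

```python
import re
from datetime import datetime

# Matches lines such as:
# 2021.10.11 15:43:20.152412 [ 385 ] {} <Debug> MemoryTracker: Current memory usage (total): 2.35 GiB.
LINE_RE = re.compile(
    r"^(\d{4}\.\d{2}\.\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"MemoryTracker: Current memory usage \(total\): ([\d.]+) GiB\."
)


def parse_usage(lines):
    """Return (timestamp, GiB) pairs for every matching MemoryTracker line."""
    samples = []
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y.%m.%d %H:%M:%S.%f")
            samples.append((ts, float(match.group(2))))
    return samples


def never_decreases(samples):
    """True if the reported usage never drops between consecutive samples."""
    return all(later[1] >= earlier[1] for earlier, later in zip(samples, samples[1:]))


# First and last lines of the excerpt above.
SAMPLE = [
    "2021.10.11 15:43:20.152412 [ 385 ] {} <Debug> MemoryTracker: Current memory usage (total): 2.35 GiB.",
    "2021.10.11 15:53:28.000820 [ 619 ] {} <Debug> MemoryTracker: Current memory usage (total): 113.10 GiB.",
]

samples = parse_usage(SAMPLE)
print(never_decreases(samples))
```

Feeding it the whole decompressed log (one line per list element) would show whether usage ever drops during the run.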
And of course it eventually fails (TSan is the only sanitizer that has
such a memory usage pattern).
After setting flush_memory_ms=2000 and running a few tests, it seems that
memory does not grow like that anymore; let's see.
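For reference, flush_memory_ms is a TSan runtime flag passed through the TSAN_OPTIONS environment variable. Below is a small sketch of adding it to an existing option string without clobbering other flags. It assumes the options are whitespace-separated, and the helper name and the example flag values are illustrative only:

```python
import os


def with_tsan_flag(current: str, name: str, value: str) -> str:
    """Return a TSAN_OPTIONS string with name=value set, replacing any earlier value.

    Assumes whitespace-separated options (hypothetical helper).
    """
    kept = [opt for opt in current.split() if not opt.startswith(name + "=")]
    kept.append(f"{name}={value}")
    return " ".join(kept)


# Build the environment for a TSan-instrumented process.
env = dict(os.environ)
env["TSAN_OPTIONS"] = with_tsan_flag(env.get("TSAN_OPTIONS", ""), "flush_memory_ms", "2000")
print(env["TSAN_OPTIONS"])
```

The resulting `env` can then be passed to `subprocess.run(..., env=env)` when launching the instrumented binary.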
Can someone add |
Not a server issue. This one is more interesting: the server had been sanitized for ~80 min and it used only 26 GiB of RSS, compared with previous runs:
The check itself failed because 30 seconds wasn't enough for executing final
This timeout was introduced in #29992
Did not help:
CI report [1]:
2021.10.15 04:55:39.000770 [ 634 ] {} <Debug> MemoryTracker: Current memory usage (total): 107.34 GiB.
2021.10.15 04:58:28.805320 [ 392 ] {} <Fatal> Application: Child process was terminated by signal 9 (KILL). If it is not done by 'forcestop' command or manually, the possible cause is OOM Killer (see 'dmesg' and look at the '/var/log/kern.log' for the details).
[1]: https://clickhouse-test-reports.s3.yandex.net/30009/9379506fdd2f0a572156d7b4f9a637a783e8b87d/functional_stateless_tests_(thread).html#fail1
P.S. Another option is to decrease the number of server threads?
Reverts: b8a7d78 ("Tracking maximum amount of history in TSan")
This cannot be configured in ASan:
const uptr kAllocatorSize = 0x40000000000ULL; // 4T.
static const uptr kSpaceSize = kAllocatorSize;
static const uptr kNumClassesRounded = 64;
static const uptr kRegionSize = kSpaceSize / kNumClassesRounded; // == 64GiB
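The 64 GiB figure in the last comment follows directly from the constants quoted above. A quick check of the arithmetic (the Python names merely mirror the C++ constants):

```python
GIB = 1 << 30

K_ALLOCATOR_SIZE = 0x40000000000  # 4 TiB, as in the quoted kAllocatorSize
K_NUM_CLASSES_ROUNDED = 64        # as in the quoted kNumClassesRounded

# kRegionSize = kSpaceSize / kNumClassesRounded, with kSpaceSize == kAllocatorSize
region_size = K_ALLOCATOR_SIZE // K_NUM_CLASSES_ROUNDED
print(region_size // GIB, "GiB per region")
```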
Closed in favor of #30579
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Reduce memory usage for TSan build by flushing shadow memory (flush_memory_ms=2000)
Cc: @alexey-milovidov
NOTE: marked as Draft since I want to look at memory usage before merge