Skip to content

Fix tests that relies on checking stack size under TSan#30579

Merged
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:tsan-recursion
Oct 24, 2021
Merged

Fix tests that relies on checking stack size under TSan#30579
alexey-milovidov merged 1 commit intoClickHouse:masterfrom
azat:tsan-recursion

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Oct 23, 2021

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Under TSan using too much stack requires too much RSS for shadow memory,
and neither of this TSAN_OPTIONS helps:

  • history_size=2
  • flush_memory_ms=2000
  • memory_limit_mb=50000

So instead, change allowed limit of the stack size in checkStackSize(),
to address exessive memory usage for the server compiled with TSan.

Note, that before this patch 01763_max_distributed_depth test can
increase RSS of the server to 70GiB.

Supersedes #30009

NOTE: marked as draft for now, since I want to check logs of TSan server, but it should be OK with this patch

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Oct 23, 2021
@azat azat marked this pull request as draft October 23, 2021 10:59
Under TSan using too much stack requires too much RSS for shadow memory,
and neither of this TSAN_OPTIONS helps:
- history_size=2
- flush_memory_ms=2000
- memory_limit_mb=50000

So instead, decrease allowed limit of the stack size in checkStackSize()
under TSan, to address exessive memory usage for the server compiled
with TSan.

Note, that before this patch 01763_max_distributed_depth test can
increase RSS of the server to 70GiB.
@azat azat marked this pull request as ready for review October 23, 2021 14:36
@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 23, 2021

Functional stateless tests (thread) — fail: 0, passed: 3466, skipped: 27

Almost OK, but still uses more memory then other builds, although now memory usage is enough to pass tests on CI:

$ pigz -cd clickhouse-server.log.gz | fgrep -a 'MemoryTracker: Current memory usage (total)' | tail -n1
2021.10.23 09:26:17.001575 [ 624 ] {} <Debug> MemoryTracker: Current memory usage (total): 56.09 GiB.

@azat
Copy link
Copy Markdown
Member Author

azat commented Oct 23, 2021

Functional stateless tests (debug) — fail: 1, passed: 3473, skipped: 19

All checks was almost green, but found one flaky test - 00613_shard_distributed_max_execution_time, the problem is that capturing stacktrace (for memory sampling here in particular) slow in debug builds, and increasing max_execution_time a little bit will not helps:

Details
2021.10.23 13:24:47.301169 [ 18120 ] {1affe981-7455-4959-babb-350ab7b19a42} <Trace> Connection (127.0.0.2:9000): Connecting. Database: (not specified). User: default
...
2021.10.23 13:24:53.430160 [ 18120 ] {1affe981-7455-4959-babb-350ab7b19a42} <Trace> Connection (127.0.0.2:9000): Connected to ClickHouse server version 21.11.1.
...
2021.10.23 13:25:12.294665 [ 18120 ] {1affe981-7455-4959-babb-350ab7b19a42} <Debug> MemoryTracker: Peak memory usage (for query): 6.85 MiB.
WITH arrayStringConcat(arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n') AS sym
SELECT
    sym LIKE '%unwind%' AS key,
    count()
FROM system.trace_log
WHERE query_id = '1affe981-7455-4959-babb-350ab7b19a42'
GROUP BY key
ORDER BY count() ASC
SETTINGS allow_introspection_functions = 1

Query id: 8b2a21a6-a28a-4365-9975-55b1f7c957ca

Row 1:
──────
key:     1
count(): 45

Row 2:
──────
key:     0
count(): 72

Example:

sym:     libunwind::CFI_Parser<libunwind::LocalAddressSpace>::parseCIE(libunwind::LocalAddressSpace&, unsigned long, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*)
libunwind::CFI_Parser<libunwind::LocalAddressSpace>::findFDE(libunwind::LocalAddressSpace&, unsigned long, unsigned long, unsigned int, unsigned long, libunwind::CFI_Parser<libunwind
::LocalAddressSpace>::FDE_Info*, libunwind::CFI_Parser<libunwind::LocalAddressSpace>::CIE_Info*)
libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getInfoFromDwarfSection(unsigned long, libunwind::UnwindInfoSections const&, unsigned int)
libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::setInfoBasedOnIPRegister(bool)
libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step()
__unw_step
unw_backtrace
StackTrace::tryCapture()
StackTrace::StackTrace()
MemoryTracker::allocImpl(long, bool)

Let's see at the rate of failures... (for the past month it fails only 3 times)

@alexey-milovidov alexey-milovidov self-assigned this Oct 23, 2021
@alexey-milovidov alexey-milovidov merged commit 830b832 into ClickHouse:master Oct 24, 2021
@azat azat deleted the tsan-recursion branch October 24, 2021 06:00
@azat azat mentioned this pull request Nov 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants