Skip to content

Smart handling of processes leftovers in tests#67737

Merged
alexey-milovidov merged 13 commits intoClickHouse:masterfrom
azat:tests-processes-leftovers
Aug 9, 2024
Merged

Smart handling of processes leftovers in tests#67737
alexey-milovidov merged 13 commits intoClickHouse:masterfrom
azat:tests-processes-leftovers

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Aug 3, 2024

Previously processes cleanup on i.e. SIGINT simply did not work, because
the launcher kills only processes in process group, while tests are
launched with start_new_session=True for Popen(), which creates own
process group.

This is needed for killing process group in case of test timeout.

So instead, look at the parent pid, and kill the child process groups.

Also add some logging to make it more informative which processes will be killed.

And now in case of test timeout clickhouse-test will gather stacktraces from clickhouse utilities as well (by sending SIGTSTP to them), this should help with issues like #67736

And also omit python stacktraces in some cases when it is useless and only create noise (server died, clickhouse-test signals, i.e. the most common is SIGINT - Ctrl-C)

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Aug 3, 2024
@robot-clickhouse-ci-2
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-2 commented Aug 3, 2024

This is an automated comment for commit bc2740a with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Push to DockerhubThe check for building and pushing the CI related docker images to docker hub✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@azat azat changed the title tests: smart handling of processes leftovers Smart handling of processes leftovers in tests Aug 3, 2024
@azat azat force-pushed the tests-processes-leftovers branch from ffca4b8 to b4e3bbc Compare August 3, 2024 20:13
@alexey-milovidov alexey-milovidov added can be tested Allows running workflows for external contributors 🍃 green ci 🌿 Fixing flaky tests in CI labels Aug 3, 2024
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 4, 2024

Integration tests (asan, old analyzer) [4/6] — fail: 48, passed: 462

@alexey-milovidov
Copy link
Copy Markdown
Member

@azat Merge with master to include the fixes.

@azat azat force-pushed the tests-processes-leftovers branch from b4e3bbc to 07d19b8 Compare August 4, 2024 21:14
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 4, 2024

Done

@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 5, 2024

Stress test (asan) — check_status.tsv doesn't exists

Uexpected exception, will not retry:  HTTPError :  Code: 503. Code: 439. DB::Exception: Cannot schedule a task: cannot allocate thread (threads=1, jobs=1). (CANNOT_SCHEDULE_TASK) (version 24.8.1.1342)

Integration tests (aarch64) [4/6] — fail: 1, passed: 400

  • test_executable_user_defined_function/test.py::test_executable_function_slow_python

@azat azat force-pushed the tests-processes-leftovers branch from 07d19b8 to e68c9c8 Compare August 5, 2024 06:53
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 5, 2024

Egh, it could kill more then it should:

Killing process group 1
Processes in process group 1:
1 /bin/bash /run.sh
276 ./minio server --address :11111 ./minio_data
4391 gdb -batch -command script.gdb -p 3556
4392 /usr/bin/perl /usr/bin/ts %Y-%m-%d %H:%M:%S
4443 python3 /usr/bin/stress --hung-check --drop-databases --output-folder test_output --skip-func-tests  --global-time-limit 1200
4460 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_1  --client-option join_use_nulls=1 join_algorithm='parallel_hash' memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
4462 python3 /usr/bin/clickhouse-test --order=random --database=test_1 --client-option join_use_nulls=1 join_algorithm=parallel_hash memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
4495 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option join_use_nulls=1 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
4500 python3 /usr/bin/clickhouse-test --order=random --client-option join_use_nulls=1 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200
4530 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
4534 python3 /usr/bin/clickhouse-test --order=random --client-option memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
4550 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_7  --client-option join_use_nulls=1 join_algorithm='grace_hash' group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
4552 python3 /usr/bin/clickhouse-test --order=random --database=test_7 --client-option join_use_nulls=1 join_algorithm=grace_hash group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200
4643 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option implicit_transaction=1 throw_on_unsupported_query_inside_transaction=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
4648 python3 /usr/bin/clickhouse-test --order=random --client-option implicit_transaction=1 throw_on_unsupported_query_inside_transaction=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200
4772 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option join_use_nulls=1 memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
4782 python3 /usr/bin/clickhouse-test --order=random --client-option join_use_nulls=1 memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200
4812 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option optimize_trivial_approximate_count_query=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
4825 python3 /usr/bin/clickhouse-test --order=random --client-option optimize_trivial_approximate_count_query=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
4924 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_21  --client-option join_algorithm='parallel_hash' memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
4947 python3 /usr/bin/clickhouse-test --order=random --database=test_21 --client-option join_algorithm=parallel_hash memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
5033 /bin/sh -c /usr/bin/clickhouse-test --order=random  --client-option http_make_head_request=0 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
5036 python3 /usr/bin/clickhouse-test --order=random --client-option http_make_head_request=0 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
5054 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_25  --client-option join_use_nulls=1 join_algorithm='full_sorting_merge' group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
5060 python3 /usr/bin/clickhouse-test --order=random --database=test_25 --client-option join_use_nulls=1 join_algorithm=full_sorting_merge group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200
5177 /bin/sh -c /usr/bin/clickhouse-test --order=random --db-engine="Replicated('/test/db/test_29', 's1', 'r1')"  --database=test_29  --client-option enable_deflate_qpl_codec=1 enable_zstd_qat_codec=1 join_algorithm='auto' max_rows_in_join=1000 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
5180 python3 /usr/bin/clickhouse-test --order=random --db-engine=Replicated('/test/db/test_29', 's1', 'r1') --database=test_29 --client-option enable_deflate_qpl_codec=1 enable_zstd_qat_codec=1 join_algorithm=auto max_rows_in_join=1000 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200
5261 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_31  --client-option join_use_nulls=1 join_algorithm='parallel_hash' memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
5268 python3 /usr/bin/clickhouse-test --order=random --database=test_31 --client-option join_use_nulls=1 join_algorithm=parallel_hash memory_tracker_fault_probability=0.001 merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=0.05 group_by_use_nulls=1 ignore_drop_queries_probability=0.5 --global_time_limit=1200
5289 /bin/sh -c /usr/bin/clickhouse-test --order=random --db-engine="Replicated('/test/db/test_32', 's1', 'r1')"  --client-option enable_deflate_qpl_codec=1 enable_zstd_qat_codec=1 optimize_trivial_approximate_count_query=1 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200  
5305 python3 /usr/bin/clickhouse-test --order=random --db-engine=Replicated('/test/db/test_32', 's1', 'r1') --client-option enable_deflate_qpl_codec=1 enable_zstd_qat_codec=1 optimize_trivial_approximate_count_query=1 http_make_head_request=0 ignore_drop_queries_probability=0.5 --global_time_limit=1200
5342 /bin/sh -c /usr/bin/clickhouse-test --order=random  --database=test_33  --client-option join_algorithm='partial_merge' group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas='parallel_replicas' parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200  
5349 python3 /usr/bin/clickhouse-test --order=random --database=test_33 --client-option join_algorithm=partial_merge group_by_use_nulls=1 ignore_drop_queries_probability=0.5 allow_experimental_parallel_reading_from_replicas=1 max_parallel_replicas=3 cluster_for_parallel_replicas=parallel_replicas parallel_replicas_for_non_replicated_merge_tree=1 --global_time_limit=1200
Process group 1 should be killed
Traceback (most recent call last):
  File "/usr/bin/clickhouse-test", line 3599, in <module>
    main(args)
  File "/usr/bin/clickhouse-test", line 2957, in main
    total_tests_run += do_run_tests(args.jobs, test_suite)
  File "/usr/bin/clickhouse-test", line 2592, in do_run_tests
    run_tests_array(
  File "/usr/bin/clickhouse-test", line 2302, in run_tests_array
    stop_tests()
  File "/usr/bin/clickhouse-test", line 429, in stop_tests
    cleanup_child_processes(os.getpid())
  File "/usr/bin/clickhouse-test", line 417, in cleanup_child_processes
    child_pgid = os.getpgid(child)
ProcessLookupError: [Errno 3] No such process

Hence CI fails

Should be fixed by 9658be5

@azat azat force-pushed the tests-processes-leftovers branch from e68c9c8 to 9658be5 Compare August 5, 2024 17:58
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 6, 2024

Stress test (ubsan) — Killed by signal (output files)

Has fatal logs from client:

########## Short fault info ############
(version 24.8.1.1501, build id: , git hash: ) (from thread 53132) Received signal 20
Signal description: Stopped
This is a signal used for debugging purposes by the user.
Stack trace: 0x00005586dbdd07ed 0x00005586dc201c38 0x00007f0f92e6d520 0x00005586dbd71c15 0x00005586dbd6d441 0x00005586dbd6ec0c 0x00005586dbd6d1c3 0x00005586dc0fa0c2 0x00005586dc0f43e8 0x00005586ee1d9b93 0x00005586dc10cdb1 0x00005586cf4dfbfe 0x00007f0f92e54d90 0x00007f0f92e54e40 0x00005586cf4c702e
########################################
(version 24.8.1.1501, build id: , git hash: ) (from thread 53132) (no query) Received signal Stopped (20)
This is a signal used for debugging purposes by the user.
Stack trace: 0x00005586dbdd07ed 0x00005586dc201c38 0x00007f0f92e6d520 0x00005586dbd71c15 0x00005586dbd6d441 0x00005586dbd6ec0c 0x00005586dbd6d1c3 0x00005586dc0fa0c2 0x00005586dc0f43e8 0x00005586ee1d9b93 0x00005586dc10cdb1 0x00005586cf4dfbfe 0x00007f0f92e54d90 0x00007f0f92e54e40 0x00005586cf4c702e

azat added 9 commits August 6, 2024 16:39
Previously processes cleanup on i.e. SIGINT simply did not work, because
the launcher kills only processes in process group, while tests are
launched with start_new_session=True for Popen(), which creates own
process group.

This is needed for killing process group in case of test timeout.

So instead, look at the parent pid, and kill the child process groups.

Also add some logging to make it more explicit which processes will be
killed.

Signed-off-by: Azat Khuzhin <[email protected]>
It is simply useless and only create output that only distracts.

Signed-off-by: Azat Khuzhin <[email protected]>
This is to catch issues like [1].

  [1]: ClickHouse#67736

Signed-off-by: Azat Khuzhin <[email protected]>
It was simply wrong before, but now, with capturing stacktrace that can
take sometime it is a must.

Signed-off-by: Azat Khuzhin <[email protected]>
Previously it was possible to have original pgid from the spawned
threads, that could lead to killing the caller script and in case of CI
it could be init process [1].

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/67737/e68c9c8d16f37f6c25739076c9b071ed97952269/stress_test__asan_/stress_test_run_21.txt

Repro:

    $ echo "SELECT '1" > tests/queries/0_stateless/00001_select_1.sql # break the test
    $ cat /tmp/test.sh
    ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings

Before this change:

    $ /tmp/test.sh
    Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory
    Connecting to ClickHouse server... OK
    Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2
    Running 3 stateless tests (MainProcess).
    00001_select_1:                                                         [ FAIL ]
    Reason: return code:  62
    Code: 62. DB::Exception: Syntax error: failed at position 8 (''1;
    '): '1;
    . Single quoted string is not closed: ''1;
    '. (SYNTAX_ERROR)

    , result:

    stdout:

    Database: test_hz2zwymr
    Child processes of 13041:
    13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
    Killing process group 13040
    Processes in process group 13040:
    13040 -bash
    13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings

    [2]+  Stopped                 /tmp/test.sh
    [1]$ Process group 13040 should be killed
    Max failures chain

    [2]+  Killed                  /tmp/test.sh

After:

    $ /tmp/test.sh
    Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory
    Connecting to ClickHouse server... OK
    Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2
    Running 3 stateless tests (MainProcess).
    00001_select_1:                                                         [ FAIL ]
    Reason: return code:  62
    Code: 62. DB::Exception: Syntax error: failed at position 8 (''1;
    '): '1;
    . Single quoted string is not closed: ''1;
    '. (SYNTAX_ERROR)

    , result:

    stdout:

    Database: test_urz6rk5z
    Child processes of 9782:
    9785 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings
    Max failures chain

Signed-off-by: Azat Khuzhin <[email protected]>
@azat azat force-pushed the tests-processes-leftovers branch from 9658be5 to 72bd43a Compare August 6, 2024 14:43
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 7, 2024

Now it works as expected, CI failures does not related:

AST fuzzer (debug) — Logical error: 'UNION query x AS (SELECT * FROM department__fuzz_0 UNION ALL SELECT * FROM q UNION ALL SELECT * FROM x) is not recursive'.

@pamarcos pamarcos self-assigned this Aug 7, 2024
@pamarcos pamarcos self-requested a review August 7, 2024 16:37
Copy link
Copy Markdown
Member

@pamarcos pamarcos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work here 🦸
I left some minor comments and questions

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Aug 9, 2024
Merged via the queue into ClickHouse:master with commit 917920c Aug 9, 2024
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Aug 9, 2024
@azat azat deleted the tests-processes-leftovers branch August 9, 2024 07:38
@vdimir vdimir mentioned this pull request Aug 13, 2024
@azat
Copy link
Copy Markdown
Member Author

azat commented Aug 16, 2024

Yeah, known issue, should be fixed by #68487

azat added a commit to azat/ClickHouse that referenced this pull request Aug 16, 2024
After ClickHouse#67737 the output
will be broken, since in case of timeout it will print to stdout.

Let's just capture it and add it to stderr.

Signed-off-by: Azat Khuzhin <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors 🍃 green ci 🌿 Fixing flaky tests in CI pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants