Smart handling of processes leftovers in tests#67737
Smart handling of processes leftovers in tests#67737alexey-milovidov merged 13 commits intoClickHouse:masterfrom
Conversation
|
This is an automated comment for commit bc2740a with description of existing statuses. It's updated for the latest CI running ✅ Click here to open a full report in a separate page Successful checks
|
ffca4b8 to
b4e3bbc
Compare
|
|
@azat Merge with master to include the fixes. |
b4e3bbc to
07d19b8
Compare
|
Done |
|
07d19b8 to
e68c9c8
Compare
|
Egh, it could kill more then it should: Hence CI fails Should be fixed by 9658be5 |
e68c9c8 to
9658be5
Compare
|
Previously processes cleanup on i.e. SIGINT simply did not work, because the launcher kills only processes in process group, while tests are launched with start_new_session=True for Popen(), which creates own process group. This is needed for killing process group in case of test timeout. So instead, look at the parent pid, and kill the child process groups. Also add some logging to make it more explicit which processes will be killed. Signed-off-by: Azat Khuzhin <[email protected]>
It is simply useless and only create output that only distracts. Signed-off-by: Azat Khuzhin <[email protected]>
This is to catch issues like [1]. [1]: ClickHouse#67736 Signed-off-by: Azat Khuzhin <[email protected]>
It was simply wrong before, but now, with capturing stacktrace that can take sometime it is a must. Signed-off-by: Azat Khuzhin <[email protected]>
Signed-off-by: Azat Khuzhin <[email protected]>
Previously it was possible to have original pgid from the spawned threads, that could lead to killing the caller script and in case of CI it could be init process [1]. [1]: https://s3.amazonaws.com/clickhouse-test-reports/67737/e68c9c8d16f37f6c25739076c9b071ed97952269/stress_test__asan_/stress_test_run_21.txt Repro: $ echo "SELECT '1" > tests/queries/0_stateless/00001_select_1.sql # break the test $ cat /tmp/test.sh ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings Before this change: $ /tmp/test.sh Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory Connecting to ClickHouse server... OK Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2 Running 3 stateless tests (MainProcess). 00001_select_1: [ FAIL ] Reason: return code: 62 Code: 62. DB::Exception: Syntax error: failed at position 8 (''1; '): '1; . Single quoted string is not closed: ''1; '. (SYNTAX_ERROR) , result: stdout: Database: test_hz2zwymr Child processes of 13041: 13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings Killing process group 13040 Processes in process group 13040: 13040 -bash 13042 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings [2]+ Stopped /tmp/test.sh [1]$ Process group 13040 should be killed Max failures chain [2]+ Killed /tmp/test.sh After: $ /tmp/test.sh Using queries from '/src/ch/worktrees/clickhouse-upstream/tests/queries' directory Connecting to ClickHouse server... OK Connected to server 24.8.1.1 @ bef896ce143ea4e0464c9829de6277ba06cc1a53 mt/rename-without-lock-v2 Running 3 stateless tests (MainProcess). 00001_select_1: [ FAIL ] Reason: return code: 62 Code: 62. DB::Exception: Syntax error: failed at position 8 (''1; '): '1; . Single quoted string is not closed: ''1; '. (SYNTAX_ERROR) , result: stdout: Database: test_urz6rk5z Child processes of 9782: 9785 python3 ./tests/clickhouse-test 0001_select --test-runs 3 --max-failures-chain 1 --no-random-settings --no-random-merge-tree-settings Max failures chain Signed-off-by: Azat Khuzhin <[email protected]>
5 seconds is too small and not enough to print even few frames. [1]: https://s3.amazonaws.com/clickhouse-test-reports/67737/9658be5eea8351655dd3ea77b8c1d4717bac7999/stress_test__ubsan_.html Signed-off-by: Azat Khuzhin <[email protected]>
…teful Signed-off-by: Azat Khuzhin <[email protected]>
They are too uncontrollable, and likely will leave some clients [1]. [1]: https://s3.amazonaws.com/clickhouse-test-reports/67737/9658be5eea8351655dd3ea77b8c1d4717bac7999/stress_test__ubsan_.html Signed-off-by: Azat Khuzhin <[email protected]>
9658be5 to
72bd43a
Compare
|
Now it works as expected, CI failures does not related:
|
pamarcos
left a comment
There was a problem hiding this comment.
Very nice work here 🦸
I left some minor comments and questions
Signed-off-by: Azat Khuzhin <[email protected]>
Signed-off-by: Azat Khuzhin <[email protected]>
Signed-off-by: Azat Khuzhin <[email protected]>
Signed-off-by: Azat Khuzhin <[email protected]>
|
after this pr we have number of failures with test_name = '[' |
|
Yeah, known issue, should be fixed by #68487 |
After ClickHouse#67737 the output will be broken, since in case of timeout it will print to stdout. Let's just capture it and add it to stderr. Signed-off-by: Azat Khuzhin <[email protected]>
Previously processes cleanup on i.e. SIGINT simply did not work, because
the launcher kills only processes in process group, while tests are
launched with start_new_session=True for Popen(), which creates own
process group.
This is needed for killing process group in case of test timeout.
So instead, look at the parent pid, and kill the child process groups.
Also add some logging to make it more informative which processes will be killed.
And now in case of test timeout
clickhouse-testwill gather stacktraces from clickhouse utilities as well (by sending SIGTSTP to them), this should help with issues like #67736And also omit python stacktraces in some cases when it is useless and only create noise (server died, clickhouse-test signals, i.e. the most common is SIGINT - Ctrl-C)
Changelog category (leave one):