Fix test_distributed_queries_stress #44573

Merged: alexey-milovidov merged 2 commits into master from fix-distributed-queries-stress on Dec 27, 2022

@alexey-milovidov (Member) commented Dec 25, 2022

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

See #21944
This closes #41776
With concurrency 100, the test was often terminated by the OOM killer.
Our testing machines have little memory.
If heavy stuff like HDFS, MinIO, and Azure is running on the same machine, the failure is not surprising.

robot-ch-test-poll added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) on Dec 25, 2022
@alexey-milovidov (Member, Author)

@azat Did not help, removing...

@azat (Member) commented Dec 26, 2022

I'm working on a fix in #44537, and after tuning the thread limits I no longer see OOMs. There is also a separate PR to capture dmesg for integration tests: #44535.
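Capturing dmesg helps because the kernel logs every OOM-killer victim there. A minimal sketch, assuming the dmesg output has already been collected as plain text: the helper `find_oom_kills` is hypothetical (not part of the ClickHouse test framework), and its regex targets the kernel's "Killed process <pid> (<comm>)" wording, which can differ slightly between kernel versions.

```python
import re

# Hypothetical helper. Matches kernel OOM-killer lines such as:
#   "Out of memory: Killed process 4242 (clickhouse) total-vm:..."
#   "Memory cgroup out of memory: Killed process 4242 (clickhouse) ..."
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(dmesg_text: str) -> list[tuple[int, str]]:
    """Return (pid, command name) pairs for processes killed by the OOM killer."""
    return [(int(pid), comm) for pid, comm in OOM_KILL_RE.findall(dmesg_text)]


if __name__ == "__main__":
    sample = (
        "[ 1.0] eth0: link up\n"
        "[ 2.0] Out of memory: Killed process 4242 (clickhouse) total-vm:1024kB\n"
    )
    print(find_oom_kills(sample))
```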

P.S. Simply decreasing the concurrency will not always help, since the problem can be the number of threads in the process itself, and that is what #44537 reduces.
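The point above is that memory pressure can come from the number of threads inside a single process, not only from client-side concurrency. On Linux, that number is exposed in the `Threads:` field of `/proc/<pid>/status`. The sketch below is a hypothetical helper (`thread_count` is not part of the test suite) and only works on Linux:

```python
import threading


def thread_count(pid="self"):
    """Return the number of threads in a process, read from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError(f"no Threads: field for pid {pid}")


if __name__ == "__main__":
    # Start a few idle threads and observe the count go up.
    stop = threading.Event()
    workers = [threading.Thread(target=stop.wait) for _ in range(8)]
    before = thread_count()
    for w in workers:
        w.start()
    print(before, "->", thread_count())
    stop.set()
    for w in workers:
        w.join()
```

Pointing it at a server PID instead of `"self"` would show how many threads that process is holding while the benchmark runs.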

@alexey-milovidov (Member, Author)

#44618

@alexey-milovidov (Member, Author)

#44572

azat added a commit to azat/ClickHouse that referenced this pull request on Dec 27, 2022
Sometimes one of the containers got KILL'ed:

    2022-11-20 15:06:43 [ 317 ] DEBUG : run container_id:roottestdistributedqueriesstress_node1_r1_1 detach:False nothrow:False cmd: ['bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:1745, exec_in_container)
    2022-11-20 15:06:43 [ 317 ] DEBUG : Command:['docker', 'exec', 'roottestdistributedqueriesstress_node1_r1_1', 'bash', '-c', "echo 'select * from dist_two where key = 0;\n    select * from dist_two where key = 1;\n    select * from dist_two where key = 2;\n    select * from dist_two where key = 3;\n    select * from dist_two;' | clickhouse benchmark --concurrency=100 --cumulative --delay=0 --timelimit=3 --hedged_connection_timeout_ms=200 --connect_timeout_with_failover_ms=200 --connections_with_failover_max_tries=5 --async_socket_for_remote=0 --distributed_group_by_no_merge=2"] (cluster.py:95, run_and_check)
    2022-11-20 15:08:48 [ 317 ] DEBUG : Stderr:Loaded 5 queries. (cluster.py:105, run_and_check)
    2022-11-20 15:08:48 [ 317 ] DEBUG : Exitcode:137 (cluster.py:107, run_and_check)

(Note: exit code 137 is 128 + 9, i.e. the process was terminated by SIGKILL.)
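The 128 + N convention can be decoded mechanically. A minimal sketch using only Python's standard `signal` module; `describe_exit_code` is a hypothetical helper, not part of `cluster.py`:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit status, as reported by bash or `docker exec`."""
    # Shells report "terminated by signal N" as exit status 128 + N.
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            pass  # 128 + N where N is not a known signal number
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

The Linux OOM killer terminates its victim with SIGKILL, so 137 is consistent with an OOM kill, but any other SIGKILL produces the same code; kernel logs are needed to tell them apart.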

    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.244758252Z" level=debug msg="Revoking external connectivity on endpoint roottestdistributedqueriesstress_node1_r1_1 (82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51)"
    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.445809392Z" level=debug msg="82dfd051d379869bf885f90745cd4b097c70cd04bd3b4f86e49096358112fc51 (1c6863e).deleteSvcRecords(roottestdistributedqueriesstress_node1_r1_1, 172.16.8.2, <nil>, true) updateSvcRecord sid:82dfd051d3798
    parallel1_0_dockerd.log:time="2022-11-20T15:08:48.526045522Z" level=debug msg="Releasing addresses for endpoint roottestdistributedqueriesstress_node1_r1_1's interface on network roottestdistributedqueriesstress_default"

The problem is likely an OOM, which happens only under ASan with lots
of threads.

v2: tests: decrease concurrency for test_distributed_queries_stress
v3: increase timeout for internal command execution
v4: rebase on top of ClickHouse#44573
Fixes: ClickHouse#41776
Signed-off-by: Azat Khuzhin <[email protected]>
Development

Successfully merging this pull request may close these issues.

test_distributed_queries_stress is flaky