-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Stateless test 00653_verification_monotonic_data_load is flaky #68626
Description
In my PR #68323
2024-08-19 18:32:27 00653_verification_monotonic_data_load: [ FAIL ] 600.05 sec.
The test times out in ClickHouse Stateless tests (tsan, s3 storage)
The test is a shell script where the clickhouse client commands output is piped to a grep. Below is the server debug log for a command which was stuck for 4 minutes and then the test was killed due to hitting 600 seconds.
2024.08.19 18:28:11.330041 [ 10795 ] {} <Trace> TCPHandlerFactory: TCP Request. Address: [::1]:47212
2024.08.19 18:28:11.333131 [ 10795 ] {} <Debug> TCPHandler: Connected ClickHouse client version 24.9.0, revision: 54470, database: test_ltxmr5rm, user: default.
2024.08.19 18:28:11.333490 [ 10795 ] {} <Debug> TCP-Session: c36d5f04-e548-48ff-998d-f742b41219f6 Authenticating user 'default' from [::1]:47212
2024.08.19 18:28:11.333813 [ 10795 ] {} <Debug> TCP-Session: c36d5f04-e548-48ff-998d-f742b41219f6 Authenticated with global context as user 94309d50-4f52-5250-31bd-74fecac179db
2024.08.19 18:28:11.333966 [ 10795 ] {} <Debug> TCP-Session: c36d5f04-e548-48ff-998d-f742b41219f6 Creating session context with user_id: 94309d50-4f52-5250-31bd-74fecac179db
2024.08.19 18:28:11.405175 [ 10795 ] {} <Trace> TCP-Session: c36d5f04-e548-48ff-998d-f742b41219f6 Creating query context from session context, user_id: 94309d50-4f52-5250-31bd-74fecac179db, parent context user: default
2024.08.19 18:28:11.412546 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> executeQuery: (from [::1]:47212) (comment: 00653_verification_monotonic_data_load.sh) SELECT count() FROM signed_integer_test_table WHERE toUInt32(val) == 0; (stage: Complete)
2024.08.19 18:28:11.418827 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> Planner: Query to stage Complete
2024.08.19 18:28:11.420152 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> Planner: Query from stage FetchColumns to stage Complete
2024.08.19 18:28:11.423451 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> QueryPlanOptimizePrewhere: The min valid primary key position for moving to the tail of PREWHERE is 0
2024.08.19 18:28:11.424281 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Key condition: (toUInt32(column 0) in [0, 0])
2024.08.19 18:28:11.424598 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Filtering marks by primary and secondary keys
2024.08.19 18:28:11.425041 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Used generic exclusion search over index for part all_1_1_0 with 8 steps
2024.08.19 18:28:11.425290 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Selected 1/1 parts by partition key, 1 parts by primary key, 2/7 marks by primary key, 2 marks to read from 1 ranges
2024.08.19 18:28:11.425603 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Trace> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Spreading mark ranges among streams (default reading)
2024.08.19 18:28:11.426085 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Reading approx. 2 rows with 1 streams
2024.08.19 18:28:11.448629 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> executeQuery: Read 2 rows, 8.00 B in 0.039218 sec., 50.99699117752052 rows/sec., 203.99 B/sec.
2024.08.19 18:28:11.450063 [ 10795 ] {9c87b100-7c2a-404c-8cba-878a6e53e849} <Debug> TCPHandler: Processed in 0.047461623 sec.
2024.08.19 18:32:27.679498 [ 10795 ] {} <Debug> TCPHandler: Done processing connection.
At 18:32:27, the test was killed and hence the "Done processing connection" message is seen. Notice the 4 minute gap. The previous command in the test passed successfully and it's debug output looks like :-
2024.08.19 18:28:10.386297 [ 10816 ] {5a70522e-5958-4ef3-9eb7-fcc7061c3193} <Debug> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Selected 1/1 parts by partition key, 1 parts by primary key, 2/7 marks by primary key, 2 marks to read from 1 ranges
2024.08.19 18:28:10.386602 [ 10816 ] {5a70522e-5958-4ef3-9eb7-fcc7061c3193} <Trace> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Spreading mark ranges among streams (default reading)
2024.08.19 18:28:10.386977 [ 10816 ] {5a70522e-5958-4ef3-9eb7-fcc7061c3193} <Debug> test_ltxmr5rm.signed_integer_test_table (66f2add6-d9c5-4eef-adf0-773f6c90e542) (SelectExecutor): Reading approx. 2 rows with 1 streams
2024.08.19 18:28:10.403503 [ 10816 ] {5a70522e-5958-4ef3-9eb7-fcc7061c3193} <Debug> executeQuery: Read 2 rows, 8.00 B in 0.02959 sec., 67.59040216289287 rows/sec., 270.36 B/sec.
2024.08.19 18:28:10.405171 [ 10816 ] {5a70522e-5958-4ef3-9eb7-fcc7061c3193} <Debug> TCPHandler: Processed in 0.036433273 sec.
2024.08.19 18:28:10.607434 [ 10816 ] {} <Debug> TCPHandler: Done processing connection.
2024.08.19 18:28:10.609280 [ 10816 ] {} <Debug> TCP-Session: ba83d9ed-0a90-4583-94c4-5e30def3029b Logout, user_id: 94309d50-4f52-5250-31bd-74fecac179db
The test's command looks like this :
${CLICKHOUSE_CLIENT} --query="SELECT count() FROM signed_integer_test_table WHERE toUInt32(val) == 0;" 2>&1 |grep -q "2 marks to read from" && echo "monotonic int case: Int32 -> UInt32"
Maybe the last bytes of output in the TCP socket get stuck somewhere or something to do with socket close etc.