Fix 03172_error_log_table_not_empty#65604
Conversation
Turns out when executing lots of tests in parallel there might be more than 1 min elapsed between the first errors and the second ones.
|
This is an automated comment for commit fa3295b with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
|
This PR makes the test less flaky but doesn't fix flakiness. The |
devcrafter
left a comment
There was a problem hiding this comment.
It can be a temporary fix. And I'd suggest addressing the comment above (can be in separate PR)
Just to clarify, do you mean
I thought exactly the same, because it'd help to disambiguate the query without the need for filtering within any time window at all. |
|
I'm taking a look and we may have to limit the amount of query ids we store in e.g. the SELECT
name,
code,
value
FROM system.errors
Query id: ff43780c-36c3-4081-bbfc-d0e1961a3a3c
┌─name─────────────────────┬─code─┬─value─┐
1. │ NOT_IMPLEMENTED │ 48 │ 19424 │
2. │ CANNOT_MANIPULATE_SIGSET │ 111 │ 1 │
3. │ NO_ELEMENTS_IN_CONFIG │ 139 │ 2 │
4. │ KEEPER_EXCEPTION │ 999 │ 2 │
└──────────────────────────┴──────┴───────┘
4 rows in set. Elapsed: 0.016 sec. |
Yes, sorry, I meant |
Let's get rid of the time constraints by checking the previous number of errors and making sure they're increased.
|
@pamarcos Not a big deal here, but not sure that merging w/o review is a good thing. AFAIS, the updated test should be run sequentially, i.e. |
@devcrafter you're absolutely right. My apologies 🙏. I rushed it because I saw more instances of the flake test making harm to other PRs CI.
Regarding this, I have the work done in a branch that adds last 10 |
|
That test |
| SELECT throwIf(true, 'error_log', toInt16(111)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 111 } | ||
| SELECT throwIf(true, 'error_log', toInt16(222)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 222 } | ||
| SELECT throwIf(true, 'error_log', toInt16(333)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 333 } | ||
| SELECT sleep(2) format NULL; |
There was a problem hiding this comment.
It should be marked as Tags: no-fasttest since it uses timeouts.
There was a problem hiding this comment.
one type of error is:
2024-07-03 12:44:54 [ip-172-31-35-113] 2024.07.03 12:44:47.969209 [ 186095 ] {fa1e0b0e-25ea-4877-898e-7443db13c751} <Error> executeQuery: Code: 60. DB::Exception: Unknown table expression identifier 'system.error_log' in scope SELECT sum(value) FROM system.error_log WHERE code = 111. (UNKNOWN_TABLE) (version 24.7.1.1) (from [::ffff:127.0.0.1]:43186) (comment: 03172_error_log_table_not_empty.sh) (in query: SELECT sum(value) FROM system.error_log WHERE code = 111), Stack trace (when copying this message, always include the lines below):
the other type is:
2024-06-26 05:54:21 --- /usr/share/clickhouse-test/queries/0_stateless/03172_error_log_table_not_empty.reference 2024-06-26 05:06:35.554695301 +0700
2024-06-26 05:54:21 +++ /tmp/clickhouse-test/0_stateless/03172_error_log_table_not_empty.stdout 2024-06-26 05:54:21.550673271 +0700
2024-06-26 05:54:21 @@ -1,6 +1,6 @@
2024-06-26 05:54:21 1
2024-06-26 05:54:21 1
2024-06-26 05:54:21 1
2024-06-26 05:54:21 -1
2024-06-26 05:54:21 -1
2024-06-26 05:54:21 -1
2024-06-26 05:54:21 +0
2024-06-26 05:54:21 +0
2024-06-26 05:54:21 +0
There was a problem hiding this comment.
the second type of error is stale, it from the old version of the test AND event_time > now() - INTERVAL 1 MINUTE;
There was a problem hiding this comment.
Systems tables are created lazily. It could be not created yet when we read from it.
There was a problem hiding this comment.
It should be marked as Tags: no-fasttest since it uses timeouts.
Here's the reasoning why I closed the PR that added precisely that tag
the second type of error is stale, it from the old version of the test AND event_time > now() - INTERVAL 1 MINUTE;
That was part of the original test. The timing issue was fixed precisely in this PR. Are you sure the error you see is not from a prior and flaky version of the test?
Systems tables are created lazily. It could be not created yet when we read from it.
I don't follow you. At what point is the test reading from the table without flushing logs first?
There was a problem hiding this comment.
It should be marked as Tags: no-fasttest since it uses timeouts.
Here's the reasoning why I closed the PR that added precisely that tag
No. You were trying to add Tags: no-parallel. I'm talking about Tags: no-fasttest because the test uses timeouts.
the second type of error is stale, it from the old version of the test AND event_time > now() - INTERVAL 1 MINUTE;
That was part of the original test. The timing issue was fixed precisely in this PR. Are you sure the error you see is not from a prior and flaky version of the test?
That error from the test before you changes. Your changes fixed that error, that is good.
Systems tables are created lazily. It could be not created yet when we read from it.
I don't follow you. At what point is the test reading from the table without flushing logs first?
here
errors_111=$($CLICKHOUSE_CLIENT -q "SELECT sum(value) FROM system.error_log WHERE code = 111")
There was a problem hiding this comment.
No. You were trying to add Tags: no-parallel. I'm talking about Tags: no-fasttest because the test uses timeouts.
Right, but since it's not using timeouts anymore it shouldn't be needed, right?
here
errors_111=$($CLICKHOUSE_CLIENT -q "SELECT sum(value) FROM system.error_log WHERE code = 111")
You're absolutely right, my bad
There was a problem hiding this comment.
Right, but since it's not using timeouts anymore it shouldn't be needed, right?
Yes. This is why I fight for it and trying remove timeouts. It is better to leave the test to work as fast-test. Then mark it as no-fasttest.
There was a problem hiding this comment.
I see. The test takes ~6s to complete. If that's too long for a fast-test (I don't know where the limit is) let's add the no-fasttest 👍
Fix 03172_error_log_table_not_empty without any time constraints
Let's get rid of the time constraints by checking the previous number of errors and making sure they're increased.
Changelog category (leave one):
#65601