fix flaky 03172_error_log_table_not_empty by CheSema · Pull Request #66093 · ClickHouse/ClickHouse

CheSema · 2024-07-04T13:48:49Z

Changelog category (leave one):

Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

...

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

Allow: All Required Checks
Allow: Stateless tests
Allow: Stateful tests
Allow: Integration Tests
Allow: Performance tests
Allow: All Builds
Allow: batch 1, 2 for multi-batch jobs
Allow: batch 3, 4, 5, 6 for multi-batch jobs

Exclude: Style check
Exclude: Fast test
Exclude: All with ASAN
Exclude: All with TSAN, MSAN, UBSAN, Coverage
Exclude: All with aarch64, release, debug

pamarcos · 2024-07-04T14:38:15Z

EDIT: Already clarified the first point in #65604 (comment). THe second paragraph still holds.

~~I don't think this is fixing the test, to be honest. I don't understand neither of the 2 changes.~~
~~Why do you need to explicitly flush first thing if the test is already making sure logs are flushed before reading from system.errors_log?~~

Regarding removing the sleep(2), that should do the test way more flaky. Take into account that you can flush the table, but you need to give it time for error logs to be collected minimum at the interval time specified in the settings so that something at all is flushed. It's exactly the same as for 00990_metric_log_table_not_empty

CheSema · 2024-07-04T16:16:38Z

I don't think this is fixing the test, to be honest. I don't understand neither of the 2 changes.

Why do you need to explicitly flush first thing if the test is already making sure logs are flushed before reading from system.errors_log?

That is the line when the read from system.error_log is made before writing to it.

ClickHouse/tests/queries/0_stateless/03172_error_log_table_not_empty.sh

Line 11 in 1e5bc5b

    
           errors_111=$($CLICKHOUSE_CLIENT -q "SELECT sum(value) FROM system.error_log WHERE code = 111")

Regarding removing the sleep(2), that should do the test way more flaky. Take into account that you can flush the table, but you need to give it time for error logs to be collected minimum at the interval time specified in the settings so that something at all is flushed. It's exactly the same as for 00990_metric_log_table_not_empty

SYSTEM FLUSH LOGS flushes the data and waits until it done. No need to sleep.

pamarcos · 2024-07-04T16:19:11Z

SYSTEM FLUSH LOGS flushes the data and waits until it done. No need to sleep.

It's not that the test only needs to wait for the data to be flushed. It needs to wait so that the data is collected as well. If you don't wait, no data will be collected. Forcing a flush doesn't collect the data, since it always relies on periodic intervals for that.

CheSema · 2024-07-04T16:22:04Z

SYSTEM FLUSH LOGS

Please define what you mean by collecting data?

robot-ch-test-poll3 · 2024-07-04T16:22:50Z

This is an automated comment for commit 78a2139 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name	Description	Status
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	❌ failure
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	❌ failure
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	❌ failure
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	❌ failure

Successful checks

Check name	Description	Status
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
ClickBench	Runs [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker keeper image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docker server image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success

pamarcos · 2024-07-04T16:23:51Z

Please define what you mean by collecting data?

Collect the errors. This method is called by a background thread periodically.

CheSema · 2024-07-04T16:29:55Z

Please define what you mean by collecting data?

Collect the errors. This method is called by a background thread periodically.

Oh, that is a problem.
We have an invariant that SYSTEM FLUSH LOGS does its jobs synchronously. It happened that it is wrong for system.error_log.
Usually all events for system.[some_log] are accumulated in the buffer, which is written by SYSTEM FLUSH LOGS.
Your are pointing that there is no such accumulating process.

pamarcos · 2024-07-04T16:28:43Z

tests/queries/0_stateless/03172_error_log_table_not_empty.sh

 errors_333=$($CLICKHOUSE_CLIENT -q "SELECT sum(value) FROM system.error_log WHERE code = 333")

-# Throw three random errors: 111, 222 and 333 and wait for more than collect_interval_milliseconds to ensure system.error_log is flushed
+# Throw three random errors: 111, 222 and 333 and call flush logs to ensure system.error_log is flushed


Suggested change

# Throw three random errors: 111, 222 and 333 and call flush logs to ensure system.error_log is flushed

# Throw three random errors: 111, 222 and 333, wait for more than collect_time_interval_milliseconds to ensure the errors are collected and then flush logs to ensure system.error_log is flushed

pamarcos · 2024-07-04T16:29:56Z

tests/queries/0_stateless/03172_error_log_table_not_empty.sh

 SELECT throwIf(true, 'error_log', toInt16(111)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 111 }
 SELECT throwIf(true, 'error_log', toInt16(222)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 222 }
 SELECT throwIf(true, 'error_log', toInt16(333)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 333 }
-SELECT sleep(2) format NULL;


We need this. Without it, no errors are collected and tests fail close to 100% of the time:

+++ /home/ubuntu/ClickHouse/ClickHouse/tests/queries/0_stateless/03172_error_log_table_not_empty.stdout 2024-07-04 16:29:37.956423656 +0000 @@ -1,6 +1,6 @@ -1 -1 -1 -1 -1 -1 +0 +0 +0 +0 +0 +0

pamarcos · 2024-07-04T16:30:06Z

tests/queries/0_stateless/03172_error_log_table_not_empty.sh

 SELECT throwIf(true, 'error_log', toInt16(111)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 111 }
 SELECT throwIf(true, 'error_log', toInt16(222)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 222 }
 SELECT throwIf(true, 'error_log', toInt16(333)) SETTINGS allow_custom_error_code_in_throwif=1; -- { serverError 333 }
-SELECT sleep(2) format NULL;


Same for this one. We actually need it.

pamarcos · 2024-07-04T16:36:09Z

Oh, that is a problem. We have an invariant that SYSTEM FLUSH LOGS does its jobs synchronously. It happened that it is wrong for system.error_log. Usually all events for system.[some_log] are accumulated in the buffer, which is written by SYSTEM FLUSH LOGS. Your are pointing that there is no such accumulating process.

system.error_log was developed based on system.metric_log which does the same thing. Both are periodic collectors that aggregate the metrics for the defined time window interval. They're different by nature to most other system logs. It doesn't make sense to collect and flush data during a SYSTEM FLUSH LOGS if the periodic time interval has not elapsed yet, IMHO.

CheSema · 2024-07-04T17:28:21Z

I have restored timeouts. And marked the test as no-fast.

CheSema · 2024-07-05T10:09:04Z

Logical error: isUniqTypes()
https://s3.amazonaws.com/clickhouse-test-reports/66093/78a2139f2a43752196a029995b6965ada359c954/stress_test__tsan_.html

CheSema · 2024-07-05T10:29:48Z

Oh, that is a problem. We have an invariant that SYSTEM FLUSH LOGS does its jobs synchronously. It happened that it is wrong for system.error_log. Usually all events for system.[some_log] are accumulated in the buffer, which is written by SYSTEM FLUSH LOGS. Your are pointing that there is no such accumulating process.

system.error_log was developed based on system.metric_log which does the same thing. Both are periodic collectors that aggregate the metrics for the defined time window interval. They're different by nature to most other system logs. It doesn't make sense to collect and flush data during a SYSTEM FLUSH LOGS if the periodic time interval has not elapsed yet, IMHO.

It took me a while to understand, you are right here.

fix flaky test

1e5bc5b

CheSema mentioned this pull request Jul 4, 2024

Fix 03172_error_log_table_not_empty #65604

Merged

robot-ch-test-poll1 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jul 4, 2024

CheSema mentioned this pull request Jul 4, 2024

insertion deduplication on retries for materialised views #61601

Merged

nikitamikhaylov approved these changes Jul 4, 2024

View reviewed changes

nikitamikhaylov self-assigned this Jul 4, 2024

pamarcos reviewed Jul 4, 2024

View reviewed changes

restore timeouts, mark as no-fasttests

78a2139

alexey-milovidov approved these changes Jul 4, 2024

View reviewed changes

CheSema mentioned this pull request Jul 5, 2024

Logical error: isUniqTypes() #66122

Closed

CheSema added this pull request to the merge queue Jul 5, 2024

Merged via the queue into master with commit b59b47a Jul 5, 2024

CheSema deleted the chesema-fix-flaky-test branch July 5, 2024 10:32

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 5, 2024

	# Throw three random errors: 111, 222 and 333 and call flush logs to ensure system.error_log is flushed
	# Throw three random errors: 111, 222 and 333, wait for more than collect_time_interval_milliseconds to ensure the errors are collected and then flush logs to ensure system.error_log is flushed

Conversation

CheSema commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

CI Settings (Only check the boxes if you know what you are doing):

Uh oh!

pamarcos commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CheSema commented Jul 4, 2024

Uh oh!

pamarcos commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CheSema commented Jul 4, 2024

Uh oh!

robot-ch-test-poll3 commented Jul 4, 2024 • edited by robot-clickhouse-ci-1 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pamarcos commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CheSema commented Jul 4, 2024

Uh oh!

pamarcos Jul 4, 2024

Choose a reason for hiding this comment

Uh oh!

pamarcos Jul 4, 2024

Choose a reason for hiding this comment

Uh oh!

pamarcos Jul 4, 2024

Choose a reason for hiding this comment

Uh oh!

pamarcos commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CheSema commented Jul 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CheSema commented Jul 5, 2024

Uh oh!

CheSema commented Jul 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

CheSema commented Jul 4, 2024 •

edited

Loading

pamarcos commented Jul 4, 2024 •

edited

Loading

pamarcos commented Jul 4, 2024 •

edited

Loading

robot-ch-test-poll3 commented Jul 4, 2024 •

edited by robot-clickhouse-ci-1

Loading

pamarcos commented Jul 4, 2024 •

edited

Loading

pamarcos commented Jul 4, 2024 •

edited

Loading

CheSema commented Jul 4, 2024 •

edited

Loading