Fix 01246_buffer_flush flakiness (by tuning timeouts) #65310

Merged
Avogar merged 1 commit into ClickHouse:master from azat:tests/01246_buffer_flush
Jun 17, 2024

Member

azat commented Jun 14, 2024

CI: https://s3.amazonaws.com/clickhouse-test-reports/0/efb31c1d3f79e04a94087e883bc19553c5604268/stateless_tests__tsan__[3_5].html (cc @alexey-milovidov)

2024.06.14 09:42:08.255603 [ 2128 ] {0017dceb-df44-45b1-a5e8-42a52adc1d15} <Debug> executeQuery: (from [::1]:33350) (comment: 01246_buffer_flush.sql) insert into buffer_01256 select * from system.numbers limit 5; (stage: Complete)
...
2024.06.14 09:42:08.282011 [ 2128 ] {b0a59862-2e66-4664-aa16-8707a6bc0fec} <Debug> executeQuery: (from [::1]:33350) (comment: 01246_buffer_flush.sql) -- sleep 2 (min time) + 1 (round up) + bias (1) = 4
2024.06.14 09:42:12.296662 [ 2128 ] {6e273925-4bee-4e65-bf91-222c14dd2b00} <Debug> executeQuery: (from [::1]:33350) (comment: 01246_buffer_flush.sql) select count() from data_01256; (stage: Complete)
...
2024.06.14 09:42:12.666403 [ 935 ] {} <Debug> StorageBuffer (test_74dkdipr.buffer_01256): Flushing buffer with 5 rows, 40 bytes, age 3 seconds, took 1391 ms (bg).
2024.06.14 09:42:12.666584 [ 935 ] {} <Trace> StorageBuffer (test_74dkdipr.buffer_01256)/Bg: Execution took 1391 ms.
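
The timing in the log excerpt above can be checked with a little arithmetic. The sketch below is only an illustration: it copies the timestamps from the log lines and assumes that the "Flushing buffer ... took 1391 ms" line is written when the flush finishes, so the flush start is estimated by subtracting the duration. All variable names are made up for this example.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the server log lines quoted above.
LOG_TIME_FORMAT = "%Y.%m.%d %H:%M:%S.%f"


def parse_log_time(text: str) -> datetime:
    """Parse a timestamp such as '2024.06.14 09:42:08.255603'."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


# Values copied from the log excerpt.
insert_at = parse_log_time("2024.06.14 09:42:08.255603")        # insert into buffer_01256
select_at = parse_log_time("2024.06.14 09:42:12.296662")        # select count() from data_01256
flush_logged_at = parse_log_time("2024.06.14 09:42:12.666403")  # "Flushing buffer ..." line
flush_took = timedelta(milliseconds=1391)                       # "took 1391 ms"

# Assumption: the flush line is logged at completion, so the start is
# roughly the log time minus the reported duration.
flush_started_at = flush_logged_at - flush_took

select_after_insert = (select_at - insert_at).total_seconds()
flush_start_after_insert = (flush_started_at - insert_at).total_seconds()
flush_end_after_select = (flush_logged_at - select_at).total_seconds()

print(f"select ran {select_after_insert:.3f} s after the insert")
print(f"flush started about {flush_start_after_insert:.3f} s after the insert")
print(f"flush finished {flush_end_after_select:.3f} s after the select")
```

Under that assumption, the flush began before the `select count()` but completed only after it, which would explain why the test observed a stale count.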

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

robot-ch-test-poll2 added the pr-not-for-changelog label (This PR should not be mentioned in the changelog) Jun 14, 2024
Contributor

robot-ch-test-poll2 commented Jun 14, 2024

This is an automated comment for commit a346909 with a description of existing statuses. It is updated for the latest CI run.

❌ Click here to open a full report in a separate page

| Check name | Description | Status |
|---|---|---|
| A Sync | If it fails, ask a maintainer for help | ⏳ pending |
| CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ❌ failure |
| Mergeable Check | Checks if all other necessary checks are successful | ❌ failure |
| Stateless tests | Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ❌ failure |

Successful checks

| Check name | Description | Status |
|---|---|---|
| ClickHouse build check | Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often have enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process | ✅ success |
| Docs check | Builds and tests the documentation | ✅ success |
| Fast test | Normally this is the first check that is run for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here | ✅ success |
| Flaky tests | Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc | ✅ success |
| Integration tests | The integration tests report. In parentheses the package type is given, and in square brackets are the optional part/total tests | ✅ success |
| PR Check | Checks correctness of the PR's body | ✅ success |
| Stateful tests | Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc | ✅ success |
| Style check | Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report | ✅ success |
| Unit tests | Runs the unit tests for different release types | ✅ success |

select count() from data_01256;
-- sleep 2 (min time) + 1 (round up) + bias (1) = 4
select sleepEachRow(2) from numbers(2) FORMAT Null;
-- It is enough to ensure that the buffer will be flushed earlier then 2*min_time (10 sec)
Member

But it is not guaranteed whatsoever. The thread doing the buffer flush can be delayed for an arbitrary time.

Member Author

It is a time-related test; there is no way to guarantee that the flush will happen. The only thing that comes to mind is to mark the test as no-parallel, to reduce the influence of other tests.

alexey-milovidov self-assigned this Jun 14, 2024
Member Author

azat commented Jun 15, 2024

Stateless tests (release, old analyzer, s3, DatabaseReplicated) [2/4] — fail: 1, passed: 1672, skipped: 36

Something interesting

2024-06-15 06:23:27 [be708bc14dad] 2024.06.14 13:23:26.908995 [ 12446 ] {d00df20e-23ce-49ec-81bf-d63b2508fa6d} <Error> executeQuery: Code: 12. DB::Exception: Parameter out of bound in IColumnString::insertRangeFrom method.: (while reading column d.Array(Variant(String, UInt64))): (while reading from part /var/lib/clickhouse/store/f2c/f2c69d4f-cdaf-4936-80f6-1d9ed0c4a190/all_1_17_2/ in table test_gxu55426.test (f2c69d4f-cdaf-4936-80f6-1d9ed0c4a190) located on disk default of type local, from mark 352 with max_rows_to_read = 9346): While executing MergeTreeSelect(pool: ReadPool, algorithm: Thread). (PARAMETER_OUT_OF_BOUND) (version 24.6.1.3899 (official build)) (from [::1]:33950) (in query: select d.UInt64, d.Date, d.`Array(Variant(String, UInt64))`, d.`Array(Variant(String, UInt64))`.size0, d.`Array(Variant(String, UInt64))`.UInt64, d.`Array(Variant(String, UInt64))`.String from test format Null), Stack trace (when copying this message, always include the lines below):

@alexey-milovidov
Member

If the test has no guarantee to pass, delete it.

Member Author

azat commented Jun 16, 2024

> If the test has no guarantee to pass, delete it.

I don't think it is a good idea to remove all such tests; otherwise, how can you cover functionality like the background flush of Buffer tables (and this is not the only one)?

Making them non-flaky in 99.9% of cases should be enough from my perspective (note that a 0.01% chance of flakiness is not equal to 0.01% of failures)
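
One possible reading of the remark about flakiness versus failures is that a small per-run failure chance compounds over many runs. The sketch below is only an illustration of that arithmetic, assuming independent runs; the function name and the chosen run counts are made up for this example (100 is the repeat count mentioned in the Flaky tests check description).

```python
def chance_of_at_least_one_failure(p_single_run: float, runs: int) -> float:
    """Probability that at least one of `runs` independent runs fails,
    given a per-run failure probability `p_single_run`."""
    return 1.0 - (1.0 - p_single_run) ** runs


# Per-run failure chances of 0.1% and 0.01%, over a few run counts.
for p in (0.001, 0.0001):
    for runs in (1, 100, 10_000):
        chance = chance_of_at_least_one_failure(p, runs)
        print(f"p={p:.4%} runs={runs:>6}: at least one failure with chance {chance:.2%}")
```

Under the independence assumption, the per-run rate and the rate at which some CI job somewhere turns red are different quantities, and the latter grows with the number of runs.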

@alexey-milovidov
Member

Ok. Waiting for 03036_dynamic_read_subcolumns to be fixed.

alexey-milovidov changed the title from "Fix 01246_buffer_flush flakiness (by tunning timeouts)" to "Fix 01246_buffer_flush flakiness (by tuning timeouts)" Jun 17, 2024
Member Author

azat commented Jun 17, 2024

It has been fixed in #65341

Avogar added this pull request to the merge queue Jun 17, 2024
Merged via the queue into ClickHouse:master with commit ed3fb0c Jun 17, 2024
azat deleted the tests/01246_buffer_flush branch June 17, 2024 17:58
robot-clickhouse added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Jun 17, 2024
azat added a commit to azat/ClickHouse that referenced this pull request Jul 7, 2024
- reduce min_time for Buffer's min test
- rewrite the test to .sh to avoid extra sleeping time (with .sql we
  have to wait for the max time)
- change the assertion for min test, the time there should not exceed
  max time (100 seconds), this should fix the test flakiness [1] even
  after [2].

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/0/76119a4567ce2ac9c0aff715c1a9ba2607e806e0/stateless_tests__tsan__[3_5].html
  [2]: ClickHouse#65310

Signed-off-by: Azat Khuzhin <[email protected]>
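
The idea in the commit message above, polling for the flush instead of sleeping for a fixed time and asserting only that the wait stays under the max time, can be sketched as follows. This is not the actual test (which the commit describes as a .sh script); `wait_until` and the simulated flush are hypothetical, and short durations stand in for the real limits.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], max_wait: float, interval: float = 0.1) -> float:
    """Poll `condition` until it returns True and return the elapsed seconds.

    Raises TimeoutError if `max_wait` seconds pass without success.
    """
    start = time.monotonic()
    while True:
        if condition():
            return time.monotonic() - start
        if time.monotonic() - start >= max_wait:
            raise TimeoutError(f"condition not met within {max_wait} s")
        time.sleep(interval)


# Simulated background flush: the data becomes visible about 0.3 s from now.
# In a real test the condition would query the destination table instead.
flush_visible_at = time.monotonic() + 0.3

elapsed = wait_until(lambda: time.monotonic() >= flush_visible_at, max_wait=2.0, interval=0.05)

# The assertion only bounds the wait from above, so a delayed flush thread
# slows the test down instead of failing it, as long as it stays under the limit.
assert elapsed < 2.0
print(f"flush observed after {elapsed:.2f} s")
```

The test returns as soon as the condition holds, so the common case is fast, and the upper bound can be generous without making every run slow.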