Stateless tests: Improve tests speed by fm4v · Pull Request #65186 · ClickHouse/ClickHouse

fm4v · 2024-06-12T16:14:45Z

Results

The total number of jobs for stateless tests per CI run is reduced, from 49 to 28.
The Stateless tests (release) job now runs in 40 minutes instead of 60 minutes.
The number of parallel testing jobs in a single test run is decreased, so I expect tests to be more stable with fewer timeouts.

Changelog category (leave one):

Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Stateless tests: Improve tests speed and decrease number of parallel jobs

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

Exclude: Style check
Exclude: Fast test
Exclude: All with ASAN
Exclude: All with TSAN, MSAN, UBSAN, Coverage
Exclude: All with aarch64, release, debug

robot-clickhouse-ci-2 · 2024-06-12T16:17:31Z

This is an automated comment for commit eeb3561 with description of existing statuses. It's updated for the latest CI running

Successful checks

Check name	Description	Status
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success

Algunenano

I haven't reviewed everything because I see it's in progress, but you are removing many no-parallel tags that are there because the test creates and drops a database and that's not safe to run in parallel with itself.

Most of them are probably added because otherwise the flaky check will detect this problem, and also most of them do not need to create a database and should be using the random one created for each test. This can be done in 3 ways:

1. Do not do anything.
2. Use currentDatabase() when you need to reference the string.
3. Use bash and $CURRENT_DATABASE when you need it explicitly in a query. The reference file will replace it by default when checking the results.
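
As a rough illustration of the third option, the sketch below shows how a test runner might swap a per-test random database name for a fixed placeholder before comparing the output with a static reference file. The names `random_database_name` and `normalize_output`, the `test_` prefix, and the `default` placeholder are assumptions made for this example; this is not the actual clickhouse-test code.

```python
import random
import string


def random_database_name(prefix: str = "test_") -> str:
    """Hypothetical helper: build a unique database name for one test run."""
    suffix = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return prefix + suffix


def normalize_output(stdout: str, database: str, placeholder: str = "default") -> str:
    """Replace the random database name so output can match a static reference."""
    return stdout.replace(database, placeholder)


database = random_database_name()

# Output a test might print when its query embeds the database name.
stdout = f"CREATE TABLE {database}.t (x UInt8) ENGINE = Memory\n"
# The reference file only ever contains the placeholder name.
reference = "CREATE TABLE default.t (x UInt8) ENGINE = Memory\n"

normalized = normalize_output(stdout, database)
print(normalized == reference)
```

With a scheme like this, a test never hard-codes a database name, so two copies of it can run side by side without touching each other's objects.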

tests/queries/0_stateless/00623_truncate_all_tables.sql

tests/queries/0_stateless/01036_no_superfluous_dict_reload_on_create_database.sql

tests/queries/0_stateless/01129_dict_get_join_lose_constness.sql

fm4v · 2024-06-13T11:41:20Z

@Algunenano Yes, I'm trying to remove unnecessary no-parallel tags, and it is safe to create a database in the test if the same database name is not used by other tests. However, it appears that the flaky check fails, but the regular test run is correct.

Algunenano · 2024-06-13T12:19:04Z

> However, it appears that the flaky check fails, but the regular test run is correct.

Yes, the flaky test will fail, which is bad: it runs every time we modify the test, so the check becomes worthless. The same happens if you run the test locally many times.

It may be worth gradually removing database creation from those tests to make them truly parallel.

fm4v · 2024-06-19T15:58:34Z

To improve test runs, we need to identify what takes the most time in a single test job:

- 5 min setup
- 15 min for all parallel tests
- 40 min for sequential tests

It means we need to make the sequential tests much faster. Tests are tagged with no-parallel for the following reasons:

- Usage of shared resources: databases, dictionaries from XML, users, system databases and tables, global SYSTEM commands, etc.
- High load due to multi-threaded object creation (replicas, databases, tables).
- Inability to pass the flaky check, which runs the same test 100 times in parallel, leading to the use of the no-parallel tag to pass it.
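
To make the flaky-check problem concrete, here is a small simulation of 100 concurrent runs of the same test. `FakeServer` is a hypothetical in-memory stand-in for a server's database catalog, and the barrier forces every run to hold its database at the same moment; a real flaky check only overlaps some runs, so this is a worst-case sketch rather than a model of the actual CI.

```python
import threading
import uuid


class FakeServer:
    """Hypothetical stand-in for a server's database catalog."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._databases: set[str] = set()

    def create_database(self, name: str) -> bool:
        """Return False if the name is already taken (a collision)."""
        with self._lock:
            if name in self._databases:
                return False
            self._databases.add(name)
            return True

    def drop_database(self, name: str) -> None:
        with self._lock:
            self._databases.discard(name)


def count_collisions(make_name, runs: int = 100) -> int:
    """Run the same 'test' `runs` times in parallel and count failed creates."""
    server = FakeServer()
    barrier = threading.Barrier(runs)
    counter_lock = threading.Lock()
    collisions = 0

    def one_run() -> None:
        nonlocal collisions
        name = make_name()
        created = server.create_database(name)
        barrier.wait()  # every run holds its database at the same moment
        if created:
            server.drop_database(name)
        else:
            with counter_lock:
                collisions += 1

    threads = [threading.Thread(target=one_run) for _ in range(runs)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return collisions


fixed = count_collisions(lambda: "test_db")                     # hard-coded name
unique = count_collisions(lambda: f"test_{uuid.uuid4().hex}")   # per-run name
print(fixed, unique)  # prints "99 0"
```

With a hard-coded name only the first run succeeds in creating the database; with a unique name per run there is nothing to collide on, which is the property a test needs before its no-parallel tag can be dropped.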

This PR implements several changes to optimize run speed, prioritized by impact on performance:

1. Run sequential tests on a dedicated CH process in parallel with all other tests.
2. Convert long-running tests tagged with no-parallel to parallel.
3. Decrease the randomization of max_threads and max_insert_threads settings.
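
The effect of the first change can be estimated with a back-of-the-envelope model using the phase timings listed above. It assumes, as a simplification, that the two test phases do not slow each other down when they overlap; the function name `job_minutes` is made up for this sketch.

```python
SETUP_MIN = 5        # setup phase
PARALLEL_MIN = 15    # all parallel tests
SEQUENTIAL_MIN = 40  # tests tagged no-parallel


def job_minutes(setup: int, parallel: int, sequential: int, overlap: bool) -> int:
    """Estimate job duration; with overlap, the longer test phase dominates."""
    if overlap:
        return setup + max(parallel, sequential)
    return setup + parallel + sequential


before = job_minutes(SETUP_MIN, PARALLEL_MIN, SEQUENTIAL_MIN, overlap=False)
after = job_minutes(SETUP_MIN, PARALLEL_MIN, SEQUENTIAL_MIN, overlap=True)
print(before, after)  # prints "60 45"
```

Under this model, overlapping the phases alone leaves the sequential tests as the bottleneck, which is why the other two changes target how long those tests take.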

Results
When test run time is decreased, we can reduce the number of jobs and keep the total run time under 1 hour.

The total number of jobs for stateless tests per CI run is reduced, from 49 to 28.
The Stateless tests (release) job now runs in 40 minutes instead of 60 minutes.
The number of parallel testing jobs in a single test run is decreased, so I expect tests to be more stable with fewer timeouts.

Algunenano

Incredible work.

I've left some comments. Please make sure to verify the sync PR too, since there are some conflicts and the CI is not running there because of it.

docker/test/stateless/run.sh

tests/clickhouse-test

tests/queries/0_stateless/01113_local_dictionary_type_conversion.sql

tests/queries/0_stateless/01125_dict_ddl_cannot_add_column.sql

tests/queries/0_stateless/01259_dictionary_custom_settings_ddl.sql

tests/queries/0_stateless/01658_read_file_to_stringcolumn.sh

maxknv · 2024-07-13T19:37:38Z

We have OOM quite often. Mostly with Stateless tests (msan, distributed cache, s3 storage) [1/3] and Stateless tests (asan, distributed cache, s3 storage) [1/3].
OOM happens with the first job batch in more than 95% of cases. It looks very suspicious. Is the first batch used for non-parallel tests now?

azat · 2024-07-14T06:12:00Z

docker/test/stateless/run.sh

+      clickhouse-client --port 19000 --query "SELECT 1" && break
+      sleep 1
+  done
+fi


You forgot to call setup_logs_replication for this instance. While you are here, can you fix this for USE_DATABASE_REPLICATED as well?

fm4v

@maxknv Most likely a test in this group is running out of memory, probably because it now runs in parallel with other tests. I'll fix it.

fm4v · 2024-07-14

@azat I'll fix setup_logs_replication, but fixing USE_DATABASE_REPLICATED does not make sense, since neither the number of these builds nor the time to run their tests needs to be optimized.

azat · 2024-07-14

I meant adding setup_logs_replication there as well.

azat · 2025-02-04T16:49:22Z

tests/clickhouse-test

        if result.reason is not None:
-            description_full += " - "
-            description_full += result.reason.value
+            description_full += f"\nReason: {result.reason.value} "


Why do you prefer to use \n? It became too verbose; let's get the compact behavior back: #75530
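
For comparison, the two formats from the diff above can be reproduced in isolation. The `FailureReason` enum, its value, and the two function names are hypothetical stand-ins for whatever `result.reason` holds in tests/clickhouse-test; only the string concatenation mirrors the diff.

```python
from enum import Enum
from typing import Optional


class FailureReason(Enum):
    """Hypothetical stand-in for the type of result.reason."""
    TIMEOUT = "Timeout!"


def describe_compact(description: str, reason: Optional[FailureReason]) -> str:
    """Old behavior from the diff: append the reason on the same line."""
    if reason is not None:
        description += " - "
        description += reason.value
    return description


def describe_verbose(description: str, reason: Optional[FailureReason]) -> str:
    """New behavior from the diff: put the reason on its own line."""
    if reason is not None:
        description += f"\nReason: {reason.value} "
    return description


compact = describe_compact("[ FAIL ]", FailureReason.TIMEOUT)
verbose = describe_verbose("[ FAIL ]", FailureReason.TIMEOUT)
print(repr(compact))  # prints '[ FAIL ] - Timeout!'
print(repr(verbose))  # prints '[ FAIL ]\nReason: Timeout! '
```

The verbose form adds one extra line per failed test, which is what the review comment describes as too verbose.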

fm4v added the can be tested Allows running workflows for external contributors label Jun 12, 2024

robot-clickhouse-ci-2 added the pr-build Pull request with build/testing/packaging improvement label Jun 12, 2024

Algunenano self-assigned this Jun 13, 2024

Algunenano requested changes Jun 13, 2024

fm4v force-pushed the optimize-tests branch 11 times, most recently from f585a10 to 9fffce4 Compare June 19, 2024 09:58

fm4v force-pushed the optimize-tests branch 2 times, most recently from 2723a53 to 9c377f8 Compare June 24, 2024 14:12

Algunenano reviewed Jun 25, 2024

fm4v enabled auto-merge June 26, 2024 12:35

fm4v disabled auto-merge June 26, 2024 12:59

Algunenano mentioned this pull request Jun 26, 2024

Prevent 03172_error_log_table_not_empty from running in parallel #65725

Closed

Algunenano approved these changes Jun 26, 2024

fm4v enabled auto-merge June 27, 2024 07:19

fm4v added this pull request to the merge queue Jun 27, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 27, 2024

fm4v added this pull request to the merge queue Jun 27, 2024

Algunenano removed this pull request from the merge queue due to a manual request Jun 27, 2024

fm4v force-pushed the optimize-tests branch 3 times, most recently from 0d89c4e to 1111acb Compare July 2, 2024 13:44

azat mentioned this pull request Jul 3, 2024

Sanitizers builds are broken #66049

Closed

fm4v force-pushed the optimize-tests branch 5 times, most recently from d8d4a61 to 57432b8 Compare July 9, 2024 16:16

Stateless tests: run sequential tests in parallel to other tests

eeb3561

fm4v force-pushed the optimize-tests branch from 57432b8 to eeb3561 Compare July 9, 2024 17:41

fm4v enabled auto-merge July 9, 2024 20:36

fm4v added this pull request to the merge queue Jul 9, 2024

Merged via the queue into master with commit 249c80a Jul 9, 2024

fm4v deleted the optimize-tests branch July 9, 2024 21:53

robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 9, 2024

This was referenced Jul 9, 2024

Stateless tests: Improve tests speed 2 #66305

Merged

Stateless tests: Improve tests speed 3 #66363

Merged

azat reviewed Jul 14, 2024

tavplubix mentioned this pull request Jul 19, 2024

Update ci_config.py #66783

Merged

19 tasks

fm4v mentioned this pull request Aug 27, 2024

Stateless tests: run sequential tests on second instance #68919

Closed

21 tasks

azat reviewed Feb 4, 2025
