Skip to content

Stateless tests: Improve tests speed#65186

Merged
fm4v merged 1 commit intomasterfrom
optimize-tests
Jul 9, 2024
Merged

Stateless tests: Improve tests speed#65186
fm4v merged 1 commit intomasterfrom
optimize-tests

Conversation

@fm4v
Copy link
Copy Markdown
Member

@fm4v fm4v commented Jun 12, 2024

Results

  • The total number of jobs for stateless tests per CI run is reduced, from 49 to 28.
  • The Stateless tests (release) job now runs in 40 minutes instead of 60 minutes.
  • The number of parallel testing jobs in a single test run is decreased, so I expect tests to be more stable with fewer timeouts.

Changelog category (leave one):

  • Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Stateless tests: Improve tests speed and decrease number of parallel jobs

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

CI Settings (Only check the boxes if you know what you are doing):

  • Allow: All Required Checks
  • Allow: Stateless tests
  • Allow: Stateful tests
  • Allow: Integration Tests
  • Allow: Performance tests
  • Allow: Normal Builds
  • Allow: Special Builds
  • Allow: All NOT Required Checks
  • Allow: batch 1, 2 for multi-batch jobs
  • Allow: batch 3, 4, 5, 6 for multi-batch jobs

  • Exclude: Style check
  • Exclude: Fast test
  • Exclude: All with ASAN
  • Exclude: All with TSAN, MSAN, UBSAN, Coverage
  • Exclude: All with aarch64, release, debug

  • Do not test
  • Woolen Wolfdog
  • Upload binaries for special builds
  • Disable merge-commit
  • Disable CI cache

@fm4v fm4v added the can be tested Allows running workflows for external contributors label Jun 12, 2024
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-build Pull request with build/testing/packaging improvement label Jun 12, 2024
@robot-clickhouse-ci-2
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-2 commented Jun 12, 2024

This is an automated comment for commit eeb3561 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

@Algunenano Algunenano self-assigned this Jun 13, 2024
Copy link
Copy Markdown
Member

@Algunenano Algunenano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed everything because I see it's in progress, but you are removing many no-parallel tags that are there because the test creates and drops a database and that's not safe to run in parallel with itself.

Most of them are probably added because otherwise the flaky check will detect this problem, and also most of them do not need to create a database and should be using the random one created for each test. This can be done in 3 ways:

  • Do not do anything.
  • Use currentDatabase() when you need to reference the string
  • Use bash and $CURRENT_DATABASE when you need it explicitly in a query. The reference file will replace it by default when checking the results.

@fm4v
Copy link
Copy Markdown
Member Author

fm4v commented Jun 13, 2024

@Algunenano Yes, I'm trying to remove unnecessary no-parallel tags, and it is safe to create a database in the test if the same database name is not used by other tests. However, it appears that the flaky check fails, but the regular test run is correct.

@Algunenano
Copy link
Copy Markdown
Member

However, it appears that the flaky check fails, but the regular test run is correct.

Yes, the flaky test will fail which is bad, because it will run every time we modify the test, making the check worthless. Same if you run the test locally many times.

Maybe it could be worth to slowly remove the creation of databases on those tests and making them truly parallel

@fm4v fm4v force-pushed the optimize-tests branch 11 times, most recently from f585a10 to 9fffce4 Compare June 19, 2024 09:58
@fm4v
Copy link
Copy Markdown
Member Author

fm4v commented Jun 19, 2024

To improve test runs, we need to identify what takes the most time in a single test job:

  • 5 min setup
  • 15 min for all parallel tests
  • 40 min for sequential tests

It means we need to make the sequential tests much faster. Tests are tagged with no-parallel for the following reasons:

  • Usage of shared resources: databases, dictionaries from XML, users, system databases and tables, global SYSTEM commands, etc.
  • High load due to multi-threaded object creation (replicas, databases, tables)
  • Inability to pass the flaky check, which runs the same test 100 times in parallel, leading to the use of the no-parallel tag to pass it.

This PR implements several changes to optimize run speed, prioritized by impact on performance:

  • Run sequential tests on a dedicated CH process in parallel with all other tests.
  • Convert long-running tests tagged with no-parallel to parallel.
  • Decrease the randomization of max_threads and max_insert_threads settings.

Results
When test run time is decreased, we can reduce the number of jobs and keep the total run time under 1 hour.

  • The total number of jobs for stateless tests per CI run is reduced, from 49 to 28.
  • The Stateless tests (release) job now runs in 40 minutes instead of 60 minutes.
  • The number of parallel testing jobs in a single test run is decreased, so I expect tests to be more stable with fewer timeouts.

@fm4v fm4v force-pushed the optimize-tests branch 2 times, most recently from 2723a53 to 9c377f8 Compare June 24, 2024 14:12
Copy link
Copy Markdown
Member

@Algunenano Algunenano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incredible work.

I've left some comments. Please make sure to verify the sync PR too, since there are some conflicts and the CI is not running there because of it.

@fm4v fm4v enabled auto-merge June 27, 2024 07:19
@fm4v fm4v added this pull request to the merge queue Jun 27, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 27, 2024
@fm4v fm4v added this pull request to the merge queue Jun 27, 2024
@Algunenano Algunenano removed this pull request from the merge queue due to a manual request Jun 27, 2024
@fm4v fm4v force-pushed the optimize-tests branch 3 times, most recently from 0d89c4e to 1111acb Compare July 2, 2024 13:44
@fm4v fm4v force-pushed the optimize-tests branch 5 times, most recently from d8d4a61 to 57432b8 Compare July 9, 2024 16:16
@fm4v fm4v force-pushed the optimize-tests branch from 57432b8 to eeb3561 Compare July 9, 2024 17:41
@fm4v fm4v enabled auto-merge July 9, 2024 20:36
@fm4v fm4v added this pull request to the merge queue Jul 9, 2024
Merged via the queue into master with commit 249c80a Jul 9, 2024
@fm4v fm4v deleted the optimize-tests branch July 9, 2024 21:53
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 9, 2024
@maxknv
Copy link
Copy Markdown
Member

maxknv commented Jul 13, 2024

We have OOM quite often. Mostly with Stateless tests (msan, distributed cache, s3 storage) [1/3] and Stateless tests (asan, distributed cache, s3 storage) [1/3].
OOM happens with the first job batch in more than 95% of cases. It looks very suspicious. Is the first batch used for non-parallel tests now?

clickhouse-client --port 19000 --query "SELECT 1" && break
sleep 1
done
fi
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You forgot to call setup_logs_replication for this instance, and while you are here, can you fix this for USE_DATABASE_REPLICATED as well

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxknv Most likely the test in this group is running out of memory, probably because this test is now running in parallel with other tests. I'll fix it.
@azat I'll will fix setup_logs_replication but fixing USE_DATABASE_REPLICATED does not make sense, since the number of these builds and the time to run the tests does not need to be optimized

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant add setup_logs_replication there as well

if result.reason is not None:
description_full += " - "
description_full += result.reason.value
description_full += f"\nReason: {result.reason.value} "
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you prefer to use \n it became too verbose, let's get compact behavior back - #75530

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-build Pull request with build/testing/packaging improvement pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants