Fix fault injection in copier and test_cluster_copier flakiness #46120

Merged
alexey-milovidov merged 2 commits into ClickHouse:master from azat:copier-fixes on Feb 8, 2023
azat commented Feb 7, 2023

Let's fix one more long-standing flaky test, on the road to PR auto-merge.

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix fault injection in copier (--copy-fault-probability) and test_cluster_copier test flakiness

The test_cluster_copier test is flaky very often; here is an example of a copier failure on CI [1]:

AssertionError: Instance: s0_1_0 (172.16.29.9). Info: {'ID': '5d68dcb46fdb4b0c54b7c7ba1ddde83b8f34d483bbb32abcb0c52b966444ce82', 'Running': False, 'ExitCode': 85, 'ProcessConfig': {'tty': False, 'entrypoint': '/usr/bin/clickhouse', 'arguments': ['copier', '--config', '/etc/clickhouse-server/config-copier.xml', '--task-path', '/clickhouse-copier/task_simple_4DFWYTDD49', '--task-file', '/task0_description.xml', '--task-upload-force', 'true', '--base-dir', '/var/log/clickhouse-server/copier', '--copy-fault-probability', '0.2', '--experimental-use-sample-offset', '1'], 'privileged': False, 'user': '0'}, 'OpenStdin': False, 'OpenStderr': True, 'OpenStdout': True, 'CanRemove': False, 'ContainerID': 'f356df6694b3cc09ee9830c623681626f8e8d999677c188b9fe911aa702784ca', 'DetachKeys': '', 'Pid': 84332}
assert 85 == 0

But let's see which error this actually is. The exit status is the error code modulo 256, so apparently it is UNFINISHED (341 % 256 = 85):

SELECT
    name,
    code
FROM system.errors
WHERE ((code % 256) = 85) AND (NOT remote)
SETTINGS system_events_show_zero_values = 1

┌─name─────────────────────────────┬─code─┐
│ FORMAT_IS_NOT_SUITABLE_FOR_INPUT │   85 │
│ UNFINISHED                       │  341 │
│ NO_SUCH_ERROR_CODE               │  597 │
└──────────────────────────────────┴──────┘
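The lookup above relies on a process exit status keeping only the low 8 bits of the value the program returns. A minimal Python sketch of that arithmetic follows; the three error codes are copied from the table above, while the helper names `exit_status` and `candidates_for` are hypothetical and exist only for illustration.

```python
# Error codes copied from the system.errors output above.
CANDIDATES = {
    "FORMAT_IS_NOT_SUITABLE_FOR_INPUT": 85,
    "UNFINISHED": 341,
    "NO_SUCH_ERROR_CODE": 597,
}


def exit_status(error_code):
    """Return the 8-bit exit status a parent process observes.

    Assumption: the copier returns the exception code as its exit code,
    and the OS truncates it to the low 8 bits.
    """
    return error_code % 256


def candidates_for(status):
    """List every known error name whose code maps to the given exit status."""
    return [name for name, code in CANDIDATES.items() if exit_status(code) == status]


print(candidates_for(85))
```

All three names map to exit status 85, which is why the copier log still has to be checked to tell them apart.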

Let's verify:

$ grep -r UNFINISHED ./test_cluster_copier/_instances_0/s0_1_0/logs/copier/clickhouse-copier_*
./test_cluster_copier/_instances_0/s0_1_0/logs/copier/clickhouse-copier_20230206220846_368/log.log:2023.02.06 22:09:19.015251 [ 368 ] {} <Error> : virtual int DB::ClusterCopierApp::main(const std::vector<std::string> &): Code: 341. DB::Exception: Too many tries to process table cluster1.default.hits. Abort remaining execution. (UNFINISHED), Stack trace (when copying this message, always include the lines below):
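The same check can be scripted instead of read by eye. The sketch below pulls the numeric code and the symbolic name out of a log line and confirms that the code matches the observed exit status. The sample line is an excerpt of the log entry above; the regular expression and the `parse_error` helper are assumptions made for illustration, not part of ClickHouse or its test suite.

```python
import re

# Matches "Code: <number>. DB::Exception: <message> (<ERROR_NAME>)".
# The name group accepts only upper-case letters and underscores, so
# lower-case parentheticals inside the message are skipped.
LOG_PATTERN = re.compile(r"Code: (\d+)\. DB::Exception: .*? \(([A-Z_]+)\)")


def parse_error(line):
    """Return (code, name) for a copier log line, or None if nothing matches."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


# Excerpt of the log line shown above.
line = (
    "Code: 341. DB::Exception: Too many tries to process table "
    "cluster1.default.hits. Abort remaining execution. (UNFINISHED), Stack trace"
)

code, name = parse_error(line)
print(code, name, code % 256)
```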

And apparently it is caused by a query syntax error that shows up with fault injection:

2023.02.06 22:09:15.654724 [ 368 ] {} <Error> Application: An error occurred while processing partition 0: Code: 62. DB::Exception: Syntax error (Query): failed at position 168 ('Native'): Native. Expected one of: token, Dot, OR, AND, BETWEEN, NOT BETWEEN, LIKE, ILIKE, NOT LIKE, NOT ILIKE, IN, NOT IN, GLOBAL IN, GLOBAL NOT IN, MOD, DIV, IS NULL, IS NOT NULL, alias, AS, Comma, OFFSET, WITH TIES, BY, LIMIT, SETTINGS, UNION, EXCEPT, INTERSECT, INTO OUTFILE, FORMAT, end of query. (SYNTAX_ERROR), Stack trace (when copying this message, always include the lines below):

Example (note the missing space before FORMAT):

select x from x limit  1FORMAT Native

Syntax error: failed at position 32 ('Native'):

So fixing this should fix test_cluster_copier flakiness.
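To make the failure mode concrete, here is a small Python sketch of the string concatenation involved. It is not the copier's C++ code: the function names, the `SELECT x FROM x` query, and the `if limit` guard are assumptions for illustration. In particular, it assumes LIMIT is appended only on the fault-injection path, which the description above suggests but does not show.

```python
def build_select_buggy(limit=""):
    """Build the query the way the broken code did: no space before FORMAT."""
    query = "SELECT x FROM x"
    if limit:  # assumption: LIMIT is only added when a fault is injected
        query += " LIMIT " + limit
    query += "FORMAT Native"  # bug: glued onto the preceding token
    return query


def build_select_fixed(limit=""):
    """Same builder with the leading space restored."""
    query = "SELECT x FROM x"
    if limit:
        query += " LIMIT " + limit
    query += " FORMAT Native"
    return query


print(build_select_buggy("1"))  # SELECT x FROM x LIMIT 1FORMAT Native
print(build_select_fixed("1"))  # SELECT x FROM x LIMIT 1 FORMAT Native
```

With a limit of `1`, the buggy builder produces `1FORMAT`, which matches the shape of the query in the syntax error above. The fix is the single added space.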

Fixes: #30399 (cc @nikitamikhaylov @alexey-milovidov @qoega )

azat added 2 commits February 7, 2023 16:53

  [1]: https://s3.amazonaws.com/clickhouse-test-reports/46045/bd4170e03c6af583a51d12d2c39fa775dcb9997b/integration_tests__release__[4/4].html

Signed-off-by: Azat Khuzhin <[email protected]>

azat commented Feb 8, 2023

Test failures unrelated:

Integration tests (release) [2/4] — fail: 1, passed: 547, flaky: 0

alexey-milovidov self-assigned this Feb 8, 2023
     query += " LIMIT " + limit;

-    query += "FORMAT Native";
+    query += " FORMAT Native";
Hilarious!

alexey-milovidov merged commit 8d4a981 into ClickHouse:master Feb 8, 2023
azat deleted the copier-fixes branch February 8, 2023 09:16
Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Development

Successfully merging this pull request may close these issues.

Flaky test: test_cluster_copier