Skip to content

Fix 02834_apache_arrow_abort flakiness with MSAN#65640

Merged
Avogar merged 6 commits intomasterfrom
ms
Jul 8, 2024
Merged

Fix 02834_apache_arrow_abort flakiness with MSAN#65640
Avogar merged 6 commits intomasterfrom
ms

Conversation

@al13n321
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Failed like this: https://s3.amazonaws.com/clickhouse-test-reports/58934/8b510a67f957c2373f4126d2884614ff7066f1b5/stress_test__msan_.html

It was trying to read the whole 12 GB file into memory at once, and the 12 GB memory allocation failed because apparently MSAN has 8 GiB limit. This PR switches to a smaller 133 MB file.

Why did it try to download the whole file at once? It's not supposed to do that. Downloading the whole file is the fallback that happens in 2 cases:

  • Setting input_format_allow_seeks was set to false. Does stress test randomize it? Doesn't seem so.
  • The HTTP HEAD request got a response (from AWS) with code in [400, 499], except 429 (Too Many Requests). I'm guessing this is what happened, but no idea which code it was. This PR adds logging for that, and excludes HTTP_REQUEST_TIMEOUT and HTTP_MISDIRECTED_REQUEST.

@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 25, 2024
@robot-ch-test-poll3
Copy link
Copy Markdown
Contributor

robot-ch-test-poll3 commented Jun 25, 2024

This is an automated comment for commit d394278 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integration tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc❌ failure
Successful checks
Check nameDescriptionStatus
BuildsThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success

-- This tests depends on internet access, but it does not matter, because it only has to check that there is no abort due to a bug in Apache Arrow library.

INSERT INTO TABLE FUNCTION url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.parquet') SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/hits.parquet'); -- { serverError CANNOT_WRITE_TO_OSTREAM, RECEIVED_ERROR_FROM_REMOTE_IO_SERVER, POCO_EXCEPTION }
INSERT INTO TABLE FUNCTION url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_9.parquet') SELECT * FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/hits_compatible/athena_partitioned/hits_9.parquet'); -- { serverError CANNOT_WRITE_TO_OSTREAM, RECEIVED_ERROR_FROM_REMOTE_IO_SERVER, POCO_EXCEPTION }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but maybe reproducing the issue needs a large file?
@Avogar can confirm it, but we can simply add no-msan for this test if that's the case

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I assumed the test is just about failing to write to this URL at all. Normally the query would fail and stop long before we try to write 133 MB.

Copy link
Copy Markdown
Member

@Avogar Avogar Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed the test is just about failing to write to this URL at all

Yes, this is right. No need for large file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is still flaky

The PR was merged just a few hours ago, don't think repors from the link contain the changes

@al13n321
Copy link
Copy Markdown
Member Author

Flaky check failed because one of the runs took 248 seconds because the HTTP write request timed out, presumably the timeout is 240 seconds (some TCP timeout?). I'll just assume that the network and aws are unreliable and this is normal. (Would be cool to be able to investigate things like this and find the root cause, somewhere in the network infrastructure presumably, but I don't have the skill and it seems low-priority and likely futile.)

@Avogar Avogar self-assigned this Jul 4, 2024
Comment on lines +716 to +723
e.getHTTPStatus() != Poco::Net::HTTPResponse::HTTP_TOO_MANY_REQUESTS &&
e.getHTTPStatus() != Poco::Net::HTTPResponse::HTTP_REQUEST_TIMEOUT &&
e.getHTTPStatus() != Poco::Net::HTTPResponse::HTTP_MISDIRECTED_REQUEST)
{
LOG_DEBUG(log,
"HEAD request to '{}'{} failed with HTTP status {}",
initial_uri.toString(), current_uri == initial_uri ? String() : fmt::format(" redirect to '{}'", current_uri.toString()),
e.getHTTPStatus());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it used for debugging? Let's remove it and merge the PR?

@Avogar Avogar added this pull request to the merge queue Jul 8, 2024
Merged via the queue into master with commit 822918a Jul 8, 2024
@Avogar Avogar deleted the ms branch July 8, 2024 14:15
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants