Fix export task not being killed during S3 outage #1564
Conversation
@codex review
AI audit note: This review comment was generated by AI (gpt-5.3-codex). Audit update for PR #1564 (export-task cancellation during S3 outage): no confirmed defects in reviewed scope. Scope reviewed: src/Common/ThreadStatus.h and src/Storages/MergeTree/ExportPartTask.cpp, including the call path into CurrentThread::get().isQueryCanceled() used by the S3 retry logic.
Codex Review: Didn't find any major issues. Chef's kiss.
I'll see if I can add tests (I actually already have those, but for some reason they were not failing :)).
I think I know why: blocking S3 communication with iptables was probably throwing a non-retryable exception, so the export failed fast and no issue surfaced at all.
AI audit note: This review comment was generated by AI (gpt-5.3-codex). Audit update for PR #1564: no confirmed defects in reviewed scope.
PR #1564 — CI verification
- S3 export part (regression): this report shows many failures in that suite, but the same tests are green here: https://github.com/Altinity/clickhouse-regression/actions/runs/23611564964
- Parquet: fails on this upstream report but is green in another run; treat as flake/infra, not a PR regression.
- New vs base (upstream "Checks New Fails"): likely unrelated to the export-task change; rerun if needed.
- Other: Grype failed on the Alpine server image (CVE); the non-Alpine Grype check passed. Swarms plus integration shards 5/6 and 3/5 have additional failures not listed as new vs base.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
The drop table operation must signal cancellation to all background tasks and wait until they acknowledge it. This is done by checking the `is_cancelled` flag at each pipeline iteration. If S3 is unreachable and `s3_retries_attempt` is big (by default, it is 500), the pipeline gets stuck deep in the AWS SDK and never gets a chance to check the signal/flag, making the task "unkillable". This PR fixes it in a hackish way by overriding the `query_is_cancelled_predicate`, which is checked by the S3 client retry strategy upon `ShouldRetry`.
Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: