Fix export task not being killed during s3 outage by arthurpassos · Pull Request #1564 · Altinity/ClickHouse

arthurpassos · 2026-03-20T15:26:08Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

The drop table operation must signal cancellation to all background tasks and wait until they acknowledge it. This is done by checking the is_cancelled flag at each pipeline iteration. If S3 is unreachable and s3_retry_attempts is large (500 by default), the pipeline gets stuck deep in the AWS SDK and never gets a chance to check the flag, which makes the task "unkillable".

This PR fixes it in a hackish way by overwriting the query_is_cancelled_predicate, which is checked by the S3 client retry strategy upon ShouldRetry.
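
To make the mechanism concrete, here is a minimal, self-contained sketch of a retry loop that consults a cancellation predicate before each retry. It is not the ClickHouse or AWS SDK code: `RetryStrategySketch`, `runFailingRequest`, and the `cancel_after` parameter are hypothetical names invented for illustration, and the only things taken from the description above are the idea of a predicate checked in the retry decision and the limit of 500 attempts.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for a retry strategy. Not the AWS SDK or ClickHouse class.
struct RetryStrategySketch
{
    long max_attempts;                   // plays the role of a large retry limit
    std::function<bool()> is_cancelled;  // optional cancellation predicate, may be empty

    // Decide whether to try again after `attempted` failed attempts.
    bool shouldRetry(long attempted) const
    {
        if (is_cancelled && is_cancelled())
            return false;                // cancellation observed: stop retrying at once
        return attempted < max_attempts;
    }
};

// Simulates a request to an unreachable endpoint, where every attempt fails.
// After `cancel_after` attempts the flag is set, imitating a cancellation that
// arrives while the request is still being retried.
// Returns the number of attempts made before giving up.
long runFailingRequest(const RetryStrategySketch & strategy, std::atomic<bool> & cancel_flag, long cancel_after)
{
    assert(cancel_after >= 1);
    long attempts = 0;
    do
    {
        ++attempts;                      // the attempt itself always fails in this simulation
        if (attempts == cancel_after)
            cancel_flag.store(true);     // in a real server another thread would set this
    } while (strategy.shouldRetry(attempts));
    return attempts;
}
```

With an empty predicate the loop above runs through all 500 attempts even though the flag was set early, which mirrors the "unkillable" behavior. With a predicate that reads the flag, it stops on the attempt where cancellation becomes visible.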

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

Regression jobs to run:

arthurpassos · 2026-03-20T15:26:25Z

@codex review

github-actions · 2026-03-20T15:27:24Z

Workflow [PR], commit [f15590e]

arthurpassos · 2026-03-20T15:29:24Z

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1564 (export-task cancellation during S3 outage):
Reviewed diff: PR page, raw diff

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: src/Common/ThreadStatus.h and src/Storages/MergeTree/ExportPartTask.cpp, including call path into CurrentThread::get().isQueryCanceled() used by S3 retry logic.
Categories failed: none.
Categories passed: call-graph consistency, transition invariants, branch outcomes (success/cancel/exception), cancellation propagation to S3 retry, multithreaded interleavings (task thread vs cancel caller), rollback/partial-update safety in exception path, C++ bug classes (lifetime/race/deadlock/exception-safety/UB) within changed code.
Assumptions/limits: static audit only (no runtime fault injection execution), constrained to the provided PR diff and directly affected call paths.

chatgpt-codex-connector · 2026-03-20T15:38:37Z

Codex Review: Didn't find any major issues. Chef's kiss.


arthurpassos · 2026-03-20T15:43:06Z

I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

arthurpassos · 2026-03-20T15:43:44Z

> I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

I think I know why. Blocking S3 communication with iptables was probably throwing a non-retryable exception, so the export failed fast and the hang never reproduced.
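
A small sketch of why such a test would pass without the fix, under the assumption stated in the comment above that the blocked connection surfaces as a non-retryable error. The `FailureKind` enum and `attemptsUntilGiveUp` function are hypothetical; which errors the real S3 client classifies as retryable is not stated in this thread.

```cpp
#include <cassert>

// Hypothetical classification of a failed request.
enum class FailureKind
{
    Retryable,     // e.g. a failure the client is willing to try again
    NonRetryable   // a failure that aborts the request immediately
};

// Counts how many attempts are made when every attempt fails with `kind`.
long attemptsUntilGiveUp(FailureKind kind, long max_attempts)
{
    assert(max_attempts >= 1);
    long attempts = 0;
    while (true)
    {
        ++attempts;
        if (kind == FailureKind::NonRetryable)
            return attempts;             // fails fast: the retry loop is never entered
        if (attempts >= max_attempts)
            return attempts;             // retry budget exhausted
    }
}
```

In this model a non-retryable failure ends after a single attempt, so the task returns to its cancellation check almost immediately and the stuck-in-retries path is never exercised. Reproducing the bug would need a fault that the client keeps retrying.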

src/Storages/MergeTree/ExportPartTask.cpp

Enmk

LGTM

arthurpassos · 2026-03-27T12:13:38Z

#1559

Selfeer · 2026-03-27T13:01:05Z

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1564

No confirmed defects in reviewed scope.

Selfeer · 2026-03-27T13:12:32Z

PR #1564 — CI verification

PR: #1564, Fix export task not being killed during S3 outage
Run: https://github.com/Altinity/ClickHouse/actions/runs/23358399348 — commit f15590ebb56d26f7360190c5ac3a844be87e5f9c
Report: https://s3.amazonaws.com/altinity-build-artifacts/PRs/1564/f15590ebb56d26f7360190c5ac3a844be87e5f9c/23358399348/ci_run_report.html

S3 export part (regression): This report shows many failures in that suite. The same tests are green here: https://github.com/Altinity/clickhouse-regression/actions/runs/23611564964

Parquet: Fails on this upstream report; green in another run — treat as flake/infra, not a PR regression.

New vs base (upstream “Checks New Fails”):

01111_create_drop_replicated_db_stress (stateless, amd_debug / s3 sequential)
test_restore_db_replica/...::test_query_after_restore_db_replica[rename table-no exists table-no restart] (integration asan 6/6)

01111_create_drop_replicated_db_stress (stateless, amd_debug / s3 sequential)
Error: QUERY_WAS_CANCELLED. "Query is killed in pending state" leaked to stdout during concurrent Replicated DB create/drop stress.
Known upstream flake: upstream PR #98465 explicitly filters this exception from the test. Earlier flaky reports: #97539, #51512, #95213. No relation to the ExportPartTask / ThreadStatus.h changes.

test_restore_db_replica/...::test_query_after_restore_db_replica[rename table-no exists table-no restart] (integration asan 6/6)
Error: Code: 60. Table test_create_table does not exist during DatabaseReplicated::renameTable. This is a race condition in replica sync after DB restore.
Known upstream flake: the exact same failure is documented in #92486, with fix attempts in #94772 and #96302. No relation to the export-part cancellation code.

Likely unrelated to the export-task change; rerun if needed.

Other: Grype failed on the Alpine server image (CVE); non-Alpine Grype passed. Swarms + integration shards 5/6 and 3/5 have additional failures not listed as new vs base.

fix export task not being killed during s3 outage

513bfb8

arthurpassos added the antalya, port-antalya (PRs to be ported to all new Antalya releases), and antalya-26.1 labels Mar 20, 2026

Enmk reviewed Mar 20, 2026

src/Storages/MergeTree/ExportPartTask.cpp (resolved)

use weakptr to be on the safe side

f15590e
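
The commit message above mentions a weak pointer but the diff is not reproduced on this page, so the following is only a sketch of the general pattern: a cancellation predicate that holds its target weakly so it cannot keep the target alive or touch it after destruction. `TaskStateSketch` and `makeCancelPredicate` are hypothetical names, and treating an expired target as "cancelled" is an assumption made here, not something the PR states.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical shared state of a background task.
struct TaskStateSketch
{
    std::atomic<bool> cancelled{false};
};

// Builds a predicate that can outlive the task safely.
// It captures a weak_ptr, so it neither extends the task's lifetime
// nor dereferences a dangling pointer.
std::function<bool()> makeCancelPredicate(const std::shared_ptr<TaskStateSketch> & state)
{
    std::weak_ptr<TaskStateSketch> weak = state;
    return [weak]
    {
        auto locked = weak.lock();       // null if the task has already been destroyed
        // Assumption: a destroyed task is treated the same as a cancelled one.
        return !locked || locked->cancelled.load();
    };
}
```

Capturing a `shared_ptr` instead would also avoid dangling access, but it would keep the state alive for as long as the predicate exists, which may not be wanted for a callback stored on a long-lived thread object.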

Enmk approved these changes Mar 23, 2026

svb-alt added the antalya-26.1.6.20001 label Mar 25, 2026

Selfeer added the verified (Approved for release) label Mar 27, 2026

Selfeer mentioned this pull request Mar 27, 2026

Antalya 26.1 Backport of #96620 - Iceberg partitioing fix #1579

Merged

27 tasks

zvonand merged commit fdd27ed into antalya-26.1 Mar 27, 2026
258 of 284 checks passed

This was referenced Mar 27, 2026

Antalya 26.1 Backport of #99548 - Parallelize object storage output #1580

Merged

Export Partition - release the part lock when the query is cancelled #1593

Merged

zvonand merged 2 commits into antalya-26.1 from fix_s3_outage_preventing_export_from_being_cancelled

5 participants
