Fix export task not being killed during s3 outage #1564

Merged
zvonand merged 2 commits into antalya-26.1 from
fix_s3_outage_preventing_export_from_being_cancelled
Mar 27, 2026

@arthurpassos
Collaborator

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

The drop table operation must signal cancellation to all background tasks and wait until they acknowledge it. Tasks do this by checking the is_cancelled flag at each pipeline iteration. If S3 is unreachable and s3_retry_attempts is large (500 by default), the pipeline gets stuck deep inside the AWS SDK and never gets a chance to check the flag, which makes the task effectively "unkillable".

This PR fixes it in a hackish way by overwriting query_is_cancelled_predicate, which the S3 client retry strategy checks in ShouldRetry.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • S3 Export (2h)
  • Swarms (30m)
  • Tiered Storage (2h)

@arthurpassos
Collaborator Author

@codex review

@github-actions

github-actions bot commented Mar 20, 2026

Workflow [PR], commit [f15590e]

arthurpassos added the antalya, port-antalya, and antalya-26.1 labels Mar 20, 2026
@arthurpassos
Collaborator Author

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1564 (export-task cancellation during S3 outage):
Reviewed diff: PR page, raw diff

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: src/Common/ThreadStatus.h and src/Storages/MergeTree/ExportPartTask.cpp, including call path into CurrentThread::get().isQueryCanceled() used by S3 retry logic.
Categories failed: none.
Categories passed: call-graph consistency, transition invariants, branch outcomes (success/cancel/exception), cancellation propagation to S3 retry, multithreaded interleavings (task thread vs cancel caller), rollback/partial-update safety in exception path, C++ bug classes (lifetime/race/deadlock/exception-safety/UB) within changed code.
Assumptions/limits: static audit only (no runtime fault injection execution), constrained to the provided PR diff and directly affected call paths.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.

@arthurpassos
Collaborator Author

I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

@arthurpassos
Collaborator Author

> I'll see if I can add tests (I actually already have those, but for some reason they were not failing :))

I think I know why: blocking S3 communication with iptables was probably throwing a non-retryable exception, so the export failed fast and the issue never showed up.
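
A rough sketch of why that would hide the bug, assuming a retry loop that stops at the first non-retryable error. `ErrorKind` and `attemptsUsed` are hypothetical names for this example, and which real failures the SDK classifies as retryable is not established in this thread.

```cpp
#include <functional>

// Hypothetical error classes; the real classification is done by the SDK.
enum class ErrorKind { None, Retryable, NonRetryable };

// Runs `request` up to `max_attempts` times and returns the attempts used.
// Success or a non-retryable error ends the loop immediately.
int attemptsUsed(const std::function<ErrorKind()> & request, int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; ++attempt)
    {
        ErrorKind result = request();
        if (result == ErrorKind::None || result == ErrorKind::NonRetryable)
            return attempt;
    }
    return max_attempts;
}
```

Under these assumptions a non-retryable failure costs a single attempt, so the task ends quickly, while a retryable one consumes the full budget and keeps the thread busy inside the loop.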

Member

@Enmk Enmk left a comment

LGTM

@arthurpassos
Collaborator Author

#1559

@Selfeer
Collaborator

Selfeer commented Mar 27, 2026

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1564

No confirmed defects in reviewed scope.

@Selfeer
Collaborator

Selfeer commented Mar 27, 2026

PR #1564 — CI verification

S3 export part (regression): This report shows many failures in that suite. The same tests are green here: https://github.com/Altinity/clickhouse-regression/actions/runs/23611564964

Parquet: Fails on this upstream report; green in another run — treat as flake/infra, not a PR regression.

New vs base (upstream “Checks New Fails”):

  • 01111_create_drop_replicated_db_stress (stateless, amd_debug / s3 sequential)
    Error: QUERY_WAS_CANCELLED — "Query is killed in pending state" leaked to stdout during concurrent Replicated DB create/drop stress.
    Known upstream flake — upstream PR #98465 explicitly filters this exception from the test. Earlier flaky reports: #97539, #51512, #95213. No relation to ExportPartTask / ThreadStatus.h changes.

  • test_restore_db_replica/...::test_query_after_restore_db_replica[rename table-no exists table-no restart] (integration asan 6/6)
    Error: Code: 60. Table test_create_table does not exist during DatabaseReplicated::renameTable — race condition in replica sync after DB restore.
    Known upstream flake — exact same failure documented in #92486; fix attempts in #94772, #96302. No relation to export-part cancellation code.

Likely unrelated to the export-task change; rerun if needed.

Other: Grype failed on the Alpine server image (CVE); non-Alpine Grype passed. Swarms + integration shards 5/6 and 3/5 have additional failures not listed as new vs base.

Selfeer added the verified label Mar 27, 2026
zvonand merged commit fdd27ed into antalya-26.1 Mar 27, 2026
258 of 284 checks passed
Labels

antalya, antalya-26.1, antalya-26.1.6.20001, port-antalya, verified