Export Part/Partition integration tests (PR #1388) failing consistently under ASAN builds #112
Description
Summary
Integration tests introduced by Altinity/ClickHouse#1388 ("Forward port of export part and partition") are failing consistently in ASAN builds. These failures are polluting CI reports for all open PRs on antalya-26.1 since the tests were merged into the base branch on Mar 4.
Example: CI Report from #1405
Affected Tests
test_export_merge_tree_part_to_object_storage (2 of 7 tests failing):
- test_add_column_during_export
- test_drop_column_during_export_snapshot
test_export_replicated_mt_partition_to_object_storage (14 of 20 tests failing):
- test_concurrent_exports_to_different_targets
- test_drop_source_table_during_export
- test_export_partition_file_already_exists_policy
- test_export_partition_permissions
- test_export_partition_with_mixed_computed_columns
- test_export_ttl
- test_failure_is_logged_in_system_table
- test_inject_short_living_failures
- test_multiple_exports_within_a_single_query
- test_mutation_in_partition_clause
- test_mutations_after_export_partition_started
- test_patch_parts_after_export_partition_started
- test_pending_mutations_skip_before_export_partition
- test_pending_patch_parts_skip_before_export_partition
Failure Pattern
| Build type | Result |
|---|---|
| amd_binary | 100% OK |
| amd_tsan | 100% OK |
| arm_binary | 100% OK |
| amd_asan (non-targeted) | ~50% FAIL |
| amd_asan (targeted, --count 10) | ~90% FAIL |
What is the "targeted" job?
The Integration tests (amd_asan, targeted) job automatically selects tests relevant to a PR's changed files, using DWARF debug info to map changed code lines to test coverage. It also re-runs previously failed tests from the CI database. Each selected test is executed 10 times (--count 10) under ASAN to detect flaky or non-deterministic behavior.
PR #1388 ran tests only with amd_binary and arm_binary before being merged; no ASAN builds were executed.
Root Cause
Two separate issues:
1. Timeout under ASAN (non-targeted jobs)
Operations like ALTER TABLE ... DROP COLUMN during export exceed the 600-second query timeout when running under ASAN instrumentation (~2–3× overhead). Example from PR #1405 amd_asan 2/6:
subprocess.TimeoutExpired: Command '[...clickhouse client...]' timed out after 600 seconds.
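A minimal sketch of how a per-build timeout multiplier could work, assuming the test can learn which sanitizer the binary was built with. The names `scaled_timeout` and `SANITIZER_TIMEOUT_MULTIPLIERS` are hypothetical and not part of the existing test helpers; the 3× factor simply mirrors the upper end of the overhead estimate above.

```python
# Base query timeout in seconds, taken from the failure quoted above.
BASE_QUERY_TIMEOUT_S = 600

# Hypothetical multipliers per sanitizer build. Only ASAN is listed because
# it is the only build type reported as failing in this issue.
SANITIZER_TIMEOUT_MULTIPLIERS = {"asan": 3.0}


def scaled_timeout(base_s, sanitizer=""):
    """Return base_s scaled by the multiplier for the given sanitizer.

    `sanitizer` is a short name such as "asan" or "" for a plain build.
    How the test discovers this value is outside the scope of this sketch.
    """
    multiplier = SANITIZER_TIMEOUT_MULTIPLIERS.get(sanitizer.lower(), 1.0)
    return int(base_s * multiplier)


if __name__ == "__main__":
    print(scaled_timeout(BASE_QUERY_TIMEOUT_S, "asan"))  # 1800
    print(scaled_timeout(BASE_QUERY_TIMEOUT_S))          # 600
```

Keeping the multiplier in one table would let other build types be tuned later without touching individual tests.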
2. Non-idempotent test design (targeted job)
Tests use hardcoded table names (e.g., add_column_during_export_mt_table) without DROP TABLE IF EXISTS in setup. When the targeted job runs with --count 10, the first iteration may fail or leave tables behind, causing subsequent iterations to fail with:
Code: 57. DB::Exception: Table default.add_column_during_export_mt_table already exists. (TABLE_ALREADY_EXISTS)
The targeted job picks up these tests because it re-runs "previously failed tests" from the CI database (CIDB), so the initial ASAN timeout failure triggers a feedback loop:
fail → targeted picks it up → fails 10× → targeted picks it up again.
Example: Integration tests (amd_asan, targeted) from #1405
Impact
Since PR #1388 was merged into antalya-26.1 on Mar 4, every PR that updates its branch inherits these tests. Observed across PRs:
- Antalya 26.1 - Forward port of list objects cache #1040 ClickHouse#1405
- 26.1 Antalya port - Alternative syntax for cluster functions ClickHouse#1390
- 26.1 Antalya port - fixes for s3Cluster distributed calls ClickHouse#1395
- 26.1 Antalya port - improvements for cluster requests ClickHouse#1414
- Improvements to partition export ClickHouse#1402
- Antalya 26.1 - fix hang in arm integration tests ClickHouse#1466
Tests Passing Without Issues
The following tests from PR #1388 pass consistently across all build types, including ASAN:
- test_data_mutations_after_export_started
- test_pending_mutations_skip_before_export
- test_pending_mutations_throw_before_export
- test_pending_patch_parts_skip_before_export
- test_pending_patch_parts_throw_before_export
- test_export_partition_feature_is_disabled
- test_kill_export
- test_pending_mutations_throw_before_export_partition
- test_pending_patch_parts_throw_before_export_partition
- test_restart_nodes_during_export
Suggested Fixes
- Increase query timeouts for export tests or add ASAN-aware timeout multipliers
- Add DROP TABLE IF EXISTS / CREATE TABLE IF NOT EXISTS to test setup for idempotency with targeted --count 10 reruns
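The idempotent setup could look roughly like the sketch below. `run_query` is a hypothetical callable that takes one SQL string and stands in for whatever the tests use to send queries to a node; `recreate_table` and `unique_table_name` are illustrative names, not existing helpers. The per-run suffix is an optional extra beyond the suggestion above, for cases where a leftover table cannot be dropped.

```python
import uuid


def unique_table_name(prefix):
    """Append a random 8-character suffix so repeated runs never reuse a name."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


def recreate_table(run_query, table, create_sql):
    """Drop any table left behind by an earlier iteration, then create it.

    run_query: hypothetical callable that executes a single SQL statement.
    """
    run_query(f"DROP TABLE IF EXISTS {table}")
    run_query(create_sql)


if __name__ == "__main__":
    # Record the statements instead of sending them to a server.
    issued = []
    table = "add_column_during_export_mt_table"
    recreate_table(
        issued.append,
        table,
        f"CREATE TABLE {table} (id UInt64) ENGINE = MergeTree ORDER BY id",
    )
    for statement in issued:
        print(statement)
```

With the drop issued first, a failed iteration under --count 10 should no longer cause the next one to hit TABLE_ALREADY_EXISTS.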