CI: combine ASan+UBSan into single builds, remove UBSan #99657
Conversation
…d ARM TSan

- Replace `AMD_ASAN` with `AMD_ASAN_UBSAN` (`-DSANITIZE=address,undefined`) and add `ENABLE_BUZZHOUSE=1`
- Replace `ARM_ASAN` with `ARM_ASAN_UBSAN` (`-DSANITIZE=address,undefined`) and add `ENABLE_BUZZHOUSE=1`
- Remove the `AMD_UBSAN` build and all jobs that depended on it (redundant now that UBSan is included in both ASan builds)
- Disable the `ARM_TSAN` build job (commented out; `AMD_TSAN` covers data races, which are architecture-independent)
- Add an `address,undefined` branch to `cmake/sanitize.cmake` combining the ASan and UBSan flags
- Regenerate `.github/workflows/` via `python3 -m ci.praktika yaml`

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Workflow [PR], commit [8e04136] Summary: ❌
- Add an `address,undefined` branch to `build_clang_builtin.cmake` to include both the ASan (`asan_static`, `asan`, `asan_cxx`) and UBSan (`ubsan_standalone`, `ubsan_standalone_cxx`) runtimes when linking host tools such as `protoc`. Without this, the linker fails with undefined `__asan_*` and `__ubsan_*` symbols because `-fno-sanitize-link-runtime` is applied but no runtime is provided.
- Add an `address,undefined` branch to `cmake/sanitize.cmake` combining the ASan and UBSan compiler flags.
- Fix `collect_build_flags` in `clickhouse-test` to parse all sanitizer values from `-fsanitize=...` instead of using an `elif` chain, so both the `asan` and `ubsan` tags are set for combined builds.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
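The `collect_build_flags` fix described above could look roughly like the following sketch. This is a hedged illustration, not the actual `clickhouse-test` code: the function name `sanitizer_tags`, the mapping, and the input format are assumptions for the example.

```python
import re

# Sketch: collect ALL sanitizer tags from a -fsanitize=... flag, instead of
# an elif chain that can only ever set one tag. With a combined
# -fsanitize=address,undefined build, both "asan" and "ubsan" must be set.
SANITIZER_TAG_MAP = {
    "address": "asan",
    "undefined": "ubsan",
    "thread": "tsan",
    "memory": "msan",
}

def sanitizer_tags(build_flags: str) -> set:
    tags = set()
    for match in re.finditer(r"-fsanitize=([\w,-]+)", build_flags):
        for value in match.group(1).split(","):
            tags.add(SANITIZER_TAG_MAP.get(value, value))
    return tags
```

For example, `sanitizer_tags("-O2 -fsanitize=address,undefined")` yields both `asan` and `ubsan`, while an `elif` chain keyed on the first match would report only one of them.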
libclang_rt.asan and libclang_rt.ubsan_standalone both contain sanitizer_common symbols, causing duplicate-symbol errors when linked together. When combining -fsanitize=address,undefined, clang uses the ASan runtime to cover the UBSan instrumentation as well, so only the ASan runtime libraries (asan_static, asan, asan_cxx) should be linked.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
    *[
        Job.ParamSet(
-           parameter=f"amd_asan, distributed plan, parallel, {batch}/{total_batches}",
+           parameter=f"amd_asan_ubsan, distributed plan, parallel, {batch}/{total_batches}",
This rename breaks the dependency filter in ci/workflows/pull_request.py.
FUNCTIONAL_TESTS_PARALLEL_BLOCKING_JOB_NAMES still matches "_asan, distributed plan, parallel", so the renamed jobs (amd_asan_ubsan, distributed plan, parallel) are no longer included. I validated this locally: both amd_asan_ubsan parallel jobs are now excluded from the blocking list.
Impact: release/special/integration jobs can start without waiting for these heavy parallel sanitizer tests, which can increase CI contention and make scheduling behavior inconsistent.
Please update the filter in ci/workflows/pull_request.py to match "_asan_ubsan, distributed plan, parallel" (or use a sanitizer-agnostic predicate).
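The "sanitizer-agnostic predicate" suggested above could be sketched as follows. This is a hypothetical illustration, not the actual code in `ci/workflows/pull_request.py` (which, per the review, uses plain substring matching); the function name is invented for the example.

```python
import re

# Sketch: match any "<arch>_asan*" parallel distributed-plan job, so a
# future rename (asan -> asan_ubsan -> ...) does not silently drop the job
# from the blocking list again.
def is_blocking_parallel_sanitizer_job(job_name: str) -> bool:
    return bool(re.search(r"_asan\w*, distributed plan, parallel", job_name))
```

With this predicate, both the old `amd_asan, distributed plan, parallel, 1/2` and the renamed `amd_asan_ubsan, distributed plan, parallel, 1/2` job names are matched, while non-sanitizer jobs are not.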
contrib/CMakeLists.txt: extend the UBSan suppression for third-party libraries to also cover the combined address,undefined sanitizer. The old check (SANITIZE STREQUAL "undefined") didn't match the new value, causing OpenSSL and other contrib libraries to be compiled with UBSan, which aborted ClickHouse at startup due to type-mismatched function-pointer calls in OpenSSL (ossl_bsearch).

ci/workflows/pull_request.py: fix the FUNCTIONAL_TESTS_PARALLEL_BLOCKING_JOB_NAMES predicate from "_asan, distributed plan, parallel" to "_asan_ubsan, distributed plan, parallel" after the rename, so that release/special/integration jobs correctly wait for the heavy parallel sanitizer tests before starting.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
contrib/CMakeLists.txt: extend the UBSan suppression for third-party libraries to also cover the combined address,undefined sanitizer. The old check (SANITIZE STREQUAL "undefined") didn't match the new value, causing OpenSSL and other contrib libraries to be compiled with UBSan, which aborted ClickHouse at startup due to type-mismatched function-pointer calls in OpenSSL (ossl_bsearch).

contrib/isa-l-cmake, lz4-cmake, arrow-cmake: extend the same STREQUAL "undefined" guards that strip -fno-sanitize=undefined from NASM/target compile options to also cover address,undefined. Without this, nasm receives -fno-sanitize=undefined and interprets it as an output format, causing a fatal build error.

ci/workflows/pull_request.py: fix the FUNCTIONAL_TESTS_PARALLEL_BLOCKING_JOB_NAMES predicate from "_asan, distributed plan, parallel" to "_asan_ubsan, distributed plan, parallel" after the rename, so that release/special/integration jobs correctly wait for the heavy parallel sanitizer tests before starting.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
        ArtifactNames.CH_ARM_TSAN,
    ],
    runs_on=RunnerLabels.ARM_LARGE,
),
extra_validation_build_jobs no longer contains BuildTypes.ARM_TSAN, but ArtifactConfigs.clickhouse_binaries still includes CH_ARM_TSAN (from ci/defs/defs.py). This leaves CH_ARM_TSAN without a producer and drops ARM TSAN signal for PR/branch workflows.
Please either restore the ARM_TSAN Job.ParamSet here, or remove/update all CH_ARM_TSAN artifact wiring consistently in the same PR.
The build_llvm_coverage_job had no dependency set, so it could start immediately, without waiting for the style check, fast test, and tidy builds, unlike all other build jobs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The combined ASan+UBSan binary is larger than pure ASan: each clickhouse-client process uses ~350 MB RSS. Under parallel test mode with multiple concurrent clients, the 5 GB cgroup limit is exceeded, causing OOM kills. Raise it to 10 GB for asan_ubsan jobs only.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The is_coverage/JIT settings block was accidentally included from the working tree. Only the collect_build_flags refactoring (parsing all sanitizer values from -fsanitize=... instead of an elif chain) belongs to this branch.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
`clickhouse local` (used for post-test log collection) was aborting at exit under ASan+LSan with 45 KB of leaks from OpenSSL provider initialization (`ossl_provider_new`, `OSSL_PROVIDER_try_load_ex`, etc.). OpenSSL 3.x intentionally does not free all global state at exit, so these are false positives.

Added `__lsan_default_suppressions` alongside the existing `__lsan_default_options` in `programs/main.cpp` to suppress the five known OpenSSL leak patterns.

This fixes the "Unknown error / Lost connection to server" failures in:
- Stateless tests (amd_asan_ubsan, distributed plan, parallel, 1/2)
- Stateless tests (amd_asan_ubsan, distributed plan, parallel, 2/2)
- Stateless tests (arm_asan_ubsan, azure, sequential)

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99657&sha=24b9445c3f9c8dfc420b55f1a2fb369ae17220be&name_0=PR

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
        flag = sanitize_flag_map.get(san)
        if flag is not None:
            result.append(flag)
    SANITIZED = bool(result)
SANITIZED here is assigned without global SANITIZED, so this write only updates a local variable inside collect_build_flags.
kill_process_group reads the module-level SANITIZED flag, which therefore stays False and skips the 60s wait before termination. In sanitizer jobs this can truncate sanitizer reports.
Please add global SANITIZED in collect_build_flags (next to global RELEASE_NON_SANITIZED) before assigning it.
The `check_fatal_messages_in_logs` function uses `sed -n '/.*anitizer/,$p'`
to detect sanitizer output in `stderr*.log`. When ASan runs on a machine
where the stack address space wraps around, it emits:
    ==N==WARNING: ASan is ignoring requested __asan_handle_no_return:
    stack type: default; top: 0x...; bottom 0x...; size: 0x...
    False positive error reports may follow
    For details see google/sanitizers#189
These lines match `/.*anitizer/` but were not excluded by the existing
filter (which only excluded the `makecontext/swapcontext` line). As a
result `parse_failure()` found no actionable error pattern and returned
the fallback "Unknown error / Lost connection to server", causing
stateless test jobs to fail even when all tests passed.
Add the two additional boilerplate lines to the exclusion filter.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
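The filtering logic described above can be sketched as follows. This is a hedged Python illustration of the idea, not the actual shell `sed`/`grep` pipeline in `check_fatal_messages_in_logs`; the function name and boilerplate list are assumptions for the example.

```python
import re

# Known-harmless ASan boilerplate that matches /.*anitizer/ (e.g. the
# google/sanitizers#189 reference) but must not count as a real error.
BOILERPLATE_SUBSTRINGS = [
    "ASan is ignoring requested __asan_handle_no_return",
    "False positive error reports may follow",
    "For details see google/sanitizers",
]

def sanitizer_hits(stderr_lines):
    """Return stderr lines that look like real sanitizer output."""
    hits = [line for line in stderr_lines if re.search(r"anitizer", line)]
    return [h for h in hits
            if not any(b in h for b in BOILERPLATE_SUBSTRINGS)]
```

With the boilerplate excluded, a log containing only the stack-size warning block produces no hits, so the fallback "Unknown error / Lost connection to server" result is never triggered for a clean run.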
Two issues remained after the previous fix:
1. The ASan stack-size warning block (4 lines) was only partially
filtered in `clickhouse_proc.py`. The last line:
"For details see google/sanitizers#189"
was not excluded, leaving `sanitizer_hits` non-empty and triggering
a spurious "Unknown error / Lost connection to server" result for
jobs that had no real sanitizer errors. Add the missing exclusion.
2. OpenSSL's EVP method cache (`evp_md_new`, `construct_evp_method`,
`CRYPTO_THREAD_lock_new`) is never freed at process exit. These
objects are allocated when the S3 client initializes HMAC via
`AWS SDK -> HMAC_Init_ex -> EVP_MD_fetch`. LSan reports them as
leaks, causing the server to abort with SIGABRT (`halt_on_error=1`),
which `clickhouse-test` records as "Unknown error / Lost connection
to server". Add suppressions to `__lsan_default_suppressions`.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…ation tests

Combining ASan and UBSan into a single build (PR #99657) runs UBSan in integration tests for the first time, exposing two pre-existing issues:

1. Arrow IPC buffers (`array_binary.h`): `BaseBinaryArray::value_offset` reads offset arrays that may not be 4-byte aligned relative to the IPC buffer base. This is intentional in the Arrow format and safe on x86; suppress the misaligned-load check for that file.

2. EmbeddedRocksDB full scan (`BufferBase::BufferBase`): when iterating over a RocksDB table in read-only restart mode, `rocksdb::Slice::size()` returns `0xFFFFFFFFFFFFFFFE` (corrupted), causing pointer arithmetic to overflow in `BufferBase`. This is a real pre-existing memory corruption bug specific to `test_rocksdb_read_only`; suppress here until it is fixed.

Issue: #99862
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…fect

The previous suppressions were ineffective:

1. `src:*/contrib/arrow/cpp/src/arrow/array/array_binary.h*`: the path glob did not match the absolute path used during compilation; broadened to `src:*/contrib/arrow/*` to cover all Arrow contrib sources.

2. `fun:*DB::BufferBase::BufferBase*`: per the file's own caveats, `fun:` does not work for inlined functions, and `BufferBase::BufferBase` is a header-only constructor that is always inlined at call sites. Replaced with `src:*/src/IO/BufferBase.h`.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
UBSan's compile-time `src:` suppression for inlined functions must target the translation unit (the .cpp file) where the inlined code is compiled, not the header file where it is defined. All the inlined constructors in the call chain compile down to instructions attributed to the outer .cpp file.

- Arrow: `arrow::BaseBinaryArray::value_offset` (misaligned load in `array_binary.h`) is inlined into `ArrowColumnToCHColumn.cpp`. Suppress that translation unit.
- RocksDB: `BufferBase::BufferBase` and friends (pointer overflow) are inlined through `ReadBufferFromString` → `KVStorageUtils.h` into `StorageEmbeddedRocksDB.cpp`. Suppress that translation unit.

Issue: #99862
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The ubsan_ignorelist.txt approach was ineffective because sccache keys its object-file cache on compiler arguments (including the ignorelist path) but not on the ignorelist file content. Changing the ignorelist without changing source files serves stale cached objects, leaving the UBSan instrumentation in place.

Fix by adding NO_SANITIZE_UNDEFINED directly to the affected functions, which changes the preprocessed source content and forces sccache to recompile:

1. ArrowColumnToCHColumn.cpp, `readColumnWithStringData<ArrowArray>`: Arrow IPC offset arrays may be unaligned (misaligned-load). This is intentional in the Arrow IPC format and safe on x86.

2. StorageEmbeddedRocksDB.cpp, `EmbeddedRocksDBSource::generateFullScan`: inlined BufferBase pointer arithmetic overflows when a RocksDB slice returns a corrupted size. Pre-existing bug tracked in: #99862

Also revert the now-unnecessary src: entries added to ubsan_ignorelist.txt in the previous attempts.
…geEmbeddedRocksDB

Source annotations (NO_SANITIZE_UNDEFINED) were ineffective because the UBSan checks for inlined callees are attributed to the callee's source location, not the annotated caller. ubsan_ignorelist.txt was also ineffective because sccache does not include ignorelist file content in its cache key.

Fix: use set_source_files_properties in CMakeLists.txt to pass -fno-sanitize=<check> flags for specific translation units. Changing compiler flags is part of sccache's cache key, guaranteeing a recompile.

- ArrowColumnToCHColumn.cpp: -fno-sanitize=alignment. Arrow IPC offset arrays are intentionally unaligned (packed int32); the misaligned read is safe on x86.
- StorageEmbeddedRocksDB.cpp: -fno-sanitize=pointer-overflow. Pre-existing bug: a corrupted RocksDB slice (size = 0xFFFFFFFFFFFFFFFE) causes pointer arithmetic to overflow in BufferBase when iterating in read-only mode. Tracked in: #99862

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Using relative paths in `set_source_files_properties` was not reliably
targeting the correct source files. Switch to `${CMAKE_CURRENT_SOURCE_DIR}`
absolute paths to ensure the suppression flags are applied.
Also remove `NO_SANITIZE_UNDEFINED` source annotations from
`ArrowColumnToCHColumn.cpp` and `StorageEmbeddedRocksDB.cpp` now that
the CMake approach is fixed.
Remove `amd_release` and `arm_asan_ubsan` targeted integration test
job configs from `job_configs.py` as these artifact variants don't
exist yet.
CI: #99657
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…ol function

Convert coverage addresses to file offsets by subtracting the binary's load base, so that dumps are portable across processes with different ASLR bases. This matches how `addressToSymbol` works: the fallback path in `SymbolIndex::findSymbol` treats values as raw file offsets when they're not in any mapped object.

Add `SymbolIndex::diagnose` to explain why symbol resolution succeeds or fails: returns `"found:<name>"`, `"no_object"`, or `"no_symbol[object=<path>:offset=0x<hex>]"`.

Add a `debugAddressToSymbol` SQL function backed by `diagnose`, useful for investigating why `coverageCurrent()` produces empty symbols.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…mands

Add stateless tests to cover previously-untested code paths:

- `04035_json_extract_null_handling`: all `JSONExtract` null-value paths in `JSONExtractTree.cpp` (numeric, string, UUID, date/time, decimal, IP, enum, array, tuple nodes) and the `jsonElementToString` null case
- `04036_system_commands_coverage`: `SYSTEM` commands that had no test coverage (`CLEAR_PRIMARY_INDEX_CACHE`, `CLEAR_PAGE_CACHE`, `START/STOP_THREAD_FUZZER`, `RELOAD_FUNCTIONS`, `RELOAD_EMBEDDED_DICTIONARIES`)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
When targeted integration tests run with `--count 10 --random-order`, the same test functions execute 10 times in a single cluster session. `test_read_only` and `test_dirctory_missing_after_stop` were not idempotent: a failed run would leave tables or RocksDB lock files behind, causing all subsequent runs to fail:

- After a clean run of `test_read_only`, the directory `/var/lib/clickhouse/user_files/test_rocksdb_read_only` persists on disk (dropping a table does not remove its RocksDB directory). In the next run, `CREATE TABLE ... read_only=1` on that path succeeds instead of raising, because a valid database now exists in the directory.
- A mid-run failure in either test can leave a table alive, keeping the RocksDB LOCK file held by the server process. Subsequent runs then fail with "lock hold by current process".

Fix: add cleanup at the start of each test that drops any leftover tables and removes the RocksDB directories before the test body runs.

CI: #99657
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
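A cleanup of this shape could look like the following sketch. It is a hedged illustration only: `node.query`, the helper name, and the paths stand in for the real integration-test fixtures, which differ.

```python
import os
import shutil

def cleanup_rocksdb_state(node, table, rocksdb_dir):
    """Drop leftovers from a previous (possibly failed) run so the
    test body starts from a clean slate and is idempotent."""
    # Dropping the table releases the RocksDB LOCK held by the server.
    node.query(f"DROP TABLE IF EXISTS {table} SYNC")
    # Dropping a table does not remove its RocksDB directory, so a stale
    # valid database could make the next CREATE behave differently.
    if os.path.exists(rocksdb_dir):
        shutil.rmtree(rocksdb_dir)
```

Running this at the top of each test function (rather than in teardown) also covers the case where the previous run crashed before any teardown could execute.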
…ssToSymbol function" This reverts commit 75df705.
…stem commands" This reverts commit 38c7f78.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
        requires=[ArtifactNames.UNITTEST_AMD_UBSAN],
    ),
)
stress_test_jobs = common_stress_job_config.parametrize(
The stress_test_jobs matrix was narrowed more than the PR scope describes: this hunk drops amd_release and the non-s3 ARM sanitizer stress variant, leaving only amd_debug, amd_tsan, amd_msan, and arm_asan_ubsan, s3.
That reduces stress coverage for release binaries and plain ARM sanitizer runs across all workflows that consume JobConfigs.stress_test_jobs (PR, master, release). Please either restore those jobs, or document why this coverage reduction is intentional.
…ted checks" This reverts commit fc56ac5.
When targeted integration tests run with --count 10, pytest-xdist spawns multiple workers that all reuse the same ClickHouseCluster (same Docker container, same server process). RocksDB keeps a per-process in-memory lock map: once a path is open, any concurrent Open() from another thread in the same server fails with "lock hold by current process".

Fix: derive a per-worker suffix from PYTEST_XDIST_WORKER so each worker uses distinct RocksDB paths and table names. Also add cleanup at the start of each test function to handle leftover state from a failed previous run on the same worker (sequential reruns within one worker).

CI: #99657
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
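The per-worker suffix could be derived as in this sketch. It is illustrative only: the helper names and the `"gw0"` fallback for non-xdist runs are assumptions, not the actual test code. pytest-xdist does set `PYTEST_XDIST_WORKER` to `gw0`, `gw1`, ... in its worker processes.

```python
import os

def worker_suffix():
    # "gw0", "gw1", ... under pytest-xdist; fall back to a fixed value
    # when running without workers so paths stay deterministic.
    return os.environ.get("PYTEST_XDIST_WORKER", "gw0")

def rocksdb_path(base):
    # Distinct per-worker path: avoids RocksDB's per-process lock map
    # rejecting a concurrent Open() of the same path from another worker.
    return f"{base}_{worker_suffix()}"

def table_name(base):
    return f"{base}_{worker_suffix()}"
```

Each worker then operates on e.g. `test_rocksdb_read_only_gw0` vs `test_rocksdb_read_only_gw1`, so concurrent opens never collide inside the shared server process.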
Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
When a Buffer is default-constructed (_ptr = nullptr, _used = 0), calling setCapacity with newCapacity > 0 and preserveContent = true computes newSz = 0 and then calls memcpy(dst, nullptr, 0). Passing a null pointer to memcpy is undefined behaviour even when the size is 0 (the nonnull attribute on argument 2 applies regardless of size). UBSan correctly flags this, and with abort_on_error=1 the ClickHouse server process aborts during Poco::CompressedLogFile construction before binding port 9000, causing integration tests to time out.

Fix: skip the memcpy when newSz == 0.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
● Yes, it's a real (though benign) bug in ClickHouse's vendored Poco. Look at line 158:

    std::memcpy(ptr, _ptr, newSz * sizeof(T));

The C standard says memcpy arguments must be valid pointers regardless of size (the nonnull attribute on argument 2 confirms this), so UBSan is correct to flag it. In practice it's harmless, but with abort_on_error=1 it aborts the process. The fix is a one-liner in base/poco/Foundation/include/Poco/Buffer.h.

Fix: 9bd11e0
All test jobs previously using amd_asan_ubsan now use arm_asan_ubsan:

- stateless tests flaky check
- functional tests distributed plan parallel/sequential
- integration tests db disk (master + required + flaky + targeted)

Corresponding runner labels and artifact names are updated to the ARM variants. The backport_branches.py build filter is extended to include arm_asan_ubsan.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
This reverts commit 707d72b.
LLVM Coverage Report

PR changed-lines coverage: N/A (no coverable changed lines found in .info).
10 jobs fewer, actually.
    set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${SAN_FLAGS} ${UBSAN_FLAGS}")
    set (CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${SAN_FLAGS} ${UBSAN_FLAGS}")

elseif (SANITIZE STREQUAL "address,undefined")
This is error-prone; can't we just use MATCHES for `address` and `undefined`, or do they have some incompatible flags?
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes
Combines the ASan and UBSan sanitizer builds (`AMD_ASAN_UBSAN`, `ARM_ASAN_UBSAN`) since they are fully compatible in clang (`-fsanitize=address,undefined`). Removes the standalone `AMD_UBSAN` build (now redundant) and disables the `ARM_TSAN` build job (data races are architecture-independent, so `AMD_TSAN` is sufficient).