Merge branch 'antalya-26.1' into frontport/antalya-26.1/rendezvous_hashing

ianton-ru · web-flow · commit 7e9b163c6b2c · 2026-03-04T23:47:44.000+01:00
diff --git a/.cursor/rules/audit-review.mdc b/.cursor/rules/audit-review.mdc
@@ -0,0 +1,145 @@
+---
+description: Standardize deep feature-audit output and defect reporting
+alwaysApply: true
+---
+
+# Feature Audit Reporting Standard
+
+Use this format when the user asks for a deep audit, fault injection, or review of any feature/change.
+
+## Required Output
+
+- Report **confirmed defects only** first.
+- Classify each finding as **High**, **Medium**, or **Low**.
+- For each finding include:
+  - short title,
+  - concrete impact,
+  - exact file/function reference,
+  - brief proof sketch tied to code path,
+  - at least one **code snippet** that demonstrates the defect condition.
+- Include an **Assumptions & Limits** section for static reasoning:
+  - what was not executed at runtime,
+  - what could not be proven without dynamic testing.
+- Include **audit confidence**:
+  - overall confidence (High/Medium/Low),
+  - what additional evidence would raise confidence.
+
+## Severity Rubric (Required)
+
+- Use consistent severity scoring with these dimensions:
+  - impact on correctness/security/availability,
+  - likelihood under realistic workloads,
+  - blast radius,
+  - exploitability (if security-relevant).
+- Default guidance:
+  - **High**: crash/UB/data corruption/auth bypass/deadlock with realistic trigger.
+  - **Medium**: incorrect behavior or reliability risk requiring narrower preconditions.
+  - **Low**: diagnostics/consistency/maintainability issues without direct correctness break.
+
+## Required Analysis Dimensions
+
+- For very large PRs, require **functional partitioning** before deep analysis:
+  - split review scope into functionality/workstream partitions,
+  - run the full audit loop per partition,
+  - produce per-partition findings and coverage,
+  - deduplicate cross-partition findings by root cause,
+  - end with cross-partition interaction risks and overall summary.
+- Compute and document a **call graph** for changed code before defect analysis:
+  - entrypoints,
+  - dispatch/validation chain,
+  - state/cache/storage interactions,
+  - integration boundaries (network/filesystem/external services),
+  - error/exception propagation paths.
+- Cover full transition flow when relevant:
+  - entrypoint -> processing -> state updates -> outputs/side effects.
+- Include **logical testing of all code paths**:
+  - enumerate reachable branches in changed logic,
+  - define expected outcome per branch (success, handled failure, fail-open/fail-closed, exception),
+  - include malformed input, timeout/integration failure, and concurrency/timing branches.
+- Include **fault category planning before injection**:
+  - define feature-specific logical fault categories from the reviewed code,
+  - list category scope and key transitions affected,
+  - then execute fault injection **category by category** and report findings per category.
+- Require a **full fault-category completion matrix** on every deep audit:
+  - include all generated categories,
+  - process categories one-by-one in explicit order,
+  - mark each category as Executed / Not Applicable / Deferred,
+  - record pass/fail outcome and defects found per category,
+  - provide justification for every Not Applicable or Deferred category.
+- Include **invariant-first analysis**:
+  - define key invariants that must always hold,
+  - map each critical transition to invariant preservation checks.
+- Include **interleaving analysis** for multithreaded paths:
+  - document at least several plausible thread interleavings for shared-state transitions.
+- Include a **transition mapping** from defect to state transition.
+- Include a **logical fault-injection mapping** that shows which injected condition triggers each defect.
+- Include integration impact checks:
+  - config/load-time behavior,
+  - protocol/API behavior,
+  - concurrency/timing behavior,
+  - observability/logging behavior.
+- Include **coverage accounting**:
+  - call-graph nodes covered vs not covered,
+  - transitions reviewed vs not reviewed,
+  - fault categories executed vs skipped (with reason).
+- Include an explicit **coverage stop condition**:
+  - coverage is complete only when each in-scope call-graph node/transition/category is either reviewed or marked skipped with justification.
+- Include **error-contract consistency checks**:
+  - equivalent faults should have equivalent outcomes where intended (reject/exception/error code).
+- Include **performance/resource failure checks**:
+  - high-cardinality input, memory pressure implications, retry storms, lock contention hotspots.
+- Include **rollback/partial-update checks**:
+  - for each mutation sequence, verify state remains consistent if exceptions/cancellation occur mid-path.
+- For C++ codebases, include **major C++ bug-type coverage**:
+  - memory lifetime/use-after-free/use-after-move,
+  - iterator/reference invalidation,
+  - data races and lock-order/deadlock risks,
+  - exception-safety/partial-update hazards,
+  - integer overflow/underflow and signedness errors,
+  - ownership/resource leaks (RAII violations),
+  - undefined behavior from invalid casts/aliasing/lifetime.
+
+## Guardrails
+
+- Do not mix confirmed defects with hypotheticals.
+- Mark uncertain items explicitly as “not confirmed”.
+- Use fail-open/fail-closed language for security-sensitive paths when applicable.
+- Keep summaries concise and actionable.
+- For each confirmed defect, include minimal evidence schema:
+  - trigger condition,
+  - affected transition,
+  - why this is a defect (not just a design preference).
+- For each confirmed defect, also include:
+  - smallest logical reproduction steps,
+  - likely fix direction (one line),
+  - regression test direction (one line),
+  - affected subsystem and blast radius,
+  - code evidence snippet(s) from the referenced file(s).
+- Deduplicate findings by root cause:
+  - one primary defect per root cause, with secondary manifestations listed under it.
+- If no defects are found, explicitly report residual risks and untested paths.
+
+## Canonical Report Order (Required)
+
+1. Scope and partitions (if large PR)
+2. Call graph
+3. Transition matrix
+4. Logical code-path testing summary
+5. Fault categories and category-by-category injection results
+6. Confirmed defects (High/Medium/Low)
+7. Coverage accounting + stop-condition status
+8. Assumptions & Limits
+9. Confidence rating and confidence-raising evidence
+10. Residual risks and untested paths
+
+## Multithreaded DB Priority
+
+For this repository, prioritize concurrency/locking defects early in review because they can cause correctness failures, hangs, and crashes under production load.
+
+- Always check for:
+  - unsynchronized shared-state access and data races,
+  - lock-order inversions and deadlock potential,
+  - iterator/reference invalidation across concurrent mutation,
+  - exception paths that leave shared state partially updated,
+  - shutdown/reload races and stale-pointer/lifetime hazards.
+- Escalate race/deadlock/crash findings with high severity by default unless strong evidence shows limited impact.
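Review note: the coverage stop condition in the rule file above ("reviewed or marked skipped with justification") can be checked mechanically. The following is a minimal sketch, not part of this commit. The names `Status`, `CoverageItem`, and `coverage_complete` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REVIEWED = "reviewed"
    SKIPPED = "skipped"
    PENDING = "pending"


@dataclass
class CoverageItem:
    # Hypothetical record for one in-scope call-graph node, transition, or fault category.
    kind: str
    name: str
    status: Status = Status.PENDING
    justification: str = ""


def coverage_complete(items: list[CoverageItem]) -> bool:
    """True only if every item is reviewed, or skipped with a non-empty justification."""
    return all(
        item.status is Status.REVIEWED
        or (item.status is Status.SKIPPED and bool(item.justification.strip()))
        for item in items
    )


items = [
    CoverageItem("node", "entrypoint", Status.REVIEWED),
    CoverageItem("category", "timeout handling", Status.SKIPPED, "no network calls in scope"),
    CoverageItem("transition", "cache update", Status.SKIPPED),  # skipped without justification
]
print(coverage_complete(items))
```

With the sample data the check fails because the last item is skipped without a reason, which is the case the rule is meant to catch.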
diff --git a/.cursor/skills/audit-review/SKILL.md b/.cursor/skills/audit-review/SKILL.md
@@ -0,0 +1,179 @@
+---
+name: audit-review
+description: Perform deep feature audits with transition-matrix and logical fault-injection validation. Use when reviewing complex changes, regressions, state-machine behavior, config interactions, API/protocol flows, and concurrency-sensitive logic.
+---
+
+# Audit Review
+
+## Purpose
+
+Run a repeatable deep audit for any feature and report confirmed defects with severity.
+Default mode is static reasoning unless runtime execution is explicitly performed.
+
+## Workflow
+
+1. If PR scope is large, partition by functionality/workstream first:
+   - define partitions and boundaries,
+   - review each partition independently with the full workflow below,
+   - track per-partition findings and coverage,
+   - deduplicate cross-partition findings by root cause,
+   - finish with cross-partition interaction risks.
+2. Build call graph first:
+   - user/system entrypoints (API, RPC, CLI, worker, scheduler)
+   - dispatch and validation layers
+   - state/storage/cache interactions
+   - downstream integrations (network, filesystem, service calls)
+   - exception and error-propagation paths
+3. Build transition matrix:
+   - request/event entry -> processing stages -> state changes -> outputs/side effects
+   - define key invariants and annotate where each transition must preserve them
+4. Perform logical testing of all code paths:
+   - enumerate all reachable branches in changed logic,
+   - record expected branch outcomes (success, handled failure, fail-open/fail-closed, exception),
+   - include happy path, malformed input, integration timeout/failure, and concurrency/timing branches.
+5. Define logical fault categories from the code under review:
+   - derive categories from actual components, transitions, and dependencies in scope,
+   - document category boundary and affected states/transitions,
+   - prioritize categories by risk and blast radius.
+6. Run logical fault injection category-by-category:
+   - execute one category at a time,
+   - for each category cover success/failure/edge/concurrency paths as applicable,
+   - record pass/fail-open/fail-closed/exception behavior per injected fault.
+   - maintain a category completion matrix with status:
+     - Executed / Not Applicable / Deferred,
+     - outcome,
+     - defects found,
+     - justification for Not Applicable or Deferred.
+7. Confirm each finding with code-path evidence.
+8. Produce coverage accounting:
+   - reviewed vs unreviewed call-graph nodes,
+   - reviewed vs unreviewed transitions,
+   - executed vs skipped fault categories (with reasons).
+   - mark coverage complete only when every in-scope node/transition/category is reviewed or explicitly skipped with justification.
+9. For multithreaded/shared-state paths, perform interleaving analysis:
+   - write several plausible thread interleavings per critical transition,
+   - identify race/deadlock/lifetime hazards per interleaving.
+10. For mutation-heavy paths, perform rollback/partial-update analysis:
+   - reason about exception/cancellation at intermediate points,
+   - verify state invariants still hold.
+
+## C++ Bug-Type Coverage (Required for C++ audits)
+
+- memory lifetime defects (use-after-free/use-after-move/dangling refs)
+- iterator/reference invalidation
+- data races and lock-order/deadlock risks
+- exception-safety and partial-update rollback hazards
+- integer overflow/underflow and signedness conversion bugs
+- ownership/resource leaks (RAII violations)
+- undefined behavior from invalid casts/aliasing/lifetime misuse
+
+## Multithreaded Database Emphasis
+
+For ClickHouse-style multithreaded systems, prioritize these checks before lower-risk issues:
+
+1. Shared mutable state touched by multiple threads without clear synchronization.
+2. Lock hierarchy consistency and potential lock-order inversion/deadlock cycles.
+3. Cross-thread lifetime safety (dangling references/pointers after erase/reload/shutdown).
+4. Concurrent container mutation + iterator/reference use.
+5. Exception/cancellation paths that can leave locks/state inconsistent.
+
+## Output Contract
+
+- Start with confirmed defects only.
+- Group by severity: High, Medium, Low.
+- For each defect include:
+  - title,
+  - impact,
+  - file/function anchor,
+  - fault-injection trigger,
+  - transition mapping,
+  - why it is a defect (not a design preference),
+  - smallest logical repro steps,
+  - likely fix direction (short, concrete: 2-4 bullets or sentences),
+  - regression test direction (short, concrete: 2-4 bullets or sentences),
+  - affected subsystem and blast radius,
+  - at least one code snippet proving the defect.
+- Separate “not confirmed” or “needs runtime proof” from confirmed defects.
+- Include an **Assumptions & Limits** section for static reasoning.
+- Include an overall **confidence rating** and what additional evidence would raise confidence.
+- If no defects are found, include residual risks and untested paths.
+- For large PRs, include per-partition findings/coverage and final cross-partition risk summary.
+- Include a fault-category completion matrix for every deep audit.
+
+### Canonical report order
+
+1. Scope and partitions (if large PR)
+2. Call graph
+3. Transition matrix
+4. Logical code-path testing summary
+5. Fault categories and category-by-category injection results
+6. Confirmed defects (High/Medium/Low)
+7. Coverage accounting + stop-condition status
+8. Assumptions & Limits
+9. Confidence rating and confidence-raising evidence
+10. Residual risks and untested paths
+
+## Standard Audit Report Template (Default: Pointed PR Style)
+
+Default report style should match concise PR review comments:
+- fail-first and action-oriented,
+- only confirmed defects (no pass-by-pass narrative),
+- one short summary line when there are no confirmed defects.
+
+Use the compact template below by default. Use the full 10-section canonical format only when explicitly requested.
+
+```markdown
+Audit update for PR #<id> (<short title/scope>):
+
+Confirmed defects:
+
+- **<Severity>: <short defect title>**
+  - Impact: <concrete user/system impact>
+  - Anchor: `<file>` / `<function or code path>`
+  - Trigger: <smallest condition that triggers defect>
+  - Why defect: <1-2 lines, behavior not preference>
+  - Fix direction (short): <2-4 bullets or sentences>
+  - Regression test direction (short): <2-4 bullets or sentences including positive and edge/failure cases>
+  - Evidence:
+    ```start:end:path
+    // minimal proving snippet
+    ```
+
+<repeat per defect, sorted High -> Medium -> Low>
+
+Coverage summary:
+- Scope reviewed: <partitions or key areas, one line>
+- Categories failed: <count/list>
+- Categories passed: <count only>
+- Assumptions/limits: <one line>
+```
+
+## Severity Rubric
+
+- High: realistic trigger can cause crash/UB/data corruption/auth bypass/deadlock.
+- Medium: correctness/reliability issue with narrower trigger conditions.
+- Low: diagnostics/consistency issues without direct correctness break.
+
+## Checklist
+
+- Verify call graph is explicitly documented before defect analysis.
+- Verify invariants are explicitly listed and checked against transitions.
+- Verify fail-open vs fail-closed behavior where security-sensitive.
+- Verify logical branch coverage for all changed code paths.
+- Verify fault categories are explicitly defined from the reviewed code before injection starts.
+- Verify category-by-category execution and reporting completeness.
+- Verify full fault-category completion matrix is present and complete.
+- Verify concurrency and cache/state transition paths.
+- Verify multithreaded interleavings are explicitly analyzed for critical shared-state paths.
+- Verify rollback/partial-update safety under exception/cancellation points.
+- Verify major C++ bug classes are explicitly covered (or marked not applicable).
+- Verify race/deadlock/crash class defects are prioritized and explicitly reported.
+- Verify error-contract consistency across equivalent fault paths.
+- Verify performance/resource failure classes were considered.
+- Verify findings are deduplicated by root cause.
+- Verify coverage accounting is present (covered vs skipped with reason).
+- Verify stop-condition criteria for coverage completion are explicitly satisfied.
+- Verify every confirmed defect includes code evidence snippets.
+- Verify parser/config/runtime consistency.
+- Verify protocol/API parity across entrypoints.
+- Verify no sensitive-data leakage in logs/errors.
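Review note: the skill asks for findings to be deduplicated by root cause and sorted High -> Medium -> Low. A minimal sketch of that grouping follows, assuming each finding is a dict with hypothetical `title`, `severity`, and `root_cause` keys. Picking the highest-severity finding as the primary one is an assumption of this sketch, since the skill does not say how the primary is chosen.

```python
SEVERITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}


def dedupe_by_root_cause(findings):
    """Return one primary finding per root cause, with secondary titles attached."""
    groups = {}
    for finding in findings:
        groups.setdefault(finding["root_cause"], []).append(finding)

    report = []
    for group in groups.values():
        # Assumption: the most severe finding in a group becomes the primary one.
        ranked = sorted(group, key=lambda f: SEVERITY_ORDER[f["severity"]])
        primary = dict(ranked[0])
        primary["secondary"] = [f["title"] for f in ranked[1:]]
        report.append(primary)

    # Final report order: High first, then Medium, then Low.
    return sorted(report, key=lambda f: SEVERITY_ORDER[f["severity"]])


findings = [
    {"title": "Stale iterator after reload", "severity": "Medium", "root_cause": "unlocked map"},
    {"title": "Misleading log message", "severity": "Low", "root_cause": "log format"},
    {"title": "Crash on concurrent erase", "severity": "High", "root_cause": "unlocked map"},
]
for entry in dedupe_by_root_cause(findings):
    print(entry["severity"], entry["title"], entry["secondary"])
```

The two "unlocked map" findings collapse into a single High entry that lists the Medium manifestation under it.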
diff --git a/.github/workflows/merge_queue.yml b/.github/workflows/merge_queue.yml
@@ -182,7 +182,7 @@ jobs:
           fi
 
   fast_test:
-    runs-on: [self-hosted, altinity-on-demand, altinity-func-tester]
+    runs-on: [self-hosted, altinity-on-demand, altinity-builder]
     needs: [config_workflow, dockers_build_amd, dockers_build_arm]
     if: ${{ !cancelled() && !contains(needs.*.outputs.pipeline_status, 'failure') && !contains(needs.*.outputs.pipeline_status, 'undefined') && !contains(fromJson(needs.config_workflow.outputs.data).workflow_config.cache_success_base64, 'RmFzdCB0ZXN0') }}
     name: "Fast test"
diff --git a/.github/workflows/pull_request.yml b/.github/workflows/pull_request.yml
@@ -241,7 +241,7 @@ jobs:
           fi
 
   fast_test:
-    runs-on: [self-hosted, altinity-on-demand, altinity-func-tester]
+    runs-on: [self-hosted, altinity-on-demand, altinity-builder]
     needs: [config_workflow, dockers_build_amd, dockers_build_arm, dockers_build_multiplatform_manifest]
     if: ${{ !cancelled() && !contains(needs.*.outputs.pipeline_status, 'failure') && !contains(needs.*.outputs.pipeline_status, 'undefined') && !contains(fromJson(needs.config_workflow.outputs.data).workflow_config.cache_success_base64, 'RmFzdCB0ZXN0') }}
     name: "Fast test"
diff --git a/.github/workflows/pull_request_community.yml b/.github/workflows/pull_request_community.yml
@@ -87,7 +87,7 @@ jobs:
           fi
 
   fast_test:
-    runs-on: [self-hosted, altinity-on-demand, altinity-func-tester]
+    runs-on: [self-hosted, altinity-on-demand, altinity-builder]
     needs: [config_workflow]
     if: ${{ !cancelled() && !contains(needs.*.outputs.pipeline_status, 'failure') && !contains(needs.*.outputs.pipeline_status, 'undefined') && !contains(fromJson(needs.config_workflow.outputs.data).workflow_config.cache_success_base64, 'RmFzdCB0ZXN0') }}
     name: "Fast test"
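Review note: the token `RmFzdCB0ZXN0` in the `if:` conditions above is the Base64 encoding of the job name `Fast test`. The sketch below reproduces the token and mimics the `contains(...)` check. It assumes `cache_success_base64` is a list of encoded job names and that a match means the job is skipped. Both are inferences from the workflow expression, and `should_run` is a hypothetical helper.

```python
import base64


def encode_job_name(job_name: str) -> str:
    """Base64-encode a job name the way the workflow token appears to be built."""
    return base64.b64encode(job_name.encode("utf-8")).decode("ascii")


def should_run(job_name: str, cache_success_base64: list[str]) -> bool:
    # Assumption: a job whose encoded name is in the cache list already succeeded.
    return encode_job_name(job_name) not in cache_success_base64


token = encode_job_name("Fast test")
print(token)  # RmFzdCB0ZXN0
print(should_run("Fast test", [token]))
print(should_run("Fast test", []))
```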
diff --git a/ci/defs/job_configs.py b/ci/defs/job_configs.py
@@ -137,7 +137,7 @@ class JobConfigs:
     )
     fast_test = Job.Config(
         name=JobNames.FAST_TEST,
-        runs_on=RunnerLabels.AMD_LARGE,
+        runs_on=RunnerLabels.BUILDER_AMD,
         command="python3 ./ci/jobs/fast_test.py",
         # --network=host required for ec2 metadata http endpoint to work
         # --root/--privileged/--cgroupns=host is required for clickhouse-test --memory-limit

Original file line number	Diff line number	Diff line change
`@@ -137,7 +137,7 @@ class JobConfigs:`
`137`	`137`	`)`
`138`	`138`	`fast_test = Job.Config(`
`139`	`139`	`name=JobNames.FAST_TEST,`
`140`		`- runs_on=RunnerLabels.AMD_LARGE,`
	`140`	`+ runs_on=RunnerLabels.BUILDER_AMD,`
`141`	`141`	`command="python3 ./ci/jobs/fast_test.py",`
`142`	`142`	`# --network=host required for ec2 metadata http endpoint to work`
`143`	`143`	`# --root/--privileged/--cgroupns=host is required for clickhouse-test --memory-limit`