Fix TOO_MANY_ROWS exception during exact count optimization with multiple parts#100408
Fix TOO_MANY_ROWS exception during exact count optimization with multiple parts#100408alexey-milovidov wants to merge 18 commits intomasterfrom
Conversation
…iple parts When `optimizeUseAggregateProjection` calls `selectRangesToRead(find_exact_ranges=true)` for the exact count optimization (used for `SELECT count()` queries), the `max_rows_to_read` check in `filterPartsByPrimaryKeyAndSkipIndexes` could throw `TOO_MANY_ROWS` for tables without user-defined projections. This happened because `has_projections` was false, so the graceful exit path (setting `exceeded_row_limits = true` and returning) was not taken. Instead, the exception was thrown immediately, even though the exact count optimization could serve the query by computing counts directly from the primary key analysis without reading all the data. The fix includes `find_exact_ranges` in the `has_projections` flag, consistent with how `support_projection_optimization` already treats it. This allows the exact count optimization to proceed: it computes exact counts from primary key ranges and only needs to read the small number of boundary granules, which are well within the limit. The issue was triggered when CI randomized MergeTree settings caused data to be split into multiple parts with non-aligned granule boundaries, making the total estimated rows slightly exceed the `max_rows_to_read` limit. #100129 Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
|
Workflow [PR], commit [7fa82a6] Summary: ❌
AI ReviewSummaryThis PR fixes a real planner/runtime mismatch for exact-count optimization: when ClickHouse Rules
Final Verdict
|
| /// on the full table scan should not cause an immediate failure. | ||
| bool projection_parts_exist = std::any_of(res_parts.begin(), res_parts.end(), [](const auto & part) { return part.data_part->isProjectionPart(); }); | ||
| bool has_projections = metadata_snapshot->hasProjections() || projection_parts_exist; | ||
| bool has_projections = metadata_snapshot->hasProjections() || projection_parts_exist || find_exact_ranges; |
There was a problem hiding this comment.
SELECT count() with max_rows_to_read / force_primary_key across multiple parts with non-aligned granule boundaries.
Please add a new stateless test (preferably in tests/queries/0_stateless) that reproduces the previous TOO_MANY_ROWS exception and verifies the query succeeds after this fix. Without a dedicated test, this exact planner-path regression is likely to reappear.
The exact count optimization can now compute `count()` results from primary key analysis without reading data, so `count()` queries bypass the `max_rows_to_read` limit on MergeTree tables. Update the test to expect success for `count()` queries and add `SELECT *` tests to verify that non-count queries still respect the limit. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
The exact count optimization that bypasses `max_rows_to_read` only works when `optimize_use_projections` is enabled. Since CI randomizes this setting, the test must keep the original `TOO_MANY_ROWS` expectations for `count()` queries on MergeTree tables. The C++ fix in the previous commit is still correct — it prevents the spurious exception when the projection optimization IS running (as in the 04004 test). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
… optimization The exact count optimization that bypasses `max_rows_to_read` requires `optimize_use_projections = 1`. Set it explicitly so the test works even when CI randomizes it to false. Also add `SELECT *` queries to verify that non-count queries still fail fast with `TOO_MANY_ROWS`. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…jections` The exact count optimization code path in `getAggregateProjectionCandidates` requires `can_use_minmax_projection` to be true before it will set `only_count_column`. This flag depends on `optimize_use_implicit_projections`. When CI randomizes `optimize_use_implicit_projections = false`, the function returns early (no implicit projections + no explicit aggregate projections), and the exact count optimization never runs. The count query then hits `max_rows_to_read` and throws `TOO_MANY_ROWS`. Also add a dedicated regression test for the original issue (#100129). Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ct-count-projection
…ct-count-projection
…ary-granule reads The exact count optimization still needs to read boundary granules at range edges. With `index_granularity = 128` and a BETWEEN condition spanning two boundaries, 256 rows of boundary granules need to be read. The previous `max_rows_to_read = 100` was too low, causing execution-time `TOO_MANY_ROWS`. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ct-count-projection
…ct-count-projection
…ct-count-projection
…oss multiple parts Addresses review feedback: adds a dedicated stateless test that exercises `SELECT count()` with `max_rows_to_read` / `force_primary_key` when data is split across multiple parts with non-aligned granule boundaries. Before the fix in this PR, this would throw `TOO_MANY_ROWS`. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ct-count-projection
The `SELECT *` without a WHERE clause throws `INDEX_NOT_USED` instead of `TOO_MANY_ROWS` when `force_primary_key` is set. Add a WHERE on the primary key that covers all rows so the query reaches the row limit check. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…ct-count-projection
…ct-count-projection
LLVM Coverage Report
Changed lines: 100.00% (11/11) · Uncovered code |
When
optimizeUseAggregateProjectioncallsselectRangesToRead(find_exact_ranges=true)for the exact count optimization (used forSELECT count()queries), themax_rows_to_readcheck infilterPartsByPrimaryKeyAndSkipIndexescould throwTOO_MANY_ROWSfor tables without user-defined projections.This happened because
has_projectionswasfalse, so the graceful exit path (settingexceeded_row_limits = trueand returning) was not taken. Instead, the exception was thrown immediately, even though the exact count optimization could serve the query by computing counts directly from the primary key analysis without reading all the data.The fix includes
find_exact_rangesin thehas_projectionsflag, consistent with howsupport_projection_optimizationalready treats it. This allows the exact count optimization to proceed: it computes exact counts from primary key ranges and only needs to read the small number of boundary granules, which are well within the limit.The issue was triggered when CI randomized MergeTree settings caused data to be split into multiple parts with non-aligned granule boundaries, making the total estimated rows slightly exceed the
max_rows_to_readlimit.Fixes #100129
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix
TOO_MANY_ROWSexception forSELECT count()queries withmax_rows_to_read/force_primary_keywhen data is split across multiple parts with non-aligned granule boundaries.Documentation entry for user-facing changes