Skip to content

Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280

Open
Onyx2406 wants to merge 6 commits intoClickHouse:masterfrom
Onyx2406:feat/array-combinatorics
Open

Add arrayCombinations, arrayPermutations, arrayPartialPermutations functions#99280
Onyx2406 wants to merge 6 commits intoClickHouse:masterfrom
Onyx2406:feat/array-combinatorics

Conversation

@Onyx2406
Copy link
Copy Markdown
Contributor

@Onyx2406 Onyx2406 commented Mar 11, 2026

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added arrayCombinations(arr, k), arrayPermutations(arr), and arrayPartialPermutations(arr, k) functions for generating combinations and permutations of array elements.

Documentation entry for user-facing changes

  • Documentation is written (function descriptions and examples in code)

What this PR does

Adds three combinatorics functions for arrays, as proposed in the Intern Tasks 2025/2026:

Function Description Example
arrayCombinations(arr, k) All k-length unordered subsets arrayCombinations([1,2,3], 2)[[1,2],[1,3],[2,3]]
arrayPermutations(arr) All full permutations arrayPermutations([1,2,3])[[1,2,3],[1,3,2],[2,1,3],[2,3,1],[3,1,2],[3,2,1]]
arrayPartialPermutations(arr, k) All k-length ordered selections arrayPartialPermutations([1,2,3], 2)[[1,2],[1,3],[2,1],[2,3],[3,1],[3,2]]

Implementation (2 files, ~250 lines)

  • arrayCombinations.cpp — Iterative index-based algorithm for generating C(n,k) combinations
  • arrayPermutations.cpp — Template class handling both full (std::next_permutation) and partial (recursive swap-based) permutations
  • Follows the arrayShingles pattern: Array(T)Array(Array(T)) with dual offset management
  • Works with any element type (integers, strings, etc.)

Test plan

  • New stateless test 04043_array_combinatorics.sql with 12 test cases:
    • Combinations: k=1, k=2, k=3, k=0, k=n, string elements
    • Full permutations: n=3, n=1, empty array
    • Partial permutations: k=2, k=1, k=0
    • Error cases: k>n, k<0 (BAD_ARGUMENTS)
  • CI (full test suite)

Fixes #43175


Note

Medium Risk
Adds new array functions that can generate very large result sets; although capped at 1M elements, mistakes in size estimation or offsets could still cause performance/memory issues.

Overview
Introduces three new array combinatorics functions: arrayCombinations(arr, k), arrayPermutations(arr), and arrayPartialPermutations(arr, k), each returning Array(Array(T)) and validating k bounds.

All functions include result-size guards (1M element cap) to avoid OOM by precomputing combination/permutation counts and throwing TOO_LARGE_ARRAY_SIZE when exceeded.

Adds a new stateless query test (04043_array_combinatorics) covering normal cases (including k=0, empty arrays), error cases for invalid k, and size-limit behavior.

Written by Cursor Bugbot for commit 412859c. This will update automatically on new commits. Configure here.

Add three combinatorics functions for arrays:
- arrayCombinations(arr, k): all k-length unordered subsets
- arrayPermutations(arr): all full permutations
- arrayPartialPermutations(arr, k): all k-length ordered selections

Fixes ClickHouse#43175
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

…size limit

- Use used-flags approach instead of swap-based recursion for partial
  permutations to produce lexicographic output order
- k=0 now correctly returns [[]] (one empty selection) matching C(n,0)=1
- Add TOO_LARGE_ARRAY_SIZE guard (1M element cap) to prevent OOM
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

{
if (result > limit / (n - i))
return 0;
result = result * (n - i) / (i + 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overly aggressive overflow guard rejects valid combination counts

High Severity

The overflow guard result > limit / (n - i) in combinationCountCapped checks whether the intermediate product result * (n - i) exceeds limit, but the actual computation on the next line divides by (i + 1), which can bring the value back well within range. This causes the function to return 0 (interpreted as "too large") for many valid inputs. For example, C(20, 10) = 184,756 is well under the 1,000,000 limit, but at iteration i=7 the intermediate C(20,7) = 77,520 fails the check against 1,000,000 / 13 = 76,923, causing arrayCombinations(range(20), 10) to throw TOO_LARGE_ARRAY_SIZE incorrectly.

Fix in Cursor Fix in Web

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Mar 12, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 12, 2026

Workflow [PR], commit [50793bf]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-feature Pull request with new product feature label Mar 12, 2026
The overflow guard was too strict — checking before division caused
valid C(n,k) values to be rejected (e.g. C(20,10)=184756). Now uses
GCD cross-cancellation between numerator and denominator factors
to keep intermediates small. Added test for C(20,10) success case.
@alexey-milovidov
Copy link
Copy Markdown
Member

@Onyx2406, almost ready! Let's check the review comments.

Address review comments:

1. Size guard now caps total materialized elements (rows * k) instead
   of just the number of result rows. Previously, inputs like
   `arrayCombinations(range(20), 10)` produced 1.8M elements while
   passing the 1M limit check.

2. Replace O(k^2) GCD cross-cancellation in `combinationCountCapped`
   with an O(k) incremental formula using 128-bit intermediate
   arithmetic. This prevents pathological inputs from spending
   excessive time in the guard before rejection.

3. Add regression tests verifying the total-elements invariant:
   - `arrayCombinations(range(20), 10)` now correctly rejected
   - `arrayPermutations(range(9))` now correctly rejected (362880*9 > 1M)

4. Fix CRLF line endings in source and test files.
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 18, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.70% 83.80% +0.10%
Functions 23.90% 23.90% +0.00%
Branches 76.30% 76.30% +0.00%

PR changed lines: PR changed-lines coverage: 98.07% (254/259, 0 noise lines excluded)
Diff coverage report
Uncovered code

@Onyx2406
Copy link
Copy Markdown
Contributor Author

@alexey-milovidov is this good to merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Combinatorics functions (on arrays)

2 participants