Skip to content

feat(sql): rnd_symbol_zipfn() function - generates high cardinality symbols with Zipf/Power Law Distribution#6227

Merged
bluestreak01 merged 2 commits intomasterfrom
vi_another_zipf
Oct 6, 2025
Merged

feat(sql): rnd_symbol_zipfn() function - generates high cardinality symbols with Zipf/Power Law Distribution#6227
bluestreak01 merged 2 commits intomasterfrom
vi_another_zipf

Conversation

@bluestreak01
Copy link
Copy Markdown
Member

Useful in testing algorithms with high-cardinality symbols that have skewed distribution of rows.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 4, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Introduces a new random Zipf-distributed symbol function factory and a corresponding test suite. The factory parses and validates constant arguments (symbol count, alpha), integrates with execution context RNG, and generates symbols based on a precomputed Zipf cumulative distribution. Tests cover functionality, explain plans, skew behavior, and argument validation errors.

Changes

Cohort / File(s) Summary
Rnd Zipf symbol function
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java
Adds RndSymbolZipfNFunctionFactory with signature and newInstance. Implements a non-deterministic symbol function generating Zipf-distributed values using constant args (n, alpha), precomputes distribution, integrates with Rnd and symbol table, includes validation and plan representation.
Tests for Zipf symbol function
core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java
New test class extending AbstractFunctionFactoryTest. Verifies distribution characteristics across alphas and symbol counts, explain plan output, edge-case validation (arg count, zero/negative values), and factory wiring via getFunctionFactory().

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant SQL as SQL Parser/Planner
  participant Exec as Execution Context
  participant FF as RndSymbolZipfNFunctionFactory
  participant Func as Func (Zipf Symbol)
  participant RNG as Rnd

  User->>SQL: SELECT rnd_symbol_zipf(n, alpha)
  SQL->>FF: newInstance(args, configuration, Exec)
  FF->>FF: validate constants n>0, alpha>0
  FF->>Func: construct with precomputed Zipf CDF
  Note right of Func: Stores cumulative distribution and symbol table handle
  SQL-->>User: Plan with rnd_symbol_zipf(n, alpha)

  loop For each record
    Exec->>Func: getSymbol(TableRow)
    Func->>RNG: nextDouble()
    RNG-->>Func: r
    Func->>Func: binary search CDF to pick key
    Func-->>Exec: "sym<key>"
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • jerrinot

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title succinctly identifies the introduction of the rnd_symbol_zipfn() SQL function and its purpose of generating high-cardinality symbols following a Zipf (power law) distribution, accurately summarizing the main change in a clear and specific manner.
Description Check ✅ Passed The description explains the intended use of the new function for testing algorithms with high-cardinality symbols and skewed distributions, which aligns directly with the added rnd_symbol_zipfn functionality.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bluestreak01 bluestreak01 added the SQL Issues or changes relating to SQL execution label Oct 4, 2025
@bluestreak01
Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 4, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java (1)

71-75: Consider consistent exception handling for both parameters.

The symbol count retrieval (line 72) includes a try-catch for NumericException, while the alpha retrieval (line 86) does not. For consistency and defensive programming, consider wrapping both in try-catch blocks or confirming that getDouble() never throws exceptions that need handling.

Also applies to: 86-86

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d99c22 and 367426d.

📒 Files selected for processing (2)
  • core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
  • GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (10)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java (7)

103-123: LGTM!

The Zipf distribution calculation is mathematically correct. The formula p(k) = (1/k^alpha) / sum(1/i^alpha) properly implements the Zipf probability mass function, and the cumulative distribution is built correctly for inverse transform sampling.


193-202: LGTM!

The inverse transform sampling implementation is correct. The binary search with BIN_SEARCH_SCAN_UP finds the first index where the cumulative probability is >= the random value, and the clamping logic (line 201) properly handles floating-point edge cases.


130-144: LGTM!

The use of separate StringSink instances (sinkA and sinkB) is correct for scenarios where both symbol representations might be needed simultaneously or for thread safety.


66-90: Drop exception handling suggestion for getDouble()
getDouble() doesn’t throw NumericException and is used consistently without try-catch across all rnd function factories.

Likely an incorrect or invalid review comment.


48-51: LGTM!

The function signature correctly declares the parameter types (int, double) and matches the test invocations.


103-123: LGTM!

The Zipf distribution computation is mathematically correct. The normalization constant and cumulative probabilities are properly calculated. The O(n) initialization cost is acceptable since it's performed once during function instantiation.


193-202: LGTM!

The inverse transform sampling via binary search is correctly implemented. The handling of both exact matches (idx >= 0) and insertion points (idx < 0) properly maps uniform random values to Zipf-distributed symbol keys. The clamping at line 201 correctly handles the edge case where the random value approaches 1.0.

core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java (3)

34-213: LGTM!

The test suite provides excellent coverage:

  • Basic functionality tests with various alpha values (0.5, 1.0, 1.5, 2.0, 5.0)
  • Different symbol counts (2, 5, 1000)
  • Edge case validation (negative/zero arguments)
  • Explain plan verification
  • Distribution skew verification

The expected distribution characteristics correctly reflect Zipf/Power Law behavior where higher alpha values produce stronger concentration on the first symbols.


34-213: LGTM! Comprehensive test coverage.

The test suite effectively validates:

  • Multiple distribution shapes across varying alpha values (0.5, 1.5, 2.0, 5.0)
  • Symbol count ranges from 2 to 1000
  • Explain plan integration for different parameterizations
  • Edge cases (negative/zero alpha, negative/zero symbol count)

The deterministic assertions on exact counts are appropriate for a seeded RNG test framework.


232-234: LGTM! Correct factory wiring.

The getFunctionFactory() override correctly returns an instance of RndSymbolZipfNFunctionFactory, ensuring the test suite exercises the new function factory implementation.

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 59 / 80 (73.75%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java 48 67 71.64%
🔵 io/questdb/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactory.java 11 13 84.62%

@bluestreak01 bluestreak01 merged commit 4dcd824 into master Oct 6, 2025
35 checks passed
@bluestreak01 bluestreak01 deleted the vi_another_zipf branch October 6, 2025 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants