feat(sql): rnd_symbol_zipfn() function - generates high cardinality symbols with Zipf/Power Law Distribution#6227
Conversation
…ymbols with Zipf/Power Law Distribution
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughIntroduces a new random Zipf-distributed symbol function factory and a corresponding test suite. The factory parses and validates constant arguments (symbol count, alpha), integrates with execution context RNG, and generates symbols based on a precomputed Zipf cumulative distribution. Tests cover functionality, explain plans, skew behavior, and argument validation errors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor User
participant SQL as SQL Parser/Planner
participant Exec as Execution Context
participant FF as RndSymbolZipfNFunctionFactory
participant Func as Func (Zipf Symbol)
participant RNG as Rnd
User->>SQL: SELECT rnd_symbol_zipf(n, alpha)
SQL->>FF: newInstance(args, configuration, Exec)
FF->>FF: validate constants n>0, alpha>0
FF->>Func: construct with precomputed Zipf CDF
Note right of Func: Stores cumulative distribution and symbol table handle
SQL-->>User: Plan with rnd_symbol_zipf(n, alpha)
loop For each record
Exec->>Func: getSymbol(TableRow)
Func->>RNG: nextDouble()
RNG-->>Func: r
Func->>Func: binary search CDF to pick key
Func-->>Exec: "sym<key>"
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java (1)
71-75: Consider consistent exception handling for both parameters.The symbol count retrieval (line 72) includes a try-catch for
NumericException, while the alpha retrieval (line 86) does not. For consistency and defensive programming, consider wrapping both in try-catch blocks or confirming thatgetDouble()never throws exceptions that need handling.Also applies to: 86-86
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (Check Changes Check changes)
🔇 Additional comments (10)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java (7)
103-123: LGTM!The Zipf distribution calculation is mathematically correct. The formula
p(k) = (1/k^alpha) / sum(1/i^alpha)properly implements the Zipf probability mass function, and the cumulative distribution is built correctly for inverse transform sampling.
193-202: LGTM!The inverse transform sampling implementation is correct. The binary search with
BIN_SEARCH_SCAN_UPfinds the first index where the cumulative probability is >= the random value, and the clamping logic (line 201) properly handles floating-point edge cases.
130-144: LGTM!The use of separate
StringSinkinstances (sinkAandsinkB) is correct for scenarios where both symbol representations might be needed simultaneously or for thread safety.
66-90: Drop exception handling suggestion for getDouble()
getDouble() doesn’t throw NumericException and is used consistently without try-catch across all rnd function factories.Likely an incorrect or invalid review comment.
48-51: LGTM!The function signature correctly declares the parameter types (int, double) and matches the test invocations.
103-123: LGTM!The Zipf distribution computation is mathematically correct. The normalization constant and cumulative probabilities are properly calculated. The O(n) initialization cost is acceptable since it's performed once during function instantiation.
193-202: LGTM!The inverse transform sampling via binary search is correctly implemented. The handling of both exact matches (idx >= 0) and insertion points (idx < 0) properly maps uniform random values to Zipf-distributed symbol keys. The clamping at line 201 correctly handles the edge case where the random value approaches 1.0.
core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java (3)
34-213: LGTM!The test suite provides excellent coverage:
- Basic functionality tests with various alpha values (0.5, 1.0, 1.5, 2.0, 5.0)
- Different symbol counts (2, 5, 1000)
- Edge case validation (negative/zero arguments)
- Explain plan verification
- Distribution skew verification
The expected distribution characteristics correctly reflect Zipf/Power Law behavior where higher alpha values produce stronger concentration on the first symbols.
34-213: LGTM! Comprehensive test coverage.The test suite effectively validates:
- Multiple distribution shapes across varying alpha values (0.5, 1.5, 2.0, 5.0)
- Symbol count ranges from 2 to 1000
- Explain plan integration for different parameterizations
- Edge cases (negative/zero alpha, negative/zero symbol count)
The deterministic assertions on exact counts are appropriate for a seeded RNG test framework.
232-234: LGTM! Correct factory wiring.The
getFunctionFactory()override correctly returns an instance ofRndSymbolZipfNFunctionFactory, ensuring the test suite exercises the new function factory implementation.
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java
Show resolved
Hide resolved
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactory.java
Show resolved
Hide resolved
...est/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java
Show resolved
Hide resolved
...est/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfNFunctionFactoryTest.java
Outdated
Show resolved
Hide resolved
[PR Coverage check]😍 pass : 59 / 80 (73.75%) file detail
|
Useful in testing algorithms with high-cardinality symbols that have skewed distribution of rows.