feat(sql): rnd_symbol_zipf() function for generating skewed symbol distributions #6217
feat(sql): rnd_symbol_zipf() function for generating skewed symbol distributions #6217bluestreak01 merged 4 commits intomasterfrom
rnd_symbol_zipf() function for generating skewed symbol distributions #6217Conversation
WalkthroughAdds a new Zipf-based random symbol function factory to the SQL engine and corresponding tests. Minor/no-op repository housekeeping change in .gitignore-like area reported. No changes to exported APIs beyond the new factory. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant SQL as SQL Parser/Planner
participant FF as RndSymbolZipfFunctionFactory
participant Ctx as SqlExecutionContext
participant Func as Zipf Symbol Function
participant Rnd as Rnd (seeded)
SQL->>FF: newInstance(args=[symbols..., alpha], positions, config, Ctx)
FF->>FF: Validate constants and alpha > 0
FF->>Func: Construct Func(symbols, alpha, cumulative probs)
Func->>Ctx: getRandom()
Ctx-->>Func: Rnd instance
Func->>Rnd: seed if needed
SQL-->>Func: Function instance returned
loop For each row evaluation
SQL->>Func: getSymbol(row)
Func->>Rnd: nextDouble()
Func->>Func: linear search over CDF to pick index
Func-->>SQL: selected symbol
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactory.java (1)
183-191: Use an actual binary search for cumulative samplingThe comment promises a binary search, yet we linearly scan every symbol. With large symbol lists this turns the sampler into an O(n) hotspot. Switching to
Arrays.binarySearchkeeps the cumulative-array approach but restores logarithmic behaviour.+import java.util.Arrays; @@ - for (int i = 0; i < count; i++) { - if (u <= cumulativeProbabilities[i]) { - return i; - } - } - - // Fallback to last element (should not happen due to cumulative[count-1] == 1.0) - return count - 1; + int idx = Arrays.binarySearch(cumulativeProbabilities, u); + if (idx >= 0) { + return idx; + } + idx = -idx - 1; + return idx < count ? idx : count - 1;core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactoryTest.java (1)
90-96: Soften the high-alpha assertion to avoid brittle failuresThese exact counts assume the RNG sequence never changes. A tiny tweak in Zipf normalisation (or RNG implementation) will make this test noisy. Consider asserting that
'1AAPL'dominates and the remaining symbols collectively stay under a small threshold instead of pinning exact totals. Keeps the coverage while avoiding future flakes.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.gitignore(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactory.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactoryTest.java(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: New pull request (Coverage Report Coverage Report)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (Check Changes Check changes)
core/src/main/java/io/questdb/griffin/engine/functions/rnd/RndSymbolZipfFunctionFactory.java
Outdated
Show resolved
Hide resolved
[PR Coverage check]😍 pass : 106 / 127 (83.46%) file detail
|
Generates Zipf/Power Law Distribution of symbols from the list. The probability of symbol decays from left to right with given 'alpha' velocity.
Usage:
Also added weighted, but more verbose function: