perf(sql): optimize ASOF JOIN for dense interleaving of left-hand and right-hand rows#6362
perf(sql): optimize ASOF JOIN for dense interleaving of left-hand and right-hand rows#6362bluestreak01 merged 72 commits intomasterfrom
Conversation
…sof-linear-symbol
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit please read PR description and review the PR. Pay attention to bugs, NPEs, off-by-ones, edge cases etc |
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Length of output: 2597 🏁 Script executed: Length of output: 15300 🏁 Script executed: Length of output: 5390 🏁 Script executed: Length of output: 6526 🏁 Script executed: Length of output: 4103 🏁 Script executed: Length of output: 1810 🏁 Script executed: Length of output: 1715 🏁 Script executed: Length of output: 34716 🏁 Script executed: Length of output: 40930 🏁 Script executed: Length of output: 2800 🏁 Script executed: Length of output: 11946 🏁 Script executed: Length of output: 281 🏁 Script executed: Length of output: 0 🏁 Script executed: Length of output: 0 🏁 Script executed: Length of output: 285 🏁 Script executed: Length of output: 43 🏁 Script executed: Length of output: 203 🏁 Script executed: Length of output: 128 🏁 Script executed: Length of output: 109 🏁 Script executed: Length of output: 405 🏁 Script executed: Length of output: 732 🏁 Script executed: Length of output: 56 🏁 Script executed: Length of output: 582 🏁 Script executed: Length of output: 618 🏁 Script executed: Length of output: 3432 🏁 Script executed: Length of output: 1049 🏁 Script executed: Length of output: 127 🏁 Script executed: Length of output: 4289 🏁 Script executed: Length of output: 271 @bluestreak01, I've completed a thorough review of the Dense ASOF JOIN implementation. Here are my findings: Critical Issues1. Bug in
|
Results of backward scan are reusable across cursor rewinds.
Yes. Both are equally bad, and aren't supposed to happen. The
The caller guarantees no nulls in array. |
|
Tip For best results, initiate chat on the files or code changes. @mtopolnik, thank you for the clarifications! Backward scan reusability: Understood - the backward scan results are intentionally cached and reusable across Assertion pattern: Makes sense - the Caller guarantees: Acknowledged - the All three points are intentional design decisions. The implementation looks solid! ✏️ Learnings added
|
[PR Coverage check]😍 pass : 368 / 389 (94.60%) file detail
|
| master, | ||
| slave, | ||
| keyTypes, | ||
| new SymbolKeyMappingRecordCopier(joinKeyMapping), |
There was a problem hiding this comment.
@mtopolnik this change broke consistency of the key types provided to the map and the actual Key calls: keyTypes contain [11] (single string key) while the actual copier is using Key#putInt() method.
The Dense ASOF JOIN algorithm is a variant of the Light algorithm:
This difference is highly important when the right-hand table has history that predates the first left-hand row. The entire history will be skipped, except for some recent history needed to find matches of initial few left-hand rows.
Let's use the diagram below to explain the key differences among algorithms. It shows two tables, LHS and RHS. LHS rows are less densely distributed over time than RHS rows, but not much less. We show the rows aligned on timestamp, so there are gaps in the LHS column. These gaps don't represent any LHS rows, it is just the way we visualize the two tables.
Light algo
Light algo uses a forward-only scan of the RHS table. When matching the first RHS symbol (row 6, symbol A), it starts from RHS row 1, and proceeds all the way to row 6, collecting all the symbols into a hashtable. When done, it looks up symbol A in the hashtable and finds the prevailing RHS row is row 4. When matching the next RHS symbol (row 9, symbol C), it resumes the forward scan, touching rows 7, 8 and 9. Then it looks up symbol C, and finds the prevailing row is row 2.
Fast algo
Fast algo uses binary search over RHS timestamps to zero in on row 6 as the most recent row not newer than the first LHS row. Then it scans backward: rows 6, 5, 4, and there it finds the matching symbol A. When matching the next LHS symbol (row 9, symbol C), it uses binary search to zero in on RHS row 9, then scans all the way back to row 2, where it finds symbol C.
When matching symbol A in row LHS row 15, it uses binary search to zero in or RHS row 15, then scans backward, again all the way back to row 4.
There's also an optimization that avoids the fixed cost of binary search by first searching linearly for the matching timestamp in the RHS row, for a smallish number of steps. This doesn't affect the backward search for the symbol.
Memoized algo
The Memoized algo is a variant of the Fast algo. It uses the exact same linear/binary search to find the matching timestamp in the RHS, and then uses the same backward search for the symbol. However, it memorizes for each symbol where it started the backward search, and where it found it.
In our example, this means it handles the first LHS row (6) exactly the same way, scanning backward to row 4. But when it encounters the same symbol A in row 15, it scans backward only until reaching row 6, and then directly uses the remembered result of the previous scan, and matches up with row 4.
With Drive-By caching enabled, Memoized algo will memorize not just the symbol it's looking for, but also any other symbol. However, it can only memorize it on the first encounter. This is valuable for rare symbols that occur deep in the past, but otherwise it just introduces more overhead.
Dense algo
The Dense algo starts like the Fast algo, performing a binary search to zero in on RHS row 6 and searching backward to find symbol A in row 4 of RHS. From then on, it behaves more like the Light algo.
To match up LHS row 9 (symbol C), it first does a linear scan forward from row 6 to row 9 (exactly like the Light algo). Since it didn't find C in this scan, it resumes the backward scan, touching rows 3 and 2, and there it finds the symbol C.
At LHS row 12 (symbol B), it resumes the forward scan, touching rows 10, 11, and 12. Then it finds symbol B in the hashtable, getting row 8 as the prevailing row. No backward scan nedeed here.
At LHS row 15 (symbol A), it resumes the forward scan, touching rows 13, 14, and 15. Then it looks up symbol A in the hashtable of the forward scan, finding nothing. Then it looks up symbol A in the hashtable of the backward scan, and finds it there. The prevailing row is number 4. Again, no backward search was needed.
Discussion
We can see that the Fast and Memoized algos had to touch the most rows. Especially, when matching row 15, Fast algo had to scan backward to row 4, and Memoized did only slighly better, scanning until row 6.
Light algo had to initially scan all the history (rows 1 to 6), but from then on, it only needed to touch the additional rows that came into scope as the LHS timestamp was moving on.
Dense algo had the same advantage as Light, but it didn't have to scan all the history. It scanned only as far back into history as needed to find the most recent occurence of a symbol not yet seen in the forward scan.
Additional changes in the PR
The PR also optimizes symbol-to-symbol joins for the existing Light and Fast algos. It works through the
RecordSink, which decides what data to copy from the table row to a buffer for comparison. Instead of copying the symbol string, it puts just the symbol key, after mapping the left-hand to the right-hand symbol key.After applying this to Light cursor, there was no more need for a dedicated Single Symbol Light cursor, so the PR removes it. There's still some performance gap between writing a dedicated cursor that directly works with symbol keys, but it's a 30% difference and it isn't worth it for the Light cursor.
The Dense cursor has two implementations, one specialized for symbol-to-symbol comparison. This cursor could be critical to the performance of markout analysis, so I thought it's worth it. The implementation works through an abstract class, avoiding code duplication.
Benchmarking
Measurements taken on
r7a.4xlarge.Tables
Trades: 167.5 million rows, Jan 2, 00:00 to Jan 2, 08:00
Prices: 1.01 billion rows, Jan 1, 16:00 to Jan 2, 08:00
Time period of prices is a 50/50 split between history and overlap with trades.
Query
To avoid timeouts, we limit trades to 10 million rows for most measurements. In the end we do Asof Dense with the full dataset.
Fast (Default)
59 seconds
Memoized
42 seconds
Memoized with Drive-By Caching
63 seconds
Light
9.7 seconds
Dense Single Symbol
730 milliseconds
Dense Single Symbol, Full Dataset
The full dataset is 16.7 times larger than the limited one used above.
10.9 seconds
Bonus: Light with no history
Here we remove the history part of
pricestable, so it covers only the time period present intrades. This bringsLightandDenseto an equal footing in terms of algorithm. The remaining difference is the additional single-symbol specialization inDensealgo.1.17 seconds