Skip to content

perf(sql): speed up ASOF and WINDOW JOINs for large right-hand-side tables#6822

Merged
bluestreak01 merged 5 commits intomasterfrom
puzpuzpuz_seek_estimate_in_asof
Mar 3, 2026
Merged

perf(sql): speed up ASOF and WINDOW JOINs for large right-hand-side tables#6822
bluestreak01 merged 5 commits intomasterfrom
puzpuzpuz_seek_estimate_in_asof

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

The optimization mirrors what HORIZON JOIN already does in HorizonJoinTimeFrameHelper (added in #6814). Without it, the first lookup linearly scans through all slave time frames preceding the master's first timestamp, which is O(N) in the number of frames. With seekEstimate(), the initial positioning is O(log P) where P is the number of partitions.

Implementation

  • AbstractAsOfJoinFastRecordCursor.openSlaveFrame() now calls TimeFrameCursor.seekEstimate() on the first slave frame lookup (when slaveFrameIndex == -1) to binary-search directly to the target partition instead of linearly scanning all preceding frames. This benefits all ASOF JOIN and LT JOIN fast-path factories.
  • WindowJoinTimeFrameHelper.findRowLo() and findRowLoWithPrevailing() now call seekEstimate() on the first lookup (when bookmarkedFrameIndex == -1) with the same partition-skipping behavior. When prevailing is enabled, the seeked frame's last row is recorded as the prevailing candidate. This benefits both sync and async WINDOW JOIN factories.

@puzpuzpuz puzpuzpuz self-assigned this Feb 25, 2026
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Feb 25, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 25, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch puzpuzpuz_seek_estimate_in_asof

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit pls review this PR critically

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 2, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Fetch PR details
gh pr view 6822 --json title,body,files,additions,deletions

# Get the modified files
echo "=== AbstractAsOfJoinFastRecordCursor.java ==="
cat core/src/main/java/io/questdb/griffin/engine/join/AbstractAsOfJoinFastRecordCursor.java

echo -e "\n=== WindowJoinTimeFrameHelper.java ==="
cat core/src/main/java/io/questdb/griffin/engine/join/WindowJoinTimeFrameHelper.java

# Check for tests related to these changes
echo -e "\n=== Finding related test files ==="
fd -e java -x echo {} \; | rg -i "asof.*join.*test|window.*join.*test" || echo "Search completed"

# Look for TimeFrameCursor interface to understand seekEstimate contract
echo -e "\n=== TimeFrameCursor interface ==="
fd TimeFrameCursor.java -x cat {}

Length of output: 56312


🏁 Script executed:

#!/bin/bash

# Get the test files to check coverage
echo "=== AsOfJoinTest.java new tests ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinTest.java | head -200

echo -e "\n=== AsOfJoinNoKeyTest.java new tests ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinNoKeyTest.java | head -200

echo -e "\n=== WindowJoinTest.java new tests ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/WindowJoinTest.java | head -200

# Check TimeFrameCursorImpl to understand seekEstimate implementation
echo -e "\n=== TimeFrameCursorImpl seekEstimate implementation ==="
ast-grep --pattern 'seekEstimate($$$) {
  $$$
}'

# Look for partition ceiling calculation
echo -e "\n=== Searching for partition ceiling/estimate logic ==="
rg -n "estimatePartitionHi|getTimestampEstimate" --type=java -A 3 -B 1

Length of output: 13782


🏁 Script executed:

#!/bin/bash

# Get full test additions to check coverage
echo "=== Full AsOfJoinTest.java additions ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinTest.java

echo -e "\n=== Full AsOfJoinNoKeyTest.java additions ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinNoKeyTest.java

echo -e "\n=== WindowJoinTest.java additions (first 300 lines) ==="
git diff HEAD~1 HEAD -- core/src/test/java/io/questdb/test/griffin/engine/join/WindowJoinTest.java | head -300

# Get TimeFrameCursorImpl seekEstimate implementation
echo -e "\n=== TimeFrameCursorImpl.java seekEstimate ==="
cat core/src/main/java/io/questdb/griffin/engine/table/TimeFrameCursorImpl.java | grep -A 30 "public void seekEstimate"

# Check ConcurrentTimeFrameCursorImpl seekEstimate
echo -e "\n=== ConcurrentTimeFrameCursorImpl.java seekEstimate ==="
cat core/src/main/java/io/questdb/griffin/engine/table/ConcurrentTimeFrameCursorImpl.java | grep -A 30 "public void seekEstimate"

# Look for timestamp scale/conversion patterns in tests
echo -e "\n=== Checking for timestamp scale tests ==="
rg -n "TimestampType|MICROS|NANOS" core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinTest.java core/src/test/java/io/questdb/test/griffin/engine/join/AsOfJoinNoKeyTest.java core/src/test/java/io/questdb/test/griffin/WindowJoinTest.java 2>/dev/null | head -50

Length of output: 12741


🏁 Script executed:

#!/bin/bash

# Check the actual test files to see if they were added in this PR
echo "=== Checking git status for test files ==="
git log --oneline -1 --name-status

# Look at findSeekEstimate static method implementation
echo -e "\n=== TimeFrameCursor.findSeekEstimate static method ==="
rg -n "static void findSeekEstimate" -A 30 --type=java

# Check for timestamp conversion edge cases
echo -e "\n=== Checking timestamp conversion in AbstractAsOfJoinFastRecordCursor ==="
cat core/src/main/java/io/questdb/griffin/engine/join/AbstractAsOfJoinFastRecordCursor.java | grep -B 5 -A 10 "nativeTimestamp"

# Look for similar patterns in WindowJoinTimeFrameHelper
echo -e "\n=== Checking timestamp conversion in WindowJoinTimeFrameHelper ==="
cat core/src/main/java/io/questdb/griffin/engine/join/WindowJoinTimeFrameHelper.java | grep -B 5 -A 10 "nativeTimestamp"

# Check the partition index calculation more carefully
echo -e "\n=== Full populatePartitionTimestamps method ==="
rg -n "populatePartitionTimestamps" -A 20 --type=java

# Look at how HORIZON JOIN does it
echo -e "\n=== HorizonJoinTimeFrameHelper seekEstimate usage ==="
rg -n "seekEstimate" core/src/main/java/io/questdb/griffin/engine/table/HorizonJoinTimeFrameHelper.java -B 5 -A 10

Length of output: 35889


@bluestreak01, I've identified three critical issues in this optimization that could cause incorrect query results:

1. Timestamp Conversion Precision Loss (High Severity)

Lines 289 (AbstractAsOfJoinFastRecordCursor.java) and 128+245 (WindowJoinTimeFrameHelper.java):

final long nativeTimestamp = slaveTimestampScale == 1 ? masterTimestamp : masterTimestamp / slaveTimestampScale;

Integer division discards precision when converting nano→micro timestamps. For example:

  • Master timestamp: 123456789 nanos
  • After division by 1000: 123456 micros
  • Lost precision: 789 nanos

Impact: When master uses TIMESTAMP (nanos) and slave uses TIMESTAMP_MICRO, seekEstimate() positions at a partition earlier than necessary. The subsequent linear scan must traverse extra frames, degrading the O(log P) optimization back toward O(N).

Fix: Use proper rounding or ceiling instead of truncating division:

final long nativeTimestamp = slaveTimestampScale == 1 ? masterTimestamp 
    : (masterTimestamp + slaveTimestampScale - 1) / slaveTimestampScale;

2. Incorrect Prevailing Assumption After seekEstimate (Critical - Correctness Bug)

Lines 129-135 (WindowJoinTimeFrameHelper.findRowLo) and 246-252 (findRowLoWithPrevailing):

if (recordPrevailing && timeFrame.getFrameIndex() >= 0 && timeFrameCursor.open() > 0) {
    // The seeked frame has ceiling <= timestampLo, so all its rows
    // are < timestampLo. Record its last row as prevailing candidate.
    prevailingFrameIndex = timeFrame.getFrameIndex();
    prevailingRowIndex = timeFrame.getRowHi() - 1;
}

The binary search in findSeekEstimate returns the last frame where partitionCeilings[partIdx] <= timestamp, meaning the ceiling can equal timestampLo. When ceiling == timestampLo, rows in that frame may have exact timestamp matches (timestamp == timestampLo), not just < timestampLo.

Impact: Recording such rows as "prevailing" (i.e., < timestampLo) causes WINDOW JOIN to skip exact boundary matches, returning incorrect results when the join interval starts precisely at a partition ceiling.

Fix: After seekEstimate(), verify the actual row timestamps before recording prevailing:

if (recordPrevailing && timeFrame.getFrameIndex() >= 0 && timeFrameCursor.open() > 0) {
    long lastRowTs = scaleTimestamp(/* read last row timestamp */, scale);
    if (lastRowTs < timestampLo) {
        prevailingFrameIndex = timeFrame.getFrameIndex();
        prevailingRowIndex = timeFrame.getRowHi() - 1;
    }
}

3. Off-By-One in maxTsHi Calculation (Medium Severity)

Line 61 (ConcurrentTimeFrameCursor.java):

final long maxTsHi = i < partitionCount - 2 ? reader.getPartitionTimestampByIndex(i + 1) : Long.MAX_VALUE;

For a table with 3 partitions (indices 0, 1, 2):

  • i=0: 0 < 1 → use partition[1] timestamp ✓
  • i=1: 1 < 1 → FALSE → use Long.MAX_VALUE ✗ (should use partition[2] timestamp)
  • i=2: 2 < 1 → FALSE → use Long.MAX_VALUE

Impact: The second-to-last partition gets Long.MAX_VALUE as its ceiling instead of the actual next partition's timestamp. This artificially inflates the estimated ceiling, causing seekEstimate() to position incorrectly and forcing unnecessary frame scans.

Fix: Change condition to i < partitionCount - 1:

final long maxTsHi = i < partitionCount - 1 ? reader.getPartitionTimestampByIndex(i + 1) : Long.MAX_VALUE;

Missing Test Coverage

The git diff shows no test file additions, yet the PR description claims 199 lines added. The parameterized tests run with mixed MICRO/NANO timestamp types, but I don't see specific coverage for:

  1. Exact timestamp match on partition boundary with prevailing enabled
  2. Nano→micro conversion with non-divisible timestamps (e.g., ending in non-zero nanos)
  3. Second-to-last partition positioning with large RHS tables
  4. seekEstimate path vs. linear scan path comparison (to verify optimization correctness)

Request: Add targeted unit tests demonstrating these edge cases work correctly after the fixes above.


These issues undermine both the correctness and performance goals of the optimization. Issue #2 is particularly concerning as it can silently produce wrong results. Can you confirm whether the existing test suite caught any of these, or if they're latent bugs?


🧠 Learnings used
Learnt from: javier
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2026-02-06T18:22:28.130Z
Learning: In questdb/questdb, for PRs requested by javier, provide line-cited, push-back style reviews with concrete follow-ups (tests, diffs, and performance considerations).

Learnt from: javier
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2026-02-06T18:22:28.130Z
Learning: For questdb/questdb, javier expects a rigorous, push-back style review; always include concrete line-cited findings and propose targeted tests/fixes when reviewing javier’s PRs.

Learnt from: jerrinot
Repo: questdb/questdb PR: 0
File: :0-0
Timestamp: 2026-02-04T11:18:36.492Z
Learning: For PRs by jerrinot, always use a thorough and critical review mode: question design decisions, challenge assumptions, scrutinize implementations for edge cases and potential issues, and avoid superficial validation. They prefer hard-core reviews that push back on the code.

Learnt from: bluestreak01
Repo: questdb/questdb PR: 6673
File: core/src/test/java/io/questdb/test/griffin/WindowJoinTest.java:2616-2618
Timestamp: 2026-01-19T21:43:19.228Z
Learning: In test `testWindowJoinFailsWhenSlaveDoesNotSupportTimeFrames` in `core/src/test/java/io/questdb/test/griffin/WindowJoinTest.java`, the use of `AND` between timestamp equality conditions (e.g., `ts = 'X' AND ts = 'Y'`) is intentional to prevent intrinsic interval creation. This ensures the window join fails as expected when the slave doesn't support time frames, which is necessary after PR `#6673` added OR timestamp intrinsics recognition.

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@bluestreak01, I've identified three critical issues in this optimization that could cause incorrect query results

@coderabbitai please find responses to your comments below.

Claim 1: "Timestamp Conversion Precision Loss"

Invalid. seekEstimate is a coarse positioning heuristic — it finds the last frame whose partition ceiling is <= timestamp. Truncating the timestamp downward via integer division is safe: it may cause seekEstimate to land one frame earlier than optimal, but the subsequent next() loop catches up. This costs at most one extra frame scan — negligible.

The suggested fix (ceiling division) would be actively harmful: rounding up could cause seekEstimate to jump past the correct frame. The next() call after it would move even further forward, and the ASOF join's openSlaveFrame() backward path isn't designed to recover from overshooting.

The same truncating-division pattern is already used in HorizonJoinTimeFrameHelper.findAsOfRow() (line 308).

Claim 2: "Incorrect Prevailing Assumption After seekEstimate"

Invalid. This misunderstands partition ceilings.

Partition ceilings are exclusive upper bounds. estimatePartitionHi() computes min(ceil(partitionTimestamp), nextPartitionTimestamp). For example, for a DAY-partitioned table with partition timestamp 2024-01-01T00:00:00, the ceiling is 2024-01-02T00:00:00. All rows in that partition have timestamps in [2024-01-01, 2024-01-02) — strictly less than the ceiling.

So when seekEstimate finds a frame with ceiling <= timestampLo, all rows in that frame have timestamps < ceiling <= timestampLo. Recording the last row as a prevailing candidate (meaning < timestampLo) is correct.

Claim 3: "Off-By-One in maxTsHi Calculation"

Pre-existing code, not introduced by this PR.

The last two partitions get Long.MAX_VALUE as maxTsHi instead of just the last one. For non-split partitions, ceil(partitionTimestamp) is already correct, so the missing cap has no effect. For split partitions at the second-to-last position, the ceiling could be slightly inflated, but seekEstimate results are always validated against actual row timestamps after open(), so this causes at most one extra frame scan.

If worth fixing, it's a one-character change (- 2- 1) and should be a separate PR.

Claim 4: "Missing Test Coverage"

Wrong. The diff includes 199 lines of test additions across three files: AsOfJoinTest.java (+55), AsOfJoinNoKeyTest.java (+36), WindowJoinTest.java (+108). The coverage report confirms 100% coverage of new lines.

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

Suggestion 3 is addressed in f2af0eb

@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 14 / 14 (100.00%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/join/WindowJoinTimeFrameHelper.java 10 10 100.00%
🔵 io/questdb/griffin/engine/join/AbstractAsOfJoinFastRecordCursor.java 3 3 100.00%
🔵 io/questdb/griffin/engine/table/ConcurrentTimeFrameCursor.java 1 1 100.00%

@bluestreak01 bluestreak01 merged commit 17e8b2a into master Mar 3, 2026
50 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_seek_estimate_in_asof branch March 3, 2026 15:08
maciulis pushed a commit to maciulis/questdb that referenced this pull request Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants